Joe Cheng @ Posit | You have to be able to reason about it | Data Science Hangout

Transcript

This transcript was generated automatically and may contain errors.

So happy Thursday, everybody. Welcome to the Data Science Hangout. Hope everyone's having a great week. If we haven't met before, I'm Rachel Dempsey. I lead our pro community at Posit. If this is your first time ever joining us at a Data Science Hangout, welcome. This is our open space to chat about data science leadership, questions you're facing, and getting to hear about what's going on in the world of data across many different industries and companies.

And so we're here every Thursday, except for July, at the same time, same place. So if you're watching this recording on YouTube later, you can also add these events to your own calendar using the link below. But together, we're all dedicated to making this a welcoming environment for everyone. So we'd love to hear from everybody, no matter your level of experience, the area of work, or industry. It is also totally okay to just listen in and just hang out here with us.

There's also three ways you can jump in and either ask questions or provide your own perspective on certain topics too. So you can always jump in by raising your hand on Zoom and I'll keep an eye out. You can put questions in the Zoom chat. And feel free to just put a little star next to it if you want me to read it out loud instead, if you're maybe in a busy coffee shop or something. And then third, we also have a Slido link where you can ask questions anonymously.

But I am so excited to have my colleague, Joe Cheng, joining us here today as our featured leader. Joe is the CTO and first employee at Posit, well, back then RStudio, where he helped create the RStudio IDE and Shiny web framework along with countless complementary tools and packages. And Joe, I'd love to get things kicked off here by having you also introduce yourself and just kind of share a little bit about your role and what it means to be CTO at Posit.

Yeah, thanks for having me. CTO is actually an honorary title. There are a lot of different kinds of CTOs in the industry. And one of them is basically an engineer that joined very early and it's an honorary title. And that's certainly what it started out as for me. It started out as purely an acknowledgement that there are strategic conversations that the board and the executive team would like me to be part of. But my day-to-day was still 100% software engineer and then team lead.

And these days, I think probably 90 plus percent of my time is as a team lead on the Shiny team. It's a team of about 11. And every once in a while I'll do more CTO stuff, which is having conversations that are more sort of at the strategic level with the other execs to talk about the direction that we're taking the company in. So for my day-to-day, I try to do, you know, 50% coding and 50% leading the team. That usually is more like 15% coding and 85% leading the team. There's always emergencies and, you know, last-minute conversations that come up.

So I do a lot of code reviews. I do a ton of sort of helping to decide feature directions. I do some talking with customers. I do some support. Just really anything that comes up. And we try to run our team that way for the most part. Most of our roles are relatively loose and flexible. So we really value having everyone on the team, or certainly all the software engineers, to be really thoughtful about what kind of features would be interesting to add to Shiny, to be able to design their own features to implement and to be responsible for the quality and support of their own features.

So yes, for me, it's Shiny all the time. Shiny day and night. Shiny for R. Shiny for Python. Packages that complement Shiny. HTML widgets. For the last month, I've been working on sort of the equivalent to the DT package for R, which is a way to view data frames and data tables very quickly with a rich HTML interface. I've been working on that for Shiny for Python, where we have not had that capability in the past.

But yeah, for sure, most of my time is in meetings and helping other team members achieve what they need to.

Thanks, Joe. And I forgot to ask you this, but I usually ask everybody, what's something that you like to do outside of work too?

For me, that's cycling. I don't do it nearly as much as I'd like to, but that's my favorite way to blow off some steam after a long day.

And the thing that scared me 15 years ago was, what if all of my career contributions through luck or happenstance add up to not very much? So in our little corner of the world that is data science, Shiny has had a bigger impact than I was ever hoping for out of my career.

On being vulnerable on stage

Thank you, Joe. And I saw it and thank you for the question, Libby. I see Lisa said, please answer this without making us tear up again like you did at conf. And it was just reminding me like I love listening to you present, Joe, and how you tell stories and how you can be vulnerable and put emotion into your presentations. And I was just wondering, how do you learn that? How did you learn how to get up on stage and make us feel all those emotions that we all did at RStudio conference?

You know, it's funny, I can tell you exactly how it happened. I mean, I'd given a lot of presentations before that one, but that was the first one where I was like, what if I made this like 90% about the technology, but 10%, you know, share something that cost me a little bit.

And it was actually because I attended a local church. And there was this pastor who got up, he and his wife, and they, in front of all these people and being live streamed, talked about how he had cheated on her in the first year of their marriage. And I mean, it had been, I don't know, 20, 30 years, but still, like, that the two of them would get up and talk about the most painful thing, the most painful betrayal that he ever inflicted on her, that they would both get up and talk about it. I was like, holy crap, like, the courage that it takes to do that. And I just got so much value out of that, because who shares that kind of stuff, you know, live in person.

So I'm not trying to compare me saying, oh, I was feeling a little bit bad about my career to that. But I was thinking, like, this energy is so, like, it was so meaningful, you know, like, it was such a beautiful moment. And there was just no, like, pure head insight that could compare. And it so like, it made me feel close to them that they were sharing something that was so close to their, you know, I don't know, so vulnerable.

So yeah, I brought it up with my speaking coach. She was a little bit like, wait, why are you gonna do this? And I was like, I don't know, like, I just really love this community. And I feel like, you know, whenever we like the best parts of this community are not, it's not about the technical exchange. It's when we do have those opportunities to get real with each other. Like when people talk about imposter syndrome, from the stage, it, those are some of the most powerful moments. And I felt like, like, what if I did that from a keynote, because that story is true, you know, like that, that, that was not something I had to conjure up. I think about that moment all the time.

So yeah, I thought, oh, and actually, I have to give credit to Jesse Mostipak. I told that story to her. And she was like, get out of town. Like, as soon as you said, like, that was supposed to be my last day at RStudio, I was like, tell me the rest of this story. So I have to give her, I think, most of the credit for recognizing that that story could be one of those moments.

Debugging Shiny apps

Dan, I see you put a question there about how you're starting to develop in Shiny coming from Python. Do you want to jump in here with that?

Yeah, sure. Sure, Rachel. Hi, Joe. I am, you know, more of a Python developer, although I started years and years ago, many blue moons ago, with R and really fell in love with it, and I'm kind of coming back to it a little bit and trying to develop with Shiny. One frustration I have is just around debugging. And when you've got, you know, all these panes and windows and UI widgets, can you share your thoughts around best practices, tips, whatever, on the best way to kind of go through the debugging process?

Yeah, absolutely. And can you just clarify, when you say debugging, you mean in a more general sense, not just using an interactive debugger?

Yeah, correct. Like, I do everything in RStudio at the moment. And I run the app, and all of a sudden it just opens up and then closes. Or I run something and it spews an error. And, you know, it's not always obvious where the error actually is. It could be on the output side, it could be on the server side. And just maybe the pattern for debugging in R Shiny is just so different from what I'm used to.

That could be it too. Yeah, it absolutely is. Well, especially if your work in Python was around, you know, like notebooks and scripts?

Yeah, you got it. Yeah. Jupyter notebooks, scripts, that kind of thing.

Yeah, absolutely. So I think for those of you who have not used Shiny, especially, when you are doing sort of normal analysis with a notebook or a script on your desktop, your code generally executes from beginning to end, right? Like you might have some functions, but those functions are called when you call them. And you have loops, but they loop when you loop. It's very transparent, like where the interpreter is at any given moment.

With Shiny, and almost every other form of interactive framework, you provide code to the framework, and the framework decides when it executes. So it makes it much more, much less intuitive to debug, because like, okay, this code is executing. Why is it executing? Like, how did it get here? It's much less transparent when code is executing, why it's executing. So there's a couple things that I'll tell you from a debugging perspective.
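As a rough illustration of "you provide code to the framework, and the framework decides when it executes," here is a minimal Python sketch. It is not Shiny's API: `TinyFramework`, `on_change`, and `set_input` are hypothetical names invented for this example, and real reactive frameworks are considerably more involved.

```python
from typing import Callable, Dict, List


class TinyFramework:
    """Toy framework (hypothetical, not Shiny): stores callbacks and decides when to run them."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[int], None]]] = {}
        self.log: List[str] = []

    def on_change(self, input_name: str, handler: Callable[[int], None]) -> None:
        # Registering a handler stores it for later; nothing executes yet.
        self._handlers.setdefault(input_name, []).append(handler)

    def set_input(self, input_name: str, value: int) -> None:
        # Simulates a user interaction; only now does the framework call the handlers.
        for handler in self._handlers.get(input_name, []):
            self.log.append(f"calling {handler.__name__} because {input_name} changed")
            handler(value)


results: List[int] = []


def double_it(value: int) -> None:
    results.append(value * 2)


app = TinyFramework()
app.on_change("slider", double_it)  # hand the code to the framework
count_after_registering = len(results)  # still 0: registering is not executing
app.set_input("slider", 21)  # the framework decides to run double_it here
print(results, app.log)
```

The `log` list is the kind of record that answers "why is this code executing?", which a top-to-bottom script never needs.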

Number one, I think that Hadley's book, Mastering Shiny, has a chapter on this. And we definitely have at least a couple of video resources where we've given talks about this. So I'd highly recommend those. But I have a couple of sort of principles that I always go back to as well.

Number one, and maybe this less applies to you if you're coming from Python, but stack traces are very important. So when you... Often when you get an error in Shiny, you get a big bunch of lines spit out at the console that say, you know, from this file, this function name, from this file, this function name, this file, this function name. And they often are like multiple lines. And often people find that scary, because a lot of the function names that are in there are function names that they might not recognize, because it's not code that they wrote that's executing. It's code that's inside of a package that's inside of a package that's, you know, calling base R, you know.

So it's very important to not be afraid of that stack trace. It is very important to take the time to learn how to read those and understand exactly what's going on. And what the stack trace is telling you is that at the moment that this error occurred, what were the functions in order that had been called? So if you think about, you know, at any given moment when code is executing for R, that code is probably in a function. And that function was called by another function. And that call was called by another function. And so on and so forth. And it could be 20 levels deep. That is called your call stack. And when an error occurs, one of the most useful things to know is what was the call stack. And when that's printed to the console, now it's called a stack trace.

So you can take that stack trace and you can read up it and just ignore the lines that you don't recognize and find the line that you do recognize that says this was the line of code in your app.R file. It was line 49. That was the thing that was executing when this error happened. And I think that, you know, for the easy cases where, let's say, I don't know, you use the wrong function name or you indexed into a data frame using an invalid variable or something like that, just understanding the stack trace, that will be enough for you to fix the problem.
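The same habit of skipping unfamiliar frames and finding your own can be sketched in Python with the standard `traceback` module. This is only an illustration: R prints its stack traces differently, and `parse_settings` and `own_frames` are hypothetical names, with the `json` module standing in for package code you did not write.

```python
import json
import traceback


def parse_settings(text):
    # "Your" code: it calls into the json package, where the error is actually raised.
    return json.loads(text)


def own_frames(exc, own_filename):
    """Keep only the stack frames whose source file is `own_filename`."""
    return [
        (frame.name, frame.lineno)
        for frame in traceback.extract_tb(exc.__traceback__)
        if frame.filename == own_filename
    ]


try:
    parse_settings("{not valid json")
except json.JSONDecodeError as exc:
    # Every function that was on the call stack when the error occurred.
    every_frame = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]
    # Only the frames that come from the file where parse_settings is defined.
    mine = own_frames(exc, parse_settings.__code__.co_filename)
    print("all frames:", every_frame)
    print("my frames: ", mine)
```

The full list includes functions inside the `json` package; the filtered list contains only the function name and line number in your own file, which is usually where to start reading.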

So that's one. The second thing you can do that's super useful is, especially in Shiny, never forget that there is a JavaScript console also that might be printing errors. And this is, like, I feel really bad about this. We should do things in Shiny to surface this more prominently. But when you're in your browser looking at your Shiny app, if things are not behaving the way you expect and you don't see any errors in the R console, you can show the JavaScript console in your browser. And if you see a bunch of errors there, that's a big clue for you as well.

Often that means there's some kind of bug in a component that you're using, whether it's in Shiny or some kind of third-party component. But, you know, if you're raising a GitHub issue or whatever, that's some of the most useful information you can provide.

The third, and I'll move on after this, is that Shiny does demand that you write your code in a certain way. We have reactive outputs and reactive expressions, and you have to put your code in a certain place to get it to execute at the right time. That doesn't mean that your sort of data science logic, your data analysis, your data manipulation, and your visualization, has to live there. You can write functions that live off to the side. They can even live in a separate file. They can even live in a package if you want.

So you can write functions that are not Shiny. They're just R functions that perform the tasks that you want, and you call them from Shiny. You call them from your Shiny outputs. The advantage of doing it this way is that if something goes wrong, you can test these pieces in isolation from the console. So you can make sure that each of these functions that you create, that do your data manipulation, that do your data visualization, you can ensure that they are working correctly. And then your Shiny app collapses to just like a very small number of lines of code, and then it's much easier to reason about when things are being called and why.
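A minimal sketch of that separation, written in plain Python rather than with any Shiny API. The names `summarize` and `render_summary` and the sample data are hypothetical; `render_summary` stands in for whatever thin callback the framework would invoke.

```python
from statistics import mean


def summarize(rows, column):
    """Plain function with no framework objects, so it can be called from a console or a test."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        raise ValueError(f"no values found for column {column!r}")
    return {"n": len(values), "mean": mean(values), "max": max(values)}


def render_summary(get_rows, get_column):
    """Thin wrapper of the kind a framework would call: it only fetches inputs and delegates."""
    return summarize(get_rows(), get_column())


sample = [{"mpg": 21.0}, {"mpg": 19.0}, {"mpg": None}, {"mpg": 26.0}]

# The logic can be checked on its own, with no app running.
print(summarize(sample, "mpg"))

# The wrapper stays small enough to read at a glance.
print(render_summary(lambda: sample, lambda: "mpg"))
```

If the summary is ever wrong, `summarize` can be exercised directly with small inputs, without launching the app or clicking through the UI.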

You have to be able to reason about it

Yeah, the browser and breakpoints are super helpful when you need them, but if you, like on my team, I'm constantly using the phrase, can you reason about it? I mean, they must be so sick of me saying that, but like when you write code, it is not enough that it works. If it's important for it to be right, it's not enough that it works. You have to be able to reason about it, and I find like browser is the most helpful when you have lost the ability to reason about your code. So definitely use it, but also think to yourself, like if this was really the only way for me to figure out what was going on, maybe it's time to refactor my code a little bit also.

To go back to the question right before this, there was a follow-up question about, like, could you say a little bit more about what you mean by can you reason about it?

Yeah. It's my favorite topic. Yeah. So I think just to sort of reiterate, when you write software or do an analysis or whatever, there's a level of I got it to work. And then there's the level of, like, it works and I can reason about it. And what that means is, like, software, complex pieces of software are among the most complicated things that humankind has ever devised. You know, like, what other human-made construct can have, you know, hundreds of millions of pieces and yet they're expected to all fit and work together so precisely that if one token is off, you know, rockets explode and, you know, you get the wrong answers.

And I don't know what the limit is, but, like, you know, even the smartest humans, there is some small number of variables you can hold in your head. There's a small number of operations that you can hold in your head at any one time. So when we work on software that's non-trivial, when we work on software that in its totality is more than the human mind, more than any one human mind can hold at any one given time, the main challenge in software engineering is about how do you take all this complexity and break it down into smaller pieces, each of which you can reason about, each of which you can hold in your head, each of which you can look at and be like, yeah, I can fully ingest this entire function definition.

I can read it, you know, line by line and prove to myself this is definitely correct if the functions that it's calling don't have bugs and if it's called in the right way. So with those caveats, like, if the things that this thing is calling are correct and I'm called correctly, then the result will be correct because the logic here is correct.

So software engineering at all but the most beginner level is a lot about this. How do you break up inherently complicated things that we're trying to do into small pieces that are individually easy to reason about? And that's half the battle right there. The other half of the battle is how do we combine them in ways that can be reliable and also easy to reason about? So it's these two pieces, small pieces reliably composed. If you can achieve that, that's what I'm talking about. Like, that's software that you can reason about.
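One way to picture "small pieces reliably composed" is the following Python sketch. The functions and the sample readings are invented for illustration; the point is only that each piece is short enough to verify by reading, and the composition is correct if each piece is correct and called correctly.

```python
def parse_temperatures(lines):
    """Turn 'city, temp' strings into (city, float) pairs."""
    pairs = []
    for line in lines:
        city, temp = line.split(",")
        pairs.append((city.strip(), float(temp)))
    return pairs


def to_celsius(pairs):
    """Convert Fahrenheit readings to Celsius, rounded to one decimal place."""
    return [(city, round((temp_f - 32) * 5 / 9, 1)) for city, temp_f in pairs]


def warmest(pairs):
    """Return the (city, temp) pair with the highest temperature."""
    return max(pairs, key=lambda pair: pair[1])


def warmest_city_celsius(lines):
    """Composition: correct if each piece above is correct and is called correctly."""
    return warmest(to_celsius(parse_temperatures(lines)))


readings = ["Boston, 50", "Seattle, 59", "Austin, 86"]
print(warmest_city_celsius(readings))
```

Each function can be held in your head on its own, and the last one is a single line whose correctness follows from the three before it.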

So this has implications for data science as well, right? Like, when you're doing data science, you're doing some kind of analysis on some data and it starts out as, oh, I'm just doing these simple things. I'm, like, doing this manipulation and then I'm doing this visualization. But then as you get deeper and deeper into it, it grows and grows and grows and grows. And you're at the point where you're at the end and you don't remember where these variables came from. You don't remember, like, what's the difference between this data frame and this data frame. And you go back and hopefully start breaking it into functions. Or somehow dividing it into smaller pieces that each focus on a thing and then you join those pieces together in your overall script. So that's this principle of small pieces individually able to be reasoned about.

And when you think about, like, other rules that you might have heard about software engineering, like, we know when you're writing functions, using global variables is bad. That's another one of these things where that hurts your ability to hold the entire function in your head and to prove that it'll work correctly because who knows who's setting that global variable to what value. So you can't prove to yourself, like, this function is definitely correct.
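A small Python sketch of why shared global state makes a function harder to reason about. The `SETTINGS` dictionary and both function names are hypothetical; the contrast is between a function whose result depends on state someone else can change and one whose dependencies are all in its signature.

```python
SETTINGS = {"threshold": 10}  # shared state that any other code could modify


def count_large_shared(values):
    # Correctness depends on whatever SETTINGS holds at the moment of the call.
    return sum(1 for v in values if v > SETTINGS["threshold"])


def count_large(values, threshold):
    # Everything this function depends on is visible in its signature.
    return sum(1 for v in values if v > threshold)


data = [5, 12, 20]

before = count_large_shared(data)  # threshold is 10 here
SETTINGS["threshold"] = 15  # some distant code changes the shared state
after = count_large_shared(data)  # same call, different answer

explicit = count_large(data, 10)  # unaffected by the change above
print(before, after, explicit)
```

Reading `count_large` alone is enough to know what it returns for given arguments; reading `count_large_shared` alone is not.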

So I think I, like, if there's anything that you're working on that needs to be correct, like, if you're just, like, zipping off some analysis and, like, it really doesn't matter, you just want a pretty picture or something like that, whatever. But if you do care that the answer is right, I try never to stop when I have an answer, but I can't reason about the code. I always try to go back and do the refactoring that's necessary just so I can prove to myself that the answer is right. Oh, and the other big benefit to this is that those individual pieces, if it does turn out that there's a mistake somewhere, you can individually debug, test, unit test those individual pieces. And when there are problems, you'll much more easily be able to find them.