Gordon Shotwell - An overview of Quarto, and Jupyter

Gordon Shotwell, Data Scientist & Product Manager @ Socure. Gordon is a data scientist and product manager at Socure, where he helps data people build better software.

Oct 19, 2022

14 min

Halihax Halifax Startups Technology Data Science Quarto


Transcript

This transcript was generated automatically and may contain errors.

Hi, everyone. My name is Gordon Shotwell. I'm a data scientist at a company called Socure. We do fraud prevention. And I'm going to talk to you about Quarto and Jupyter. How many of you are Jupyter users? Yeah? Okay. So, this is a website. You can follow along at Quarto.Shotwell.ca if you are interested. So, a lot of my work is running a shared data science server. And Socure has grown like crazy over the last three years. So, it went from something like 20 users to more like 100. And this is kind of a ballpark of the editors that people use on the server. So, a lot of Jupyter users, some VS Code people, and some RStudio people.

What I noticed is that all of my headaches come from the Jupyter users. Every time there's a problem, every time something breaks or something can't be launched, I just say Jupyter. And I started asking, like, why is this? Why does this editor cause me personally so many problems? Not that I use it. I'm not a Jupyter user. But other people break my server using this tool.

The Jupyter development pattern problem

And I think the key problem is that Jupyter uses like a fundamentally different development pattern than other computer science tools. And this causes a lot of problems when you try to integrate it with any of these other computer science tools. A lot of the tooling around programming is built around a different paradigm than Jupyter. And the way I put this is that they differentiate between the editorial environment and the execution environment. There's usually a clear, bright line between the times when you're editing something and the times when you're running something. But Jupyter blends those two together. And that's really part of the design of the software. It's supposed to be an interactive notebook environment that lets you play with your data and move back and forth.

But it causes a lot of problems. So, some of these problems are because the user is also the one executing the code, and they might do that kind of out of order. And that state can become very complex. It can become difficult to reason about or record. The code itself can cause sort of deep, profound problems for your editor. So, you might run something weird, do some sort of odd memory thing. And then suddenly Jupyter just doesn't work for a long time. And you don't know why. It's very hard to figure out. It's not just like something crashing, but something kind of deeper happening to the software.

And then integrations with other computer systems can be difficult. In particular, Git can be difficult. So, this is the kind of JSON that a Jupyter notebook is stored as. So, when you just naively try to check in a Jupyter notebook to Git, or you try to use any of the systems that kind of rely on Git, like GitHub or something, as an input, you kind of end up with this thing. And this is really easy to make mistakes with. So, you can, for instance, check credentials into your Git repository. You can check in PII that's not supposed to be in a Git repository, just because it's this giant, unreadable JSON blob. It's very difficult to review.
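To make the "JSON blob" point concrete, here is a minimal sketch, not taken from the talk. It hand-builds a tiny notebook in the nbformat 4 layout and shows that the outputs and execution counters are saved right next to the code. The notebook contents and the `code_only` helper are illustrative assumptions.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Fraud report\n"],
        },
        {
            "cell_type": "code",
            "execution_count": 7,  # changes every time the cell is re-run
            "metadata": {},
            "source": ["total = 1 + 1\n", "print(total)"],
            # Outputs are stored in the file, so data can leak into Git with them.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}


def code_only(nb: dict) -> str:
    """Return just the source of the code cells, dropping outputs and counters."""
    chunks = [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]
    return "\n\n".join(chunks)


raw = json.dumps(notebook, indent=1)  # roughly what lands on disk as .ipynb
source = code_only(notebook)

print(len(raw.splitlines()), "lines of JSON on disk")
print(source)
```

Two lines of actual code turn into dozens of lines of JSON, and anything the cell printed is checked in along with it.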

So, recently I came across this wonderful blog post about how the Jupyter Git problem has been solved. And the thing that I kind of was thinking about when I was looking at this is that it just shouldn't take this long, right? It shouldn't... Git is a basic sort of thing that's existed for a long time. It works with almost everything. Like, it shouldn't take, you know, like, millions of dollars and ten years of dedicated work to get the system to function on Git.


So, the way I'd sort of put this, to kind of break it out a little bit and put a little graph on it, is that most programs use this kind of write-execute model. So, you're writing your code, and then you have this source code, which is kind of like a rigid... It determines what the outputs are gonna be, right? So, there's this kind of executor. It goes and reads this source code, sources it, and then produces some outputs. This creates a lot of wonderful things because the source code actually is a useful data structure. It's a thing you can sort of track over time. It's something you can analyze to do static code analysis or something like that. And this executor kind of always does the same thing. Like, there's no humans in this part. So, you kind of have confidence that once you know what's happening here in the source, you know what's gonna happen here in the outputs.

But Jupyter uses what I would call kind of like a re-execute model. You write the code, and then that kind of creates this sort of code output construct. And then when you're running the code or working through it, that's actually modifying the code output structure. And there's ways that you can sort of parse the code out of that. But the tool itself kind of pushes users to have this kind of highly unsteady interactive workflow. And I think that's one of the things I've observed about people who are very, very skilled and excellent data scientists. Because they maybe learned to do this as the way that they work, they kind of have a little bit of a moat around moving to more general computer science workflows.

The R Markdown write-execute pattern

So, R actually had a whole different pattern for notebook development, which is these write-execute notebooks. So, when you write an R Markdown notebook, this is RMD style, you actually just write the code and text together in a single document. And just like source code, that document is kind of a rigid artifact. It's just text. There's no logic in the R Markdown notebook. There's no state in the R Markdown notebook. It's just text. And what the renderer does is take that document, go find all the things that are labeled as code, pass that code through an interpreter, and then put the code and the text back together into these output formats.
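The render step described here can be sketched in a few lines. This is a toy under stated assumptions, not how R Markdown or knitr is actually implemented: it only handles chunks labeled `{python}`, it captures printed output only, and `render`, `CHUNK` and `FENCE` are names made up for the illustration.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built this way so the example contains no literal code fence

# Matches a chunk that opens with a fence plus {python} and ends at the next fence.
CHUNK = re.compile(FENCE + r"\{python\}\n(.*?)" + FENCE, re.DOTALL)


def render(document: str) -> str:
    """Run every python chunk top to bottom and splice its printed output back in."""
    namespace: dict = {}  # fresh for each render, so no state survives between runs

    def run(match: re.Match) -> str:
        code = match.group(1)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        # Emit the code as a plain code block, followed by a block with its output.
        return f"{FENCE}python\n{code}{FENCE}\n\n{FENCE}\n{buffer.getvalue()}{FENCE}"

    return CHUNK.sub(run, document)


doc = f"""# Report

Some text.

{FENCE}{{python}}
x = 1 + 1
print(x)
{FENCE}
"""

rendered = render(doc)
print(rendered)
```

The input document is plain text with no stored state, and every render starts from an empty namespace and runs the chunks in order, which is the property the write-execute pattern relies on.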

And this is a really wonderful pattern, because most of the things work right away, right? So, there hasn't ever been like an R Markdown Git parser, right? Because R Markdown documents are just text, so they just work with Git, right? It's just like any other computer source code from when there were punch cards, right? It's text that's then interpreted. Similarly, if you ever use GNU Make or any of those orchestration tools, they work really well with this code, because that's kind of what they're expecting. They're expecting text that's then run through some kind of interpreter. Similarly, it's really, really easy to compose R Markdown notebooks. You could either copy and paste them together or parse them together in some other format. It's easy to package libraries, and it's also really easy to mix languages. So, you can, I think, use something like 17 different computer programming languages in an R Markdown document, and they'll all play well together, because it just goes and finds the Python code, runs it through a Python interpreter, goes and finds the Scala code, runs it through a Scala interpreter, right? And these are all kind of well-solved problems in computer science.

But there's three big problems. The first problem is that it starts with the letter R, right? And so, I work with a lot of Python users, and, you know, it's their least favourite letter. If you propose any kind of tool or solution, and that solution starts with R, or even has an R in it, they just say no. The second, maybe more real, problem is that it requires an R runtime, right? So, R is a very niche programming language. It's not necessarily installed on everybody's laptop. You know, it's not installed on every server, so that can be difficult. And the last one is that it has a slightly inconsistent user interface. So, R Markdown is a ten-year-old technology, and over time, it's kind of been elaborated on to do a lot of different things. So, like, people write their blogs in it, I write my blog in R Markdown. You know, you could write presentations in R Markdown, books, and then also just kind of data science reports. And over time, it's accommodated all of those use cases, kind of started adding different options, and those options started drifting in terms of how consistent they were.

Introducing Quarto

So, this is where Quarto comes in. Quarto is a new project from RStudio, and the main feature of it is that it's a fully language-agnostic version of R Markdown. So, it's a TypeScript command line utility. It doesn't have any R or Python dependency. You can use it for just R, you can use it for R and Python, or Observable, or JavaScript, or Julia. It also has a lot better branding. Speaking personally, I've had many, many times at work where I've just seen people waste, like, huge amounts of money on things that really could have been a scheduled R Markdown document, right? And I have not been able to convince a single one of them to use a scheduled R Markdown document to solve their problem. But every one of them, as soon as I talk about Quarto, they're really excited about Quarto, because the capabilities that it has and how simple it is are really powerful. And it doesn't start with the letter R, so it kind of, like, gets through the gate. And then, lastly, it's got a unified extensible interface. So, the interface is focused a little bit more on Pandoc, which I'll get to in a second. And so, that means that it's a lot more consistent and easier to work with.

So, how does this work? So, you write a QMD document, which looks a lot like an RMD document, if you're familiar with that. So, it's Markdown with sort of specific code chunks. And then, when you render that document, Quarto goes and passes the code through an executor, or through an interpreter, and then puts together this Markdown document. So, the output of the code is turned into Markdown, and the text is put where the text goes, just as Markdown. And this goes to a system called Pandoc, and Pandoc is this really incredible tool for turning documents into other documents. So, it has, basically, what's called an abstract syntax tree on top of documents. So, it says, like, you know, for a PDF, this part of a PDF kind of maps on to this abstract syntax tree, which then maps down to a Word document, or a PowerPoint document, or something like that. So, with Pandoc, you can take this Markdown file and render it into, you know, lots and lots and lots of different formats. So, this slideshow is done that way. You can do HTML, of course, websites, books, all these types of different things.

This is what the code format looks like, if you're totally unfamiliar with it. So, this is just something you can write in a text editor. And, basically, there are sort of two things. So, most of this is Markdown notation. So, those are, like, headings. And then you have these little backticks that define, like, inline code. So, this would be, like, an inline code chunk. And then you could have code chunks that sort of specify the language here, and then you can just run code. So, what Quarto does is it goes and it finds anything like this, passes it through the interpreter that's labeled by this thing, and then produces the output. So, in this case, that would produce an output that looks like this, right? It does the calculation, puts out the text. It's like that.
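Since the slide is not part of the transcript, here is a stand-in for what a small QMD file can look like, along with a sketch of the "find the chunk, read its language label" step. The file contents and the `QMD`, `CHUNK` and `chunks` names are illustrative assumptions, and the regex is a simplification rather than Quarto's own parser.

```python
import re

FENCE = "`" * 3  # built this way so the example contains no literal code fence

# A small .qmd document: YAML header, Markdown text, and two labeled code chunks.
QMD = "\n".join([
    "---",
    'title: "Example report"',
    "format: html",
    "---",
    "",
    "## A heading",
    "",
    "Plain Markdown text with `inline code`.",
    "",
    FENCE + "{python}",
    "print(1 + 1)",
    FENCE,
    "",
    FENCE + "{r}",
    "summary(cars)",
    FENCE,
    "",
])

# Capture the language label in braces and the chunk body up to the closing fence.
CHUNK = re.compile(FENCE + r"\{(\w+)\}\n(.*?)" + FENCE, re.DOTALL)

# Each entry is (language label, code), i.e. which interpreter gets which code.
chunks = CHUNK.findall(QMD)

print(QMD)
for language, code in chunks:
    print(f"{language}: {code.strip()}")
```

Everything outside the labeled chunks is ordinary Markdown, so the whole file diffs cleanly in Git, and the label in braces is all a renderer needs to decide which interpreter to call.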

So, the main sort of advantages of this is that it moves you from this kind of, like, fuzzy place where you're not really sure what your source code is. Like, is it the sort of document you've produced, or is it the one where you produced and then ran some chunks? Instead, you have this sort of clear idea of, like, the QMD file that you're working with. That's the source code, right? So, that's the thing that you can check into Git. You can share with people. You can test more easily. You can analyze more easily. It gives you all the benefits of Pandoc. So, you can render really beautiful sort of full-featured PDFs. You know, there's lots of different academic journal formats, things like that. Great caching, and you can compose notebooks together. So, you can have lots of notebooks together that make a website. If you want to look at a website like that, you can look at my website.

But you don't actually need to abandon Jupyter to start using this. So, you can still use Jupyter as your main editorial environment, because Quarto can render Jupyter notebooks without alteration and preview them. And then you can also convert really easily back and forth between a Jupyter format and a Quarto format. So, you kind of don't need to actually change much about your work. And then one thing that I would recommend is nbdev, which is by Jeremy Howard, who does fast.ai. It kind of uses Quarto as basically the publication layer on top of that kind of notebook development pattern.
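For reference, the Quarto command line has `render`, `preview` and `convert` subcommands that accept a notebook path directly. The sketch below only builds and prints those commands, so it runs without Quarto installed. The file names and the `quarto_command` and `run_if_available` helpers are assumptions for illustration.

```python
import shutil
import subprocess

ACTIONS = {"render", "preview", "convert"}


def quarto_command(action: str, path: str) -> list[str]:
    """Build a Quarto CLI invocation for a notebook or .qmd file."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["quarto", action, path]


def run_if_available(command: list[str]) -> bool:
    """Run the command only when the quarto executable is on PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# analysis.ipynb and analysis.qmd are hypothetical file names.
commands = [
    quarto_command("render", "analysis.ipynb"),   # notebook to rendered output
    quarto_command("preview", "analysis.ipynb"),  # live preview while editing
    quarto_command("convert", "analysis.ipynb"),  # notebook to .qmd
    quarto_command("convert", "analysis.qmd"),    # .qmd back to notebook
]

for command in commands:
    print(" ".join(command))
```

Calling `run_if_available(commands[0])` would actually render the notebook on a machine where Quarto is installed and the file exists.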

Getting started with Quarto

So, the way to get started, you can go to Quarto.org. It's also included with the latest release of RStudio. So, you don't need to install anything. The RStudio visual editor is really wonderful, whether or not you're an R user. It's a really great visual Markdown editor, which kind of lets you use these things really well. There's a VS Code extension that's also really good.

And so, the main sort of rule here is that what you should try to do is store the code asset as a QMD document. So, if you're working with Jupyter notebooks, great, work with them. And then you want to at some point write and check in this text file that's just the document and the report. And when you share it, you should share it with that thing and then probably some kind of environment definition, like if you use Conda or virtualenv or something like that. And the thing that you share with people who want to read the code, who aren't going to go execute and work on the code, but just want to read the report, is the rendered document, which can include both the code that you used to do it as well as the prettily formatted output. Yeah, and that's what I've got.

So, you can contact me at my website or tweet me at @gshotwell.

Featured software