Resources

J.J. Allaire - Publishing Jupyter Notebooks with Quarto | PyData Seattle 2023

www.pydata.org

Quarto is a multi-language, open-source toolkit for creating data-driven websites, reports, presentations, and scientific articles. Quarto is built on Jupyter, and in this talk we'll demonstrate using Quarto to publish Jupyter notebooks as production quality websites, books, blogs, presentations, PDFs, Office documents, and more. We'll also cover how to publish notebooks within existing content management systems like Hugo, Docusaurus, and Confluence. Finally, we'll explore how Quarto works under the hood along with how the system can be extended to accommodate unique requirements and workflows.

PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

Transcript

This transcript was generated automatically and may contain errors.

I'd like to introduce Allaire here. Thanks. Thank you very much. And thanks everybody for coming.

I'm here today to talk about publishing Jupyter Notebooks with Quarto. I'm J.J. Allaire. I'm from Posit. Tracy will be in front. Just for Tracy. She's from Posit. Posit used to be called RStudio. So you may not have heard of Posit. You probably know it by the name RStudio.

So, we're going to talk a lot about Quarto today. But before we talk about Quarto, I want to talk at a very high level about Jupyter Notebooks and what is special about them.

And for those of you who are users of Jupyter Notebooks or curious about Jupyter Notebooks, passionate about Jupyter Notebooks, I really strongly recommend you read this paper that was written by Brian Granger and Fernando Perez who actually created Jupyter Notebooks. And they talk about what's special and distinctive about Jupyter Notebooks.

And of course, as we all know, and we all probably benefit from all the time, Jupyter Notebooks are an interactive computing environment. But more than that, they, you know, my emphasis here, help humans to think and tell stories with code and data. Narrative and writing are a fundamental part of using a Jupyter Notebook.

And it really is important in two dimensions. How many of us have found that writing about how we're going to solve a problem actually affects how we solve the problem? Or that writing about what we're going to do in the code actually affects what we do in the code? And how many of us have produced output, a visualization, a model result, various figures, where there's additional context required to understand what we produced? And that's the telling stories part.

The importance of narrative in data communication

So reflecting a little more on this idea of telling stories and telling the whole story, some of you may have read Edward Tufte's pamphlet, which is sort of a takedown of the reductive style of PowerPoint for communicating about data specifically. He has a bunch of examples in the pamphlet. One of them specifically actually has the PowerPoint deck that was used to greenlight the ill-fated space shuttle launch. And some of the figures they presented were very much presented in isolation with some very, very short bullets on a slide. And there's actually quite a bit of important context that was needed to understand those figures. So he's urging us to do more when we communicate about data.

Actually, a podcast that I just heard the other day was an interview with Bill James. How many of you here have heard of Bill James? So you've probably heard of the movie Moneyball. And Bill James was actually the person who kind of started the revolution that led to Moneyball. And anyway, he kind of invented sort of this idea of deep data analysis of baseball and sports. But he's actually, as things have gone on, become kind of disillusioned with where that movement has gone because it's also become quite reductive, where people are obsessed with producing a single number, wins above replacement, that fully encapsulates the value of a baseball player.

And when he used to write and do data analysis about baseball, he had quite a bit of narrative that went along with it. So this idea of narratives that describe our assumptions, constraints, qualifications, that go along with the data and visualizations that we present is really fundamentally important. And that's this idea of telling stories, computational narratives, telling stories about data. And that's kind of where Jupyter fits, and also, as you'll see in this talk, Quarto fits.

So before I get into some of the specifics, I want to acknowledge there's quite a bit of history in building tools to help people with sophisticated technical narratives, interactive computing, weaving those together. That goes back to TeX, but it also goes back to notebooks, various implementations of Markdown, Jupyter itself. And all these tools have kind of, I think, really woven themselves together in recent years to create a very compelling environment for computational narratives and storytelling with data.

Background on Quarto

On to Quarto. I spend most of my time now working on Quarto, but prior to that, I worked on the RStudio IDE along with Joe Cheng, who's here, I think, and who some of you may have heard give a talk about Shiny for Python yesterday. So I worked on the RStudio IDE. I worked on R Markdown, which was a predecessor to Quarto that was R-specific. And I also worked on the R interface to Python and the R interfaces to a bunch of other machine learning and distributed computing libraries.

And I worked on R Markdown for a long time, and I became very convinced that the ideas behind it were sound and valuable, and customers were getting a lot of value out of it. But I was discouraged by the fact that it was really only reaching a small slice of the people that could benefit from it because it was R-specific. And so I've actually spent the last few years focusing on a new project, Quarto, which is an open source project. It's kind of a ground-up rewrite of some of the things we did in R Markdown, but it's multi-language and multi-engine.

So we'll see this diagram a couple times in this talk, and we'll go into different levels of depth about it, but I want to give you a really, really high level. What is Quarto? It's taking computations and narrative from Jupyter Notebook. It's Markdown plus the code and whatever it produced. That all goes into Markdown, and then it's transformed by a tool called Pandoc, which we'll talk about a little bit more. It's a Markdown rendering engine, into lots of different types of output. That's kind of the core mechanic, and we'll drill into more depth at these different steps and what's important and how you can customize it and hack the system.

So the core roadmap for what I want to cover in today's talk: I want to talk a little bit more about supporting the idea of computational narratives, talk a little bit about the specific requirements around technical communications. This is distinct from writing in Gmail or writing in Google Docs. We deal with lots of special types of content and assets when we do technical communication. This idea of semantic authoring, where we can write once and publish in many places. A little bit more getting into notebooks, the pros and cons and different uses and ways they can be melded and tooled. And then a little more about how the system works and how you can hack and extend it and make it do exactly what you want it to do.

How Quarto works

So first, how do we support computational narratives? Well, the core requirements, it's kind of what I just described. You've got to be able to render executable content from Jupyter. You've got to be able to include these kinds of technical content types. You've got to be able to then take Markdown and turn it into a lot of different output formats. This is, you know, Quarto does this, but it's not the first tool to do it. In fact, there's a tool called nbconvert that's been around since the dawn of Jupyter, I think, that does this sort of thing. And then there's also a couple newer projects, Jupyter Book and MyST.js, that also do it. So this talk, I'm going to focus very much, kind of talk about the tooling for computational narratives through the lens of Quarto, but it's important to know these other tools exist. They share a lot of the same features, and they're definitely all worth taking a look at.

So, how does Quarto work? You can see here at the command line, where we have a notebook, and we say, quarto render the notebook, and that kicks off this pipeline. The notebook may or may not have been executed. It could be that you've already executed it interactively in JupyterLab or VS Code, or maybe that you'd like to fully re-execute it. I'll show you this in a minute. We're going to want to tailor the output that we get out of the notebook that goes into Word or a website or a PDF or a presentation. We do that by providing options. And then, what you see at the top, once we have our notebook ready, we just quarto render, and we get the output that we're looking for.
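As a rough illustration of the command-line step described above, here is a minimal Python sketch that assembles and runs a `quarto render` invocation. The notebook name `penguins.ipynb` and the helper names are hypothetical; the sketch assumes the Quarto CLI is installed and that `--to` selects the output format while `--execute` asks for the notebook to be re-run rather than rendered from its stored outputs.

```python
import shutil
import subprocess
from typing import List, Optional


def build_render_command(
    notebook: str, to: Optional[str] = None, execute: bool = False
) -> List[str]:
    """Assemble a `quarto render` command line for one notebook."""
    cmd = ["quarto", "render", notebook]
    if to is not None:
        cmd += ["--to", to]  # target format, e.g. "html", "docx", "pdf"
    if execute:
        cmd.append("--execute")  # re-run the cells instead of reusing stored outputs
    return cmd


def render(notebook: str, to: Optional[str] = None, execute: bool = False) -> None:
    """Run the command, but only if the quarto CLI is actually on PATH."""
    if shutil.which("quarto") is None:
        raise RuntimeError("quarto CLI not found on PATH")
    subprocess.run(build_render_command(notebook, to, execute), check=True)


if __name__ == "__main__":
    # Hypothetical notebook name, used only to show the resulting command line.
    print(" ".join(build_render_command("penguins.ipynb", to="html", execute=True)))
```

Keeping the command construction separate from the `subprocess` call makes it easy to inspect what would be run before anything is rendered.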

Just a little bit about Pandoc before I get into a bunch of examples. Pandoc is really, by itself, an extraordinary piece of software that we're building on top of, and it's a lot of the reason why we're able to implement the features we implement and support all of the output formats we support. It was actually created not quite 20 years ago, but over 15 years ago, by John MacFarlane, who's also one of the folks behind CommonMark. Pandoc has CommonMark, plus a lot of extensions for technical writing: definition lists, citations, more sophisticated tables, things like that. It supports dozens of output formats, and I'll get into that later, but just about any output format you can name is supported by Pandoc. And most importantly, it's very extensible. You can write custom readers, custom writers. You can filter. The documents are read into an AST that you can compute on. So it's a great foundation for building a publishing platform.

Output formats and document options

Okay, so let's start. Let's do a few examples, and this example will not be terribly impressive. It's something you've seen before. We start with a Jupyter notebook. It's just printing a few rows from a table and plotting, doing a plot. And we take that and we render it into a web page. You've seen this before. Again, nbconvert, lots of tools do this. Hopefully we made the web page a little bit more attractive than you might usually see, but very straightforward transformation. You see all the code, and you see all the outputs.

Okay, so now let's talk about the different ways you might tailor this output with options. Here we're actually gonna add some document-level options. We're gonna change the theme of the output, the kind of overall CSS template of the page. We're gonna change the code highlighting style, and we're also gonna say that we want to allow viewers of the document, readers of the document, to add comments to it using Hypothesis. So when we take this notebook and we render it, we see now it has a different look. The code is indeed highlighted using a different highlight style, and you can see on the far right there's this commenting bar that can be used for people to make comments on the notebook.
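The slide with the actual options is not part of the transcript, so the following is only a sketch of what document-level options of this kind can look like. It assumes the options live in a YAML front-matter block held in a raw cell at the top of the notebook, and that `theme`, `highlight-style`, and `comments: hypothesis` are the relevant option names; the values `darkly` and `monokai` are illustrative choices, not the ones used in the talk.

```python
import json

# Illustrative front matter; the option values are assumptions.
FRONT_MATTER = """---
title: "Example notebook"
format:
  html:
    theme: darkly
    highlight-style: monokai
comments:
  hypothesis: true
---"""


def notebook_with_options(front_matter: str) -> dict:
    """Build a minimal nbformat-4 notebook whose first cell is a raw YAML cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            # Raw cell carrying the document-level options.
            {"cell_type": "raw", "metadata": {}, "source": front_matter},
            # An ordinary code cell follows.
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": "print('hello')",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(notebook_with_options(FRONT_MATTER), indent=1))
```

Writing the printed JSON to a file with an `.ipynb` extension gives a notebook that carries its rendering options along with its code.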

So those are document-level options. There's also cell-level options. So I want to focus on the cell-level options, but also note we've added some different document-level options. Here we've said we'd actually like to hide the code, but make it folded so that users can see it if they want, and we wanted to add some tools for letting the users kind of globally enable and disable source code. So that's some different global options or document options. And then here we've said that we want to provide a caption for our figure, and we want to provide a label so that we can cross-reference it. So now when we render this, you can see the code is indeed collapsed, folded. There's this code menu up here that we'll demonstrate in a minute, and you can see that the figure does have a caption, and it's cross-referenceable. When I open that code menu and say show all code, you can see the code cells open.
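To make the cell-level options concrete: in a Python cell they are written as specially prefixed comments (`#|`) at the top of the cell, with `label` and `fig-cap` carrying the cross-reference label and the caption. The sketch below shows such a cell as a string and a small helper that separates the option lines from the code. The helper is hypothetical and is not how Quarto itself parses options; the label and caption text are made up for illustration. The document-level folding behaviour described above is, as far as I know, controlled by `code-fold` and `code-tools`, which are not shown here.

```python
from typing import Dict, Tuple

# A cell as it might appear in a notebook; label and caption are illustrative.
CELL_SOURCE = '''#| label: fig-bill-sizes
#| fig-cap: "Bill length versus bill depth"
print("plotting code would go here")
'''


def split_cell_options(source: str) -> Tuple[Dict[str, str], str]:
    """Split leading `#| key: value` lines from the rest of a cell's code."""
    options: Dict[str, str] = {}
    lines = source.splitlines()
    index = 0
    # Option lines must come first; stop at the first ordinary line.
    while index < len(lines) and lines[index].startswith("#|"):
        key, _, value = lines[index][2:].partition(":")
        options[key.strip()] = value.strip().strip('"')
        index += 1
    return options, "\n".join(lines[index:])


if __name__ == "__main__":
    cell_options, code = split_cell_options(CELL_SOURCE)
    print(cell_options)
    print(code)
```

Because the options are ordinary comments, the cell still runs unchanged in any Jupyter front end.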

So these are just examples. There's actually, believe it or not, hundreds of options that affect how documents look and behave, and dozens of options that affect how computations are presented. It's extremely flexible and powerful. This is just an example of sort of how you interact with all those capabilities.

One of the things I want to emphasize is that, again, you've seen nbconvert. Making a web page from a notebook is not particularly impressive. GitHub does it, too, when you view a notebook on GitHub. But our goal is to go well, well beyond simple conversion, and some of those options that I was showing you hinted at that. We want you to produce professional production quality output in whatever format you need, and complex outputs, outputs like books, blogs, websites, presentations, really nice printed documents. All those things are our goal with the tool, not just kicking out the web page.

So a couple examples of this. Here I'm going to change the format to docx, and now when I render, I'm going to get a Word document. I've got an option that says I want to put the two plots side by side. That would be a little annoying to do in Word, and this just sort of does that automatically. So that's Word. We also have lots of tooling for creating PDFs. Here I'll show you, we're going to use a bibliography. We're going to say we actually want to put citations in the margin, because the optimal reading width is somewhere around like 600 pixels, and so on a printed page, you actually have extra room in the margins to do things. We'll put the citations there, and then we'll actually put our figures in two columns, and we'll use the entire page, again, using the margin. So you can see here, I've got my optimal reading width, I've got the citation in the margin, and then I'm using the entire width of the page for the plot. So lots of tools like this to make very tailored, readable output.

This presentation actually was created with Quarto. We can create PowerPoint decks, Beamer decks, and Reveal.js decks. This is Reveal.js. This is an example of creating slides, and here I've said I want to put a logo on my slides and include a slide number in the upper right-hand corner. Here you can see now I've got a slide that has the two plots, the logo and the numbers that are indicated by the options.

This is just showing a couple of plots. You might actually be trying to teach people about how to code a plot. That's a common application. You're sort of doing an internal presentation, trying to teach people how to use an API or a library. And so here, we're actually going to say, echo true, meaning please show the code, and we actually want to highlight specific line numbers so that as we're talking through the code, the audience can focus on the right thing at the right time. So here you can see we've got echo. The code is shown. Here, just line four is highlighted, and then when I advance the slide, it's going to highlight line five. So this is a tool for sort of doing technical explanations of code. And there's lots of other things related to code and slides that support internal presentations.

The project system: websites, blogs, and books

So far, I've just shown you single documents, but Quarto also has a project system for aggregating together multiple documents, and that might be just sharing settings and assets like CSS, or it might be to create a website or to create a book. That's the project system. So you can see here, I've got my Quarto project config file. I say it's a website. I provide some website options, and then I say all the HTML that's in this website should share this theme and these extra CSS styles. That's a really simple project. I'll show you some examples of some more sophisticated projects.
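The project file shown on the slide is not reproduced in the transcript. As a sketch only, the helper below writes out the kind of minimal website configuration being described, assuming the file is named `_quarto.yml` and uses the `project`, `website`, and `format` keys; the function name and all values passed to it are hypothetical.

```python
def website_config(title: str, theme: str, css: str) -> str:
    """Return the text of a minimal website project config (assumed key names)."""
    return "\n".join(
        [
            "project:",
            "  type: website",  # tells Quarto this directory is a website
            "website:",
            f'  title: "{title}"',
            "format:",
            "  html:",
            f"    theme: {theme}",  # theme shared by every page
            f"    css: {css}",  # extra styles shared by every page
            "",
        ]
    )


if __name__ == "__main__":
    # Print the config; redirect to _quarto.yml in a project directory to use it.
    print(website_config("Example site", "cosmo", "styles.css"), end="")
```

Every document in the same directory then picks up the shared theme and stylesheet without repeating them in its own options.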

This is the website for the fast.ai deep learning for coders course. Here you can see there's a little more going on. This is pretty similar, but here we've got some social links. We've got some tailoring of how links to GitHub work, and you can see now this is a full website. So this is quite a few documents, maybe 20 different documents linked in the sidebar. You can also link to them in the nav bar and have a nav bar menu or have a combination of those things. So this is a whole website done with Quarto, written, again, in notebooks.

Here's a blog. Similarly, we have some social stuff, Google Analytics, a little bit of customizing the behavior of the nav bar and when it collapses. Here the home page looks a little different. It sort of has tiles for all the different blog posts. So this is a blog done as a project. And then an example of a book. Here you can see, again, some similar options related to the Git repo where the book is developed, but also here now we've actually indicated the chapters and the order of the chapters. And then when we render the book, you can see we have numbered chapters. We actually have numbered sections and subsections, and we actually can resolve cross-references across chapters.

So one of the interesting things about books, this book is a web-based version of the book, but, obviously, it's not super helpful if you actually want to print a book or publish the book for Kindle or iBooks users. So books actually are multi-format. So you can think of a book as a website that has extra features, but you can also take the same notebooks and produce a PDF book, a Word version of the book, EPUB, or AsciiDoc. And AsciiDoc is actually the format that's used by O'Reilly for their books. So there's a bunch of people writing O'Reilly books using Quarto and then just taking advantage of the AsciiDoc output and then feeding that to O'Reilly.

Technical communication requirements

Okay, let's talk a little bit about some of the specific requirements of technical communication and how Quarto supports those. So what's different about technical communication? I found this interesting that Google Docs now supports code blocks, but that's as of, like, five months ago. So if you're doing technical communication, you definitely want to be able to display code. So my point here is that tools like Microsoft Word and Google Docs are not made for technical communication. Sophisticated presentation of source code, citations and cross-references, math, diagrams, lots of things you want to do that are not well-supported by general-purpose tools.

So how do we have support for these things? I showed you earlier some of the things about code and customizing the highlighting and making it hideable. We've got some other things if you're trying to explain code where if you've got, you know, here's a code block. I want to explain these three lines of code. I've got numbers next to them, and then I've got narrative explaining each line of code below. And the way you do that is you have your line of code you want to write narrative about, you have a list below, and so on and so forth. So that's code annotation. There's lots and lots of tools for customizing how source code is displayed.

Diagrams, for those of you who haven't done this, it's actually a pretty cool way to create diagrams. You can actually write diagrams in plain text. This is a Mermaid diagram, and in fact, GitHub now supports Mermaid diagrams, you know, in GitHub-flavored Markdown. So if you want to write a diagram as part of a pull request or on an issue or in a GitHub wiki, you can use Mermaid. And then Graphviz is an older text-based diagramming system that's still in pretty wide use. So this code here that I put in a Quarto document or a notebook results in this diagram. So for those of you who have struggled with Visio or tools like that to try to wrestle a diagram into looking how you want, this is actually a significantly easier way to do it.

Equations, of course, lots of times you need to use equations in technical communication, and here's a couple examples of LaTeX equations. And then those get automatically converted. You know, for Word docs, those get converted into MathML. You know, for environments that understand LaTeX equations, they stay LaTeX equations. So those are supported for all output formats.

I showed you a little bit before figures and cross-references. Here I'm going to show a figure that actually has a caption that's a figure and two sub-figures. And to do that, I just say I've got a figure and I've got two sub-captions. And when I render this notebook, you can see it's got the figure, but it's also got two sub-figures, and those are referenceable individually. And there's lots of ways to customize figure panels. You can have two, three. You can have two on top of two and two on top of one. Lots of different ways to customize how figures are displayed.

I won't go into a lot of detail about this, but there's quite deep support for citations and bibliographies and lots of different output styles and lots of different ways of presenting citations. So that's one of the most significant kind of extensions to CommonMark that Pandoc makes, the ability to handle citations really well. And call-outs is another thing you see a lot in technical books and technical communication. You can see different types of call-outs that, you know, in some ways they can amplify the importance of content, but sometimes they can indicate to the user that you don't really need to read this unless you're a specific type of reader, and we support call-outs for just about all the most interesting output formats that are targeted by Quarto.

I talked a little bit earlier about using the margins because, again, the kind of optimum reading width is not 800 or 1,000 pixels. And so you can actually put content in the margin. Here's an example of a notebook where we put an equation in the margin or even a figure or a table in the margin. It's useful because there's ancillary information that you think the user might want to consult, but you don't want to burden the main narrative with it. It's very easy to stick things into the margin.

Semantic authoring

All right, a little bit more about this idea of semantic authoring. What do I mean by semantic authoring? Well, I'll define it a little bit by saying what is literal authoring? Literal authoring is writing or creating technical content or a report or an analysis kind of in the native format of a given medium. So I'm writing a dissertation in LaTeX. I'm creating a presentation with Keynote. I'm publishing a Confluence wiki. And that's a lot of the authoring that we do. But it has challenges.

I think one of the biggest challenges, especially when you have notebooks or really high-value technical content, is that these tools each have their own proprietary format, and it's quite awkward and time-consuming to take what you've got in a notebook, for example, in Markdown, and repurpose it across many different mediums. These tools typically have mostly non-existent support for the kind of technical content that I just talked about. They usually don't have any mechanism for including live code and its output. So usually you end up copying and pasting. And then when your notebook changes or your analysis changes, you've got to hopefully correctly update all your outputs. And that's what I'm alluding to. There's no real straightforward way to automate or reproduce computationally derived content.

So the idea, I think, is to move to, and this is really what Jupyter notebooks do and what Markdown does, move to semantic authoring where you don't literally double-click a word and say, this is bold, but you sort of write textually, this is a heading, this is italicized, this is code, this is a block quote. And then the underlying rendering engine can give it the right appearance in a given target format.

So again, semantic structure, really important when we get into talking about how the system can kind of be hacked and customized. Pandoc parses the document into this abstract syntax tree so the document is something we can actually compute on. We can filter it, we can change it, we can amend it, we can do lots and lots of things to the document after it's been parsed into this abstract syntax tree. That is not something you can do, again, with all these literal output formats. But here, since we've got the document in this generic form, it's very easy to compute on. Many of the features of Quarto and many of the extensions that you might want to use or write take advantage of this fact.

Jupyter integration and output formats

Let's talk a little bit about how that works with Jupyter. As we said before, Jupyter executes cells, creates this Markdown document, goes into Pandoc. Now, once it's in Pandoc, it's an AST. And you can write filters to transform the AST. Again, that's how we implemented many of the features in Quarto, cross-references, code-folding, layout, all kind of computing on this AST. And you can do the same thing. And then the final output gets rendered from the AST. Later on, I've got an example of doing that with Python.
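The Python filter example shown in the talk is not in this part of the transcript, so here is a separate minimal sketch of the idea, using only the standard library. It assumes Pandoc's JSON serialization of the AST, where every node is an object with a type tag `t` and contents `c`, and a `Header` node holds `[level, attributes, inlines]`. The transformation itself (upper-casing heading text) is just an illustrative choice.

```python
import json
from typing import Any, TextIO


def upper_headings(node: Any, in_header: bool = False) -> Any:
    """Return a copy of a Pandoc JSON AST fragment with heading text upper-cased."""
    if isinstance(node, list):
        return [upper_headings(child, in_header) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Str" and in_header:
            # Str nodes carry plain text in "c".
            return {"t": "Str", "c": node["c"].upper()}
        inside = in_header or node.get("t") == "Header"
        return {key: upper_headings(value, inside) for key, value in node.items()}
    return node  # numbers, strings (ids, classes), None: left untouched


def run_filter(source: TextIO, sink: TextIO) -> None:
    """Read a JSON AST from `source`, transform it, and write it to `sink`."""
    json.dump(upper_headings(json.load(source)), sink)


if __name__ == "__main__":
    # Small hand-written document: one level-1 heading and one paragraph.
    sample = {
        "pandoc-api-version": [1, 23],
        "meta": {},
        "blocks": [
            {"t": "Header", "c": [1, ["intro", [], []], [{"t": "Str", "c": "Hello"}]]},
            {"t": "Para", "c": [{"t": "Str", "c": "world"}]},
        ],
    }
    print(json.dumps(upper_headings(sample)))
```

Calling `run_filter(sys.stdin, sys.stdout)` from a script would turn this into a JSON filter, since Pandoc passes the AST to such filters on standard input and reads the modified AST back from standard output.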

But Pandoc itself, the power of having your documents in this generic form is that it now can create dozens of formats. I'm not showing all the formats that Pandoc has. These are some of them. There's the conventional document formats, presentation formats, many different varieties of Markdown, wiki formats, other formats you may or may not have heard of, but actually end up being important in different contexts. So it's an incredibly powerful and flexible system. You can even write custom writers to support systems that it doesn't itself natively understand.

An example of this, these are some things that we added to Pandoc. So Hugo, many of you are familiar with that. It's a very popular tool for building websites. It uses its own version of Markdown based on a library called Goldmark. Docusaurus is another product that is also used to create websites. It uses its own flavor of Markdown, MDX Markdown, where each Markdown page is actually an MDX application. Confluence, you've probably heard of, has its own XML format. I talked about AsciiDoc earlier. None of these are actually supported natively by Pandoc, but we extended Pandoc to be able to target these.

So if your organization uses Hugo to publish content and you have a notebook, you can now, with format Hugo, take that notebook, turn it into Markdown, and then have it work within a Hugo website. Similarly with Docusaurus, I've got the same notebook, and I write MDX Markdown, which is the form of Markdown understood by Docusaurus, and now my notebook shows up in the Docusaurus website. Similarly, Confluence. Lots of organizations use Confluence as a core knowledge management tool. It'd be great if we could push our notebooks up into Confluence. We wrote an extension of Pandoc that takes that AST and transforms it into native Confluence XML, so the same notebook now just shows up as a Confluence page inside a Confluence wiki.

Notebooks as a standard container

I want to talk a little bit about Jupyter notebooks, and I think people have different opinions about notebooks. You've probably seen or heard of, maybe even attended Joel Grus's talk, I Don't Like Notebooks, from JupyterCon 2018, or Jeremy Howard's response, I Like Notebooks. A lot of controversy about notebooks, a lot of hate, a lot of love, and I think the most important thing to recognize is for all the things they do well and don't do well, they are really the coin of the realm. They define a standard container for code, output, and related narrative that you can compute on, and that is really, really powerful.

The fact that a notebook can be published to GitHub, or it can be published to Colab and interacted with. It can be authored by lots of different tools. It's also a source of content for embedding other documents. Notebooks are very convenient for a lot of things, and I think they're a really important thing to build around when we think about tools for publishing data and data science.

I talked about it as an authoring tool. It's a REPL with embedded narrative that's awesome for this idea of thinking and telling stories with data. Lots of different notebook authoring tools are available, and all the file formats are completely compatible.

Now, for those of you who don't like the idea of authoring in a big JSON file, that's kind of one of the biggest complaints about notebooks is, wow, how do I diff them? It's really gross that it's all in a JSON file and you've just got code. Why is it in this big monolithic JSON file? You might not know that there's a package called Jupytext, which actually has not one, not two, not three, but ten different plain text formats to represent notebooks. Some based on Markdown, some based on just a Python script that has special comments in it. You can still use Jupyter and Jupyter notebooks, even in a plain text workflow.
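To give a feel for what a plain-text representation of a notebook looks like, here is a deliberately simplified sketch of the "percent" script style that Jupytext supports, where `# %%` starts a code cell and `# %% [markdown]` starts a commented-out Markdown cell. This is not Jupytext's implementation: it ignores metadata, raw cells, and outputs, and the function name is hypothetical.

```python
from typing import List


def to_percent_script(notebook: dict) -> str:
    """Render the code and Markdown cells of a notebook dict as a percent-style script."""
    chunks: List[str] = []
    for cell in notebook["cells"]:
        source = cell["source"]
        if isinstance(source, list):  # nbformat allows a list of lines or one string
            source = "".join(source)
        if cell["cell_type"] == "code":
            chunks.append("# %%\n" + source)
        elif cell["cell_type"] == "markdown":
            # Markdown becomes comments so the file stays a valid Python script.
            commented = "\n".join(
                ("# " + line) if line else "#" for line in source.splitlines()
            )
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    example = {
        "cells": [
            {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
            {"cell_type": "code", "source": "x = 1"},
        ]
    }
    print(to_percent_script(example), end="")
```

The result is an ordinary `.py` file, which is what makes line-based diffs and code review workable in a plain-text workflow.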

Quarto has its own variation of that. This is the QMD, or the pure Markdown representation of the notebook that we've been showing for the whole talk, the Palmer Penguins notebook. You can see it has the metadata options, it has code chunks, it has Markdown, other code chunks, cell options, all the same things you saw in Jupyter, just in a text file. We have a VS Code extension you can use, too. It gives you some of the similar affordances that you would expect in a notebook, like being able to run a cell, or run cells above, or run the next cell, or have code completion, and things like that.

I think it's important when we talk about notebooks to think about it as kind of an ecosystem, a standard container, and there's lots of different ways to author for them and work with them that aren't tied to the traditional .ipynb format.

There's another scenario which I think is really interesting, where you have a notebook that has a bunch of computations, and it maybe has the results of predictions, or it has figures, or it has tables, and you want to use that in other documents, but still not lose the connection back to the original notebook. Here's an example of a notebook that will not actually be published. This notebook will not be the main document that's published, but rather its outputs will be embedded in another document. You can see here this embed shortcode. Here's a QMD document. It's got narrative, and it's got this embed, which says please embed a cell from an external notebook right here. When you do that, and then publish it, it retains the connection to the notebook. You can see, I've published this document, but there are source notebooks behind it that you can link to, and here you can see for this particular cell, there's a source cell for this figure that you can go look at.

Here you may want to, for decision makers, present a narrative and figures, but then you want people to be able to go inspect the code. When you click that link, you end up navigating into the notebook view, and here you can see the code and all the narratives in the notebook, or even download the notebook if you want to run it and inspect it locally.

We're actually working with some of those other projects I mentioned earlier, Jupyter Book, MyST.js, with Sloan and AGU, trying to find a standard to let notebooks be a standard part of the scientific record: let them participate in peer review, preserve them as source materials all the way through the process, and actually have the code and the notebooks be included in archives. We're very excited about that.

Extending and hacking the system

A little bit to finish up about how the system works and how you can customize and hack the system. A little bit first about how we interface with Jupyter kernels. We execute code cells using Jupyter. All Jupyter kernels are supported. I would say that Python and Julia are most widely used, and they have probably the best tooling, but I've seen people use the C kernel, the Bash kernel, the APL kernel, the SAS kernel, lots of different kernels that I've seen people use with Quarto.
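Which kernel a notebook expects is recorded in the notebook file itself, under `metadata.kernelspec` in the nbformat schema. The sketch below simply reads that field; the function name and the `python3` fallback are assumptions for illustration, and how Quarto itself resolves the kernel is not covered here.

```python
def kernel_name(notebook: dict, default: str = "python3") -> str:
    """Return the kernel recorded in a notebook's metadata, or a fallback."""
    # nbformat stores the kernel under metadata -> kernelspec -> name.
    return notebook.get("metadata", {}).get("kernelspec", {}).get("name", default)


if __name__ == "__main__":
    julia_notebook = {
        "metadata": {"kernelspec": {"name": "julia-1.9", "display_name": "Julia 1.9"}},
        "cells": [],
    }
    print(kernel_name(julia_notebook))
    print(kernel_name({"cells": []}))
```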