Keynote, JJ Allaire: Reproducible Manuscripts with Quarto

Transcript

This transcript was generated automatically and may contain errors.

Thank goodness for Wikipedia. I'm able to introduce our next speaker. I've learned a lot from him in face-to-face contact. I didn't know, though, that JJ Allaire is a creator of a blog publishing product called Windows Live Writer, initially released in 2007, distributed by Microsoft as part of Windows Essentials. And in 2008, he co-founded FitNow, a company dedicated to mobile health. 17 million users. And now, he will teach us about Quarto. Thank you, JJ, and welcome.

Thank you very much. It's great to be here. My subject today is Reproducible Manuscripts with Quarto. This is not going to be a Quarto 101 talk, because I bet a lot of people here know a decent amount about Quarto or R Markdown. I will cover some foundational basics, but I really want to tell you about some brand new things that we're working on with Quarto that are aimed at reproducible scientific publishing. So, that's going to be the bulk of it.

But I will go a little bit into what Quarto is, kind of the philosophy of the project, the goals of the project, not so much the mechanics of how it works. Talk a little bit about the relationship between notebooks and Markdown and scientific publishing, and then talk about this new set of capabilities that we call Quarto Manuscripts.

What is Quarto?

So, most of you probably already know this, but the simplest way to think of Quarto is as the next generation of R Markdown. We saw R Markdown in the last video, and I think all of you know what that is. Quarto really has two goals. One is to make R Markdown work across different computing ecosystems: the Jupyter ecosystem, the R ecosystem, the Julia ecosystem. So, broaden it in terms of computational environments. The other is to take the things that we learned over the first 10 years of working on R Markdown and create a next generation of those capabilities.

So, again, it's an open source scientific and technical publishing system. We support quite a few different computational environments: knitr and R, Jupyter, and Observable JavaScript. We actually have some folks from the Julia community working on adding a Pluto engine right now. So, the idea is that we're going to build all these publishing capabilities, and then there will be lots of different compute engines over time, and we'll be able to work with all of them rather than being tied to a single one of those. Obviously, the underlying Markdown engine is Pandoc, but we've added lots of enhancements for scientific publishing. Pandoc obviously has some very robust citation capabilities, but we've added lots of other things. And then we've tried to let you create lots of different types of output: simple documents, but also presentations, blogs, websites, books, etc.

So, lots of other projects have had similar goals and similar features. I list a few of those at the bottom. We're very much indebted to all that work and continue to learn from those projects as we build Quarto. Again, I've covered most of this: it's an open source project, sponsored by Posit, building on R Markdown. And I would say it was frustrating. We really had a lot of conviction that the ideas behind R Markdown were sound and useful and had the chance to positively impact how science was done, but it was frustrating that the tools could only be used in the R ecosystem. Our observation was that Jupyter was certainly quite broadly used, and there were other new environments like Julia, and we wanted our work to be applicable across all of those. So, that's really how we got started working on Quarto, which was about three or four years ago.

And in terms of the really high level goals of the project, this is one I don't need to explain or sell very much here, but make scientific work more reproducible and more automatable.

Another thing we like to think about is tools for writing scientific manuscripts. With Word, you can see the blue line: it starts off very easy, and very quickly it gets totally out of control. LaTeX is harder to start with, but once you learn the basic mechanics of it, the curve is actually relatively flat. It scales really nicely. Traditionally, Markdown has been a mixed story. It's simple at first, and then you try to do advanced things and you get into all kinds of weird hacks. So, with Quarto, what we really want, and I'll show some work that I think aims at this, is an environment that is as close to Word as possible in terms of the initial ramp, but that also has the scaling characteristics of LaTeX, so that as you do more and more complicated things, it grows seamlessly with you.

And then, of course, there's this idea of single source publishing, which the manuscripts feature will really underscore: we want to author our content in a universal format that can be repurposed across lots and lots of different mediums. Because obviously, with scientific papers, it's still very convenient to have them as a PDF, but it's also great to have them on the web, on mobile, and in other places.

Okay. I think I've mostly covered this. Why did we create a new system? It's really just this idea that the languages and runtimes used for scientific discourse are very broad, and we really wanted something that could match that. The idea is to build all these deep publishing capabilities, Markdown capabilities, and tooling once and make them very broadly applicable.

So, as I mentioned before, there are several different compute engines. The knitr engine essentially gives us very, very close compatibility with R Markdown. Most RMD documents can just be run in Quarto without modification. Jupyter is another engine. And then we've done something with Observable JS, and as I said, others are possible, and there is active work on a Pluto engine for Julia.

And the way the knitr engine works, this is the exact same diagram we used to use to explain R Markdown, but instead of an RMD, there's a QMD. It uses knitr, which makes it very compatible with all the existing RMD files. One difference to highlight is that chunk options are generally written as YAML inside the chunk rather than inline in the chunk header. And there's an example of a knitr engine document. It's very familiar for people who've used R Markdown: you can see cross-references in here, and you can see the label and caption in the code chunk.
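
As a minimal sketch of what such a knitr engine document might look like, the Python snippet below assembles a small QMD source as a string. The title, the `fig-scatter` label, the caption, and the `plot(cars)` chunk are illustrative assumptions, not the example shown on the slide. The `#|` comment lines are how Quarto expresses chunk options as YAML, and `@fig-scatter` is a cross-reference to the labeled figure.

```python
# Sketch: build a minimal knitr-engine .qmd document as text.
# All names (title, label, caption, plotted data) are hypothetical.

FENCE = "`" * 3  # code-fence marker, built here to avoid nesting fences


def build_qmd(label: str = "fig-scatter",
              caption: str = "Speed versus stopping distance.") -> str:
    """Return the source of a small .qmd file with one labeled R chunk."""
    lines = [
        "---",
        'title: "Example manuscript"',
        "---",
        "",
        # Cross-reference to the figure produced by the chunk below.
        f"See @{label} for the relationship.",
        "",
        FENCE + "{r}",
        # Chunk options are YAML written inside '#|' comments.
        f"#| label: {label}",
        f'#| fig-cap: "{caption}"',
        "plot(cars)",
        FENCE,
        "",
    ]
    return "\n".join(lines)


qmd_source = build_qmd()
print(qmd_source)
```

Writing `qmd_source` to a file and rendering it with Quarto would be the next step; the label uses the `fig-` prefix so that Quarto treats the chunk output as a cross-referenceable figure.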

So, that's the knitr engine. The Jupyter engine actually has two different modes. One is you can use a QMD, which is like an RMD, so it's plain text. You can use any text editor to edit it, and it works very similarly to RMD. In that case, what happens is we take the QMD, we actually turn it into a Jupyter notebook, and then run it through our engine. So, that's extremely analogous to how RMD works. And there's another modality, which is I have an existing Jupyter notebook, and I just want to render it, and that works too. So, in that case, you use all the tools that you normally use for editing Jupyter notebooks, whether it be JupyterLab, VS Code, Google Colab, or what have you, and then you can put it through the same pipeline.

Now, in this case, by default, we do not re-execute the Jupyter notebook. This is one characteristic of Jupyter notebooks that people like a lot: for very, very expensive computations, you can control exactly when the computation is done, and by not re-executing a cell, you don't have to pay for the computation again. It creates reproducibility problems, so it's not a total panacea, but it's certainly something that lots of folks in the Jupyter ecosystem take advantage of.
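
The default described here can be sketched as a small helper that builds the render command. This assumes the `quarto` command-line tool with its `render` subcommand and `--execute` flag; the helper name `render_command` and the notebook file name are hypothetical.

```python
# Sketch: build the command line for rendering a Jupyter notebook with Quarto.


def render_command(path: str, execute: bool = False) -> list[str]:
    """Return the quarto invocation for `path`.

    For an .ipynb input, Quarto reuses the outputs already stored in the
    notebook unless execution is requested explicitly.
    """
    cmd = ["quarto", "render", path]
    if execute:
        cmd.append("--execute")  # ask Quarto to re-run the cells first
    return cmd


# Hypothetical notebook name, used only for illustration.
print(" ".join(render_command("analysis.ipynb")))
print(" ".join(render_command("analysis.ipynb", execute=True)))
```

Either list could be passed to `subprocess.run` on a machine where Quarto is installed. Keeping execution opt-in mirrors the trade-off in the talk: expensive cells are not re-run, at the cost of trusting the stored outputs.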

And then, in terms of tooling, we have the ability to do side-by-side preview with VS Code, Jupyter notebooks, and things like that. We have a JupyterLab extension for Quarto, and we have a VS Code extension for Quarto. There's a Neovim mode that's pretty good, and there's some integration with ESS as well.

So, hopefully, providing all these writing tools and the visual editor will get more people willing to engage with this toolchain, and get more people working in a way that has computational integrity from end to end.

Composing manuscripts from multiple notebooks

All right, this is actually really important. One of the things in R Markdown documents, and I think many of you probably have different workarounds for this, is that everything happens in the main document. All the computations go from the top down in the main document, but in a lot of actual scientific papers, there are many subcomputations and subprojects that you want to compose together in a paper. So, what I'm going to show here is basically making a separate QMD file, and also a separate Jupyter notebook, and then incorporating them into the main document. And then, again, they're completely navigable and referenceable from the main document.

So, I'll show that quickly here. So, here, we'll make a directory called notebooks. We will put a QMD and a CSV file in there that we've already written. So, this is just a separate kind of sub-document that I'm using to explore the earthquakes data. And we'll render that. And once that's happened, this is now something I can use in the main body of my paper. And so, you can even imagine teams of people where I'm just going to work on this notebook. It's going to create this table. I'm going to give it the right label and caption, and then you can just pull that in.
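
A minimal sketch of the layout described here, created in a temporary directory so it can be run safely. The file names (`index.qmd`, `explore-earthquakes.qmd`, `earthquakes.csv`) are hypothetical stand-ins, and the files are left empty rather than inventing their contents.

```python
# Sketch: lay out a manuscript with a separate notebooks/ directory.
import tempfile
from pathlib import Path


def create_layout(root: Path) -> list[Path]:
    """Create empty placeholder files for the main document and a sub-notebook."""
    notebooks = root / "notebooks"
    notebooks.mkdir(parents=True, exist_ok=True)
    files = [
        root / "index.qmd",                     # main manuscript (hypothetical name)
        notebooks / "explore-earthquakes.qmd",  # exploratory sub-document
        notebooks / "earthquakes.csv",          # data read by the sub-document
    ]
    for f in files:
        f.touch()
    return files


with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(Path(tmp))
    # Paths relative to the project root, in sorted order.
    relative = sorted(p.relative_to(tmp).as_posix() for p in created)
    print(relative)
```

Keeping each exploration in its own file under `notebooks/` is what lets a collaborator own one notebook while the main document only pulls in its labeled outputs.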

So, here, you can see I'm going to introduce an embed. We have this shortcode, embed. It says embed a cell from a separate QMD that I have. So, here, we're going to reference the QMD, and then we're just going to use the ID. And then, when we render this, it's going to pull that in from the other QMD and make it part of the main paper.
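
The embed step can be sketched as follows. The `{{< embed file#id >}}` form is Quarto's embed shortcode; the notebook path and the `tbl-earthquakes` cell label are hypothetical, since the transcript does not give the actual names.

```python
# Sketch: format a Quarto embed shortcode and a matching cross-reference.


def embed_shortcode(notebook: str, cell_id: str) -> str:
    """Return the shortcode that pulls cell `cell_id` from `notebook`."""
    return "{{< embed " + f"{notebook}#{cell_id}" + " >}}"


def cross_reference(cell_id: str) -> str:
    """Return the inline reference to the embedded table or figure."""
    return f"@{cell_id}"


notebook = "notebooks/explore-earthquakes.qmd"  # hypothetical path
cell_id = "tbl-earthquakes"                     # hypothetical label

# Text that would go into the main document: a reference, then the embed.
paragraph = (
    f"The table is shown in {cross_reference(cell_id)}.\n\n"
    f"{embed_shortcode(notebook, cell_id)}\n"
)
print(paragraph)
```

The same ID serves both purposes: it selects the cell to embed and it is the target of the cross-reference in the main paper.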

So, you can see how that works here. There's the table. That was, again, created in a different notebook. If I want, I can actually go and explore that notebook and see how that table was computed. Similarly, you can imagine a colleague. Here I'm referencing the table. Right, and that'll number it and resolve the reference and do what you expect here.

And so, really, for working on kind of more sophisticated and non-trivial manuscripts, being able to compose tables and figures and computations from lots of different sources is really quite useful. And then, also, preserving them through the publishing process. So, there you can see that's the visualization from the Jupyter notebook. And if I'm reading the article, and I say, well, how is that visualization created? Or there, how is this table created? You can see the table was created using this Explore Earthquakes. And then, how is this visualization created? And I click the data screening, and I see the Python code that was used to create the visualization.

So, that's a pretty important idea in sort of having more flexible and composable and cross-language workflows for creating these manuscripts, while also still preserving reproducibility.