
Can I Have a Word? - posit::conf(2023)
Presented by Ellis Hughes
Since its release, {gt} has won over the hearts of many due to its flexible and powerful table-generating abilities. However, in cases where office products were required by downstream users, {gt}'s potential remained untapped. That all changed in 2022 when Rich Iannone and I collaborated to add Word documents as an official output type. Now, data scientists can engage stakeholders directly, wherever they are. Join me for an upcoming talk where I'll share my excitement about the new opportunities this update presents for the R community as well as future developments we can look forward to.
Presented at Posit Conference, between Sept 19-20 2023. Learn more at posit.co/conference.
Talk Track: Elevating your reports. Session Code: TALK-1156
Transcript
This transcript was generated automatically and may contain errors.
Hi, everybody. My name is Ellis Hughes. I'm a data science leader at GSK. So before we go forward, I just want to say all opinions are my own and do not necessarily represent that of my employer. All right. You can also find me on various social media sites such as LinkedIn, YouTube, GitHub. And today I'll be talking about gt and how I went through adding Word as an output to gt.
So gt is a fantastic library made by Rich Iannone to help you make fantastic, beautiful tables very, very easily. All you have to do to get started is library(gt), drop in your data set, put it in gt, and for free, you're going to get a pretty good table just with that one line of code there. Text is left aligned, numbers are right aligned, and it looks pretty good just from the get-go there. But the power of gt isn't this one function. No, it's got a lot more to it.
In fact, you can make some pretty fantastic tables using gt. There's a ton of syntax, a lot of very easy to use functions that allow you to add a lot of complexity to your tables. These are two examples from the Posit table competition that was held earlier this year. I really like how Georgios, the one on the left, had images included in his table. He has the little symbols, the icons. Rich talked a fair amount about icons a little bit earlier today. So you know he loves them. So you know he loved this table. And then Nicola, she included graphs inside of her table, inside of her gt. Tables are a fantastic data viz tool. We just need to use them better.
Saving gt tables and Word output
So we can do some pretty crazy things with gt. But you typically don't want to leave them inside of your console. You don't want to just build them inside of RStudio or wherever you're working. You need to save them and give them out to other people. And that's where the gtsave function comes in, where you pass in your table object, your gt table object, and you pass the file that you'd like to be going out to. In this case, we're going out to HTML, but it supports a wide variety of different outputs, RTF, LaTeX. You're able to generate PNGs. It does some pretty, pretty fantastic things.
And as of last year, August 2022, in 0.7.0, Word was added as an output to gt. And that was work that I did with Rich. And so now, if you want to go and save your gt to any sort of output, rather than going out to HTML, you simply replace the file extension with .docx, and it works, and you've got your table. Thank you.
How much time do I have left? About 17 minutes. Oh, okay. Well, I do have some backup slides, if you wouldn't mind giving me a moment to go through that as well.
Why stakeholders want Word documents
So our roles as data scientists are to be reaching out and working with decision makers, stakeholders, people that really need to be making decisions. And so it's our responsibility to be pulling information out of the data and giving it to them to use. And so you're a data scientist at company A, and you're looking at a regression. Here we're using mtcars and looking at mpg. But this could be any sort of analysis you're performing. And you generate this output there. You're like, great. We know what we're supposed to be doing. We've got a fantastic output. But you shouldn't be sending this to your decision makers. They have no idea what that means. They're not data scientists.
So it's up to you to figure out how to show it to them. But you're also not like other data scientists. You're a cool data scientist. And you've got all these fantastic tools available for you to be going out to them. You can make interactive plots. We've had a fantastic presentation on that. You can use Quarto. You can make plumber APIs, R Markdown. There's all these wonderful tools available for us as data scientists, especially R users, to make fantastic outputs.
The thing is, you give it to them, and they go, that's great. But if you could give it to me as a Word document, I'd like that better. So why are they asking for a Word document? You just gave them a beautiful HTML document. You made them a wonderful flexdashboard that they could be using. Well, the issue is, your boss works there. Well, and if he doesn't, your boss's boss works there. The finance team, marketing team, let's face it. Basically, every team outside of data science is using Word. Because it's easy to use. You can write a lot of text. You can copy and paste content in there. You can write comments. You can do revisions. There's so many tools that exist in Word for you to be using. And they're not data scientists. That's where they're comfortable. So it's our responsibility to meet them where they're at.
How the Word output came to be
So let's not worry about the details of the slide here. But really, what this is brought up to say is that I was working on a project where we were trying to build some data sets that were able to go out to a wide variety of different outputs there. And our team created a package called tfrmt to facilitate that. And as part of its engine, we wanted to be using the gt package because of the variety of different outputs it already had baked in for us. We were really stoked on that one. However, we had two types of output that we needed to support. Submissions and in-text CSRs, which are essentially Word documents.
And at the time that we started working on this, this was before I got involved, there had been this issue that had been open for three years. The possibility to create an as_word function. And it's not like it had just been sitting there. People had left comments. People left comments. People left comments. And work had been going on. But, you know, it's a difficult problem to solve. So, I reached out to Rich. I'm like, hey, Rich, this problem seems like it's been there for a while. Would you mind if I were to help you out? Let's talk. Let's figure out what we can do to make this happen. So, that's how I got involved in helping Rich go out to Word.
So when I got involved in gt, it had been around for a number of years. Rich had introduced gt to the world in 2019. It was actually the first RStudio conference that I went to. So, it was really exciting to be able to participate in that. It was at gt version 0.6.0. And it already supported a wide variety of outputs. It went out to HTML, PNG, LaTeX, and RTF. And for every single one of those outputs, he'd created a whole function to process that table and convert it to that output. He figured out every single time for every single output how it would go. And so, the same had to happen for Word. And he'd done a fair amount of work on that already. He'd already made 50 commits. He'd done 1500 lines of code already and had a functioning as_word function in there.
Understanding DocX and Office Open XML
So, before we go into how that all works, let's talk about Word and .doc versus .docx. So, some of us, maybe the older ones of us, will remember the .doc format that existed for a long time. So, the .doc format is a binary file. It's a proprietary format. But really, it hasn't been used since 2007, for our benefit. The .docx file, however, offers a lot more functionality for us. Under the hood, it's actually a zip file. You can actually right click and unzip your Word document and see all the contents that's in it. It's based on the Office Open XML format, which is public and available and you can find it online. And it was introduced in 2007. And because it's like that, we can also go out to .docx.
So, yay. Once you unzip it, you'll see a whole mess of files. And realistically, you can kind of ignore most of them. The only file that you need to be concerned about is the document.xml file. And that is where the body of the contents of your Word document gets contained. There's a lot of other XML files. If you're doing more complicated Office Open XML type work, you will need to deal with them. But for our purposes in this talk, let's just worry about talking about the body there.
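To make the "a .docx is a zip file" point concrete, here is a minimal sketch using only the Python standard library (Python is used purely for illustration; the talk's own tooling is R). It assembles a tiny Office Open XML package in memory and then reads the body back out of word/document.xml. The three-part layout and the helper names (`build_minimal_docx`, `list_parts`, `read_document_xml`) are assumptions made for this sketch, not something shown in the talk, and a file saved by Word contains many more parts. Whether Word itself accepts this stripped-down package has not been verified here.

```python
import io
import zipfile

# Declares the content type of each part in the package.
CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

# Points the package at its main document part.
RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

# The body of the document: one paragraph containing one run of text.
DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body><w:p><w:r><w:t>Hello World</w:t></w:r></w:p></w:body>
</w:document>"""


def build_minimal_docx() -> bytes:
    """Zip the three XML parts together into an in-memory .docx package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", DOCUMENT)
    return buffer.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the file names stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def read_document_xml(docx_bytes: bytes) -> str:
    """Return the XML of the main body part, word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

The same `list_parts` and `read_document_xml` helpers should work on the bytes of a real Word file, for example `open("report.docx", "rb").read()`, where `report.docx` is a hypothetical file name.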
So when I got involved, I was like, oh, XML? I've worked in XML before. I've done web scraping. I've played around with these things before. And I know HTML. So, it's not really going to be that big of a deal. But it was. In HTML, you get to see these elements that are very clear. They're very well defined. Not that Office Open XML isn't. But it's got the style and then it's got the additional information about styling. And then you have hello world inside of it. So, if I wanted a pink hello world, this is the only line of code you need to be writing in HTML. This is what you need to write if you want to go out to a Word document pink and bold. There's a fair amount more code. And this is just for the text. This isn't the document body or any of the other XML that's involved with Word.
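The slides themselves are not part of this transcript, so the following is only a sketch of the kind of contrast being described, not a reproduction of them. It builds a bold, pink "Hello World" as a WordprocessingML paragraph with Python's standard library and keeps a one-line HTML equivalent alongside for comparison. The HTML string, the hex value `FFC0CB` for pink, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

# Assumed HTML counterpart: one element carries both the style and the text.
HTML_VERSION = '<span style="color: pink; font-weight: bold;">Hello World</span>'


def w(tag: str) -> str:
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def bold_colored_run(text: str, hex_color: str) -> ET.Element:
    """Build a run (w:r): run properties (w:rPr) first, then the text (w:t)."""
    run = ET.Element(w("r"))
    props = ET.SubElement(run, w("rPr"))
    ET.SubElement(props, w("b"))  # bold toggle
    ET.SubElement(props, w("color"), {w("val"): hex_color})  # RGB hex, no '#'
    text_node = ET.SubElement(run, w("t"))
    text_node.text = text
    return run


def pink_bold_paragraph(text: str) -> str:
    """Wrap the styled run in a paragraph (w:p) and serialize it to a string."""
    paragraph = ET.Element(w("p"))
    paragraph.append(bold_colored_run(text, "FFC0CB"))
    return ET.tostring(paragraph, encoding="unicode")
```

Even in this reduced form, the styling lives in separate child elements rather than in a single attribute, and the resulting paragraph still has to sit inside the body of document.xml before Word can show it.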
Trial, error, and learning
So, the way that I went about learning this, and this is really what I suggest to you as well, if you're going to be involved in a project, is take some time to get some understanding of the surrounding area. What's been going on? So, I started reading officeopenxml.com. It's a fantastic resource. If you want to learn about Office Open XML, which, I don't know, that's up to you, it's a really great resource on that. Also, while I was working on this, it was really just a lot of trial and error. I would build it out in a Word document the way that I thought it should look, save it, unzip it, inspect it. Again, go try it, code it up myself, try to open it up, and I got a lot of these. And a lot. And a lot.
And this is something that you may experience, too, as you start to go and help out with new projects, is you will go through some errors, you will have some frustrations, but I promise you at the end of the day, when you see this big blue box saying Word is opening, or whatever it is for you, you're going to be pretty excited. I did it. And so, through a lot of trial and error, a lot of effort, a lot of those error screens, and eventually some more of those blue screens, and eventually more blue screens than errors, we ended up adding Word to gt.
What's new and what's next
So, now going back to our data scientist, they are a cool data scientist that will meet their leaders where they're at to help them do their work. So, they're going to take their linear model, throw it into broom to tidy it up, make a nice little tibble there. Then they're going to throw it into gt, add a header to it, add some coloring to it, and save it to a .docx, where they can add a little bit of their conclusion at the very bottom there. Saying, you know, if we want to increase the number of cylinders and the overall weight of the car, we can make a vehicle with a negative mpg. And give that to their senior leaders to make a decision as to whether that's, you know, what they want to go with.
So, words shape thoughts, MS Word shapes documents. Everyone uses it. So, gt and the efforts around Word are still evolving. It's only been out for a year. We're still trying to build out a lot of features. It's not my full-time job. But recently, we added some fun new features to support formats that gt supports in HTML. We're not going to be HTML. HTML is fantastic. It does a lot of things. There's a reason we all love it. But Word does a fair amount of things, too. And we wanted to try to support that. So, first off, we now support markdown. So, you can write inline markdown inside of your tibble. And it will go out and throw it into your format gt. And you can still save to a .docx. That works. You can also add images to your Word document tibbles. Or gt tables as well. So, it supports a variety of those things.
However, the plethora of formats that Rich has built in, they're not necessarily supported. So, your mileage may vary. Additionally, styling does need improvement as well. We're aware. Still working on that. Such as tab_options. But the way that we can know how we can help you and the things that you're looking for is letting us know on the gt page. Open an issue, tag it as a Word issue, so that we can help make sure that the contents that you need in gt are available. So, you can work with your decision makers to support them in the decision making that they need to be doing.
And, you know, we're experimenting. We're going to be figuring out what are the other things that we could support. Now that I have some experience with Office Open XML, hopefully you'll join me and also play around with that. You know, there's other outputs that Microsoft supports, such as, I don't know, PowerPoint. Would you be interested in that? XML? Or Excel? I don't know. Let us know. Next, there's a package that I developed as a companion to gt to support folks that like using officer, called the gto package. So, the gto package allows you to use the body_add_gt function to take your gt tables and add them into your officer pipeline without doing anything new or different or scary. And so, with that, can I have a word?
Q&A
Thank you. Congratulations on being a cool data scientist. As always, you can ask questions via Slido. I'm going to ask one that is not gt related, but I find it very interesting that you sort of walked in and contributed such a big feature to a package like gt. So, do you have any advice on anyone looking at this like, ooh, I want to contribute like this?
Well, I mean, some of it's just getting started. Honestly, I didn't have really any experience with Office Open XML before I got involved in this project. I'd kind of seen it before, but I wasn't familiar with it. But I took a leap of faith that, you know, Rich is a fantastic guy. I loved working with him. And most maintainers of open source tools are pretty fantastic people, especially in the R community. And so, if you're willing to put in the effort, to put in the time, trial and error, work with them, eventually you can build something that will, you know, change the package. But even still, it doesn't have to be like a big contribution, right? Like you can incrementally add things, documentation, fix a bug or whatever. Really, open source is a beautiful thing where you can just help out with the tools that you're using.
Well said. So, we do have a question about, do you have any plans to replicate this for PowerPoint? For PowerPoint? Open an issue. The XML is a little bit different for PowerPoint. So, it will take a little bit of time and rejiggering. And as I mentioned earlier, rewriting how exactly you do all the styling for PowerPoint outputs, but it's possible in theory.
Okay, excellent. I think you did such a good talk that every question is answered for right now. Fantastic. Thank you very much. Thank you very much.

