Resources

Bold indicates negative? (Luis D. Verde Arregoitia, Instituto De Ecología, A.c.) | posit::conf(2025)

Bold indicates negative? Speaker(s): Luis D. Verde Arregoitia Abstract: Over one billion people worldwide use spreadsheets to manage and analyse data, often styling cells and their contents to highlight or distinguish values. This formatting is often used to encode additional data (for example, indicating groups with colors) but most data science tools are unaware of data expressed as formatting. This talk summarizes my progress addressing the gap between formatted spreadsheets and the modern data stack. First, by championing data organization best practices, then by bringing spreadsheet contents and their formatting together into R for further analyses (unheadr package), and finally how translating format to data helped me develop tools (forgts package) for converting formatted spreadsheets to gt objects. Slides - https://liomys.mx/posit25.html posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

All right, how are you, everyone? And thank you for being here.

So, let me start with this post from social media that many of you might relate with about offering to help someone with their data. And then we encounter something slightly ridiculous, in this case, bold equals negative.

Literally meaning that in a data set or a table, instead of actual minus signs, the negative values were indicated with bold text. That happens.

So, we start to wonder, who does that? Right? Who's doing that?

But before we get to that, let me define what that refers to.

And that refers to encoding data as formatting, which is this very common practice of usually storing and sharing data, often in spreadsheets, and then using formatting to intersperse and layer data where it probably shouldn't be.

So, let's look at one example. This is from the real world, just a little simplified, where we get the classic bold indicates negative values for this permeability variable, and also some cell colors, because why not?

Let's look at another example where we have different colors for the text being used to indicate the units for the variables. So, you have black for pounds, blue for kilograms, and then we have further information about one of these variables, the one that says microchipped, encoded as the cell color or cell highlight color or background color, whatever we call it.

So, now we can start to wonder, who does that? And the answer is, a lot of people, because there are over a billion spreadsheet users out there in industry and research and government. And anytime anyone has looked at public collections of spreadsheets, myself included, between 25 to 60% of public spreadsheets use either cell or text formatting or both, and it won't always be used just for emphasis or decoration.

So, a lot of the time, with so many spreadsheets out there, data will be encoded as formatting. It's just hard to know, or it's hard to identify that programmatically, so I can't have that figure for you.

And this is happening partly because we go into data analysis with the tools we know or the tools we know about, not necessarily the tools we need, and a lot of the time, this happens to be spreadsheet software.

What we can do about it

So, before anyone starts getting too outraged, let me revisit the data from the Tibet Clinic and show you what's actually possible and something we can do right now in R. So, let's look at this table or this spreadsheet. It just looks nice here. And specifically, the two problematic columns with the formatting that we see there.

I'll show the code later on, but we can actually do this now. So, import the cell values, import the formatting, and it gets, we convert it into these notations inside parentheses, telling us which of the values are the text is bold and colorful. This is ARGB color code. And for the other variable, we know now which of these rows or which of these values has the background highlight color.

And with a little bit of data wrangling, we can actually arrive at a more analysis-ready, machine-readable version of the data. The information that was encoded only in the formatting, it's now made explicit in its own variable, like units. And the other piece of information that was encoded through highlight color or cell background fill is now also made explicit in its own column, chip status here.

So, by now, I hope we all agree that using data as formatting is not necessarily the best idea.

First of all, because it doesn't play very well with the modern or the current data science stack, regardless of programming languages or tools or whatever. And also problematic is that this introduces accessibility issues. Formatting like that is not good for screen readers. It's also not good for colorblind people. So that's an issue.

And again, these kinds of things happen because spreadsheets, they relax the rules of data organization. As humans, we can typically look at ambiguities and resolve them either through context or just from our prior knowledge. And machines, not so much.

But now let me tell you what we can do, what we are doing already, what we can do moving forward. And I feel like what we do and what we can do tend to fall into one of the following three categories or approaches.

So the first and the more ideal one would be to prevent this thing from happening in the first place. To me, it'd be great if school children and college students and our own colleagues would read these kinds of resources. And in particular, this guide here about data organization in spreadsheets, which I consider to me the ultimate reading material for organizing data in spreadsheets. And this particular resource includes good advice of not using data in formatting.

But that's not always the case. Another example or another approach would be to just ignore the problem. And we can maybe in a particular project, in a particular field, in a particular job, we might not even encounter spreadsheets, let alone colorful formatted spreadsheets. So in that case, we could be like the guilty dog there. If we don't look at it, it's not there. So it's not really our problem and we can just move on with whatever we need to do.

But that's also not always the case. And then we arrive at the third sort of broad category of either having to go and fix things manually, edit the files ourselves, ask for them to be reworked and then sent back to us. Or if we really want to do it programmatically in R or Python or whatever, then we get these numerous case-by-case approaches like this one here.

This is part of a two-part blog series, quite recent actually, and I liked it, where this involved edits to the underlying C code to read Excel to be able to pull in and deal with conditional formatting. So not something a lot of us can do. Maybe here, yes, but not me personally. Or this other example I have here from last year where poor Jeremy was loading in like 15 different packages and having to write a ton of code to deal with what was a fairly simple table except it was made problematic because of data encoded as formatting.

Community tools and my own contributions

And these are just a few examples, but there's tons of these kinds of things, of people working through formatted spreadsheets with whatever tooling they can get their hands on in R.

But I'm not here to make fun or take cheap shots at spreadsheets and their users. Rather, I want to mention now what the community has been doing to move forward and to actually enable us to work with this, including some of my own contributions.

And I'll start with this one from myself. So this is a blog post from 2016, so about nine years ago, the first of many blog posts of mine that used, well, the first of many blog posts of mine on the topic. And at the time, this was a bit created out of frustration, but I tried to be nice when I wrote it. So I just mentioned three bad things, three things, bad practices, or three things to avoid altogether, including not using the whole formatting as data.

And at the time, I didn't think much of it, but this post itself was quite well-received, and I realized that it was the content and the principles were starting to be incorporated directly into proper data science teaching materials, which was quite motivating for me.

And later on, we formalized that into the scientific publication. A bit niche and aimed at not just biologists, but mammal biologists, but still, I'm pretty proud of this work.

And in parallel during this time, I also tried to come up with some tools of my own to deal with messy data. So this is my unhead.r package, the first package I developed. It was, I think, released initially in 2018. It started as a package for cleaning messy datasets, and later on, I added some functions to expose the formatting in spreadsheets and just glue it to the cell values like we saw earlier.

And this is possible because I leveraged existing tools built by other very clever people, and because the formatting is there in the spreadsheet files. It's just hidden in underlying XML that we can parse and fiddle with, and then we can just expose that formatting and have it at our disposal.

And I guess through my work with the package and with me ranting or blogging about spreadsheets in my blog and social media, I was invited to contribute to this piece. It was a nice little article about coming up with tips for better spreadsheets from a data science perspective. And well, of course, when I'm quoted in this text, I made a clown out of myself about using color coding and then forgetting my color codes. But anyway, I just want to mention this because promoting and coming up and using good practices, it's still an ongoing topic. This is a follow-up piece from last month. So again, more things to consider before we get into spreadsheets.

And also in parallel, beyond just good practices and promoting things and trying to educate people, for the past 10 or 12 years, a lot of people in this community and in R and data science and programming in general have been developing and improving and maintaining some really nice tools that we can use for that. So now we can read spreadsheets, write spreadsheets. We can work with files that are in the cloud. And for the most part, there's a lot of very good documentation that we can learn from.

A real-world success story

And now let me bring together, or at least show you how this all came together for me with a very recent example in which using all this work, I was able to actually help others in a really satisfying way for me.

So earlier this year, a team of famous scientists that do some very cool research in Antarctica, they came to me, they have spreadsheets that look like this, except way wider, but just as colorful. And they had this nice little challenge. It said that on the days marked in yellow, we sampled two individuals from each aquarium. So they needed to know the sampling dates to account for the survival of the different individuals in a really high stakes experiment. And the information's there. So they recorded the sampling dates, except they recorded it through formatting.

And we could actually just go through the files manually. And okay, so that date has yellow, that's a sampling date. And we repeat that for all the experimental conditions and for all the days. And this experiment was many months long. So it's still doable, but not reproducible and prone to human error if we do it like that. And for once, I was able to tell this team, I actually know how to do it. I can start working straight away. And I'll email you the answer in 20 minutes.

And for once, I was able to tell this team, I actually know how to do it. I can start working straight away. And I'll email you the answer in 20 minutes.

So I'll show you what I, with a simplified example, how this works.

So if we have the spreadsheet that looks like that, we can just read it with Read Excel. And we get the cell contents. That's fine. The dates are mangled because, you know, spreadsheets. But we can fix that.

So if we use the unheader package I was talking about earlier, we can use this function called annotate mf, in this case, all, to work across all columns. And mf stands for meaningful formatting, which is another way to call this practice.

And we get something like I was showing you earlier. So we get the contents for the cells. And also, this little annotation in parentheses of which of these values had a highlight color. And that's the color code for yellow. So we have a table that looks like this, except way longer. We've identified which are the sampling dates, or at least which columns have the yellow in them.

Can do a bit of data transformation. So we reshape the data into a long format. Fix the dates, because we can fix the dates. And now we have a long format version of the data where we can actually now filter for the tags that indicate the formatting. If we have different colors, we could just filter through color codes and just arrive at what I was being asked, which is tell me the sampling dates for each experimental condition and each file.

So that was actually a proud moment of mine. And it seemed impressive to the team, but it was something I've been doing for years now, except for the most part, it was my own personal work. But I was happy to show this to other people, and then they incorporated that into their own analysis workflow, because they were doing survival analysis in R anyway. So that was pretty cool.

The 4GTS package

And it was all based on this simple principle of we import cell values, we import formatting, and we translate the formatting to strings and just glue it to the cell values. That gave me an idea last year, translate that instead to gt specifications, which led to the 4GTS package. All it does is converts a formatted spreadsheet to a gray table, or a gt object.

And that would look like this. The spreadsheet looks like this. I added some format, I added some color, I added underline, italic, bold. Read that file, and that there is an actual HTML gt object in my slide.

This is one piece of feedback I know of about the package from Kelly. It does work halfway. I've been using that for, I think, well, for almost a year I've been using it to create tables in reports and slides, and for teaching, so it works quite well. And a nice surprise, or update for everyone, is that thanks to the work of Fernanda Aguirre-Ruiz, 4GTS now also exists in Python. The logo is cooler, I think. We both work on it together. And it works as well, so this is code that actually was evaluated when rendering my slides. We get the same gt object. That's an HTML table, it looks pretty good.

So I'll wrap up saying that if we keep using and following and promoting good practices, and use existing tools, and share awareness of the existing tools, we can coexist with formatted spreadsheets, which are kind of like the fossil fuels of data, that they'll still be around for a while.

with formatted spreadsheets, which are kind of like the fossil fuels of data, that they'll still be around for a while.

And then I'll just leave you with the title and tagline for one of my other blog posts. Formatted spreadsheets can still work in R, it's not too late. Thank you.

Q&A

Or Chli, just clarifying that they missed something, which package do you use to extract formats from the spreadsheet? Tid.i Excel. Don't know if I mentioned that or? Just Tid.i Excel does everything for us, just need to parse the output from Tid.i Excel.

OK, wonderful. Next question, do you have tips on ways to politely educate colleagues? On the importance of formatting to record data?

Well, didn't you see the sad face in my blog post? It said, please don't do this, sad face. No, so it's I think the what seems to work is just empathy and being nice rather than being snarky. And maybe like identifying good resources like the paper I showed and just maybe print them out and leave them leave them somewhere. I don't know. It's what I've tried in my own work environments and it seems to be working. Just certainly scatter them on their desks. Yeah.

All right, thank you, Luis. Let's all thank Luis again.