Bold indicates negative? (Luis D. Verde Arregoitia, Instituto de Ecología, A.C.)

Transcript

This transcript was generated automatically and may contain errors.

All right, how are you, everyone? And thank you for being here.

So, let me start with this post from social media that many of you might relate to, about offering to help someone with their data. And then we encounter something slightly ridiculous, in this case, bold equals negative.

Literally meaning that in a data set or a table, instead of actual minus signs, the negative values were indicated with bold text. That happens.

So, we start to wonder, who does that? Right? Who's doing that?

But before we get to that, let me define what that refers to.

And that refers to encoding data as formatting, which is this very common practice of usually storing and sharing data, often in spreadsheets, and then using formatting to intersperse and layer data where it probably shouldn't be.

So, let's look at one example. This is from the real world, just a little simplified, where we get the classic bold indicates negative values for this permeability variable, and also some cell colors, because why not?
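As a minimal sketch of what decoding this convention involves, the snippet below assumes the bold flag has already been extracted alongside each cell value. The Cell class and the permeability numbers are hypothetical, made up for illustration, and are not taken from the spreadsheet shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One spreadsheet cell: the number as typed plus one formatting flag."""
    value: float
    bold: bool = False


def decode_bold_negative(cell: Cell) -> float:
    """Return the signed value, treating bold text as a stand-in for a minus sign."""
    return -abs(cell.value) if cell.bold else cell.value


# Hypothetical permeability column: the second and third cells were typed in bold.
permeability = [Cell(0.42), Cell(1.30, bold=True), Cell(0.07, bold=True)]
signed = [decode_bold_negative(c) for c in permeability]
print(signed)  # [0.42, -1.3, -0.07]
```

Once the sign lives in the value itself, the formatting can be dropped without losing information.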

Let's look at another example where we have different colors for the text being used to indicate the units for the variables. So, you have black for pounds, blue for kilograms, and then we have further information about one of these variables, the one that says microchipped, encoded as the cell color or cell highlight color or background color, whatever we call it.
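A rough sketch of undoing the unit-as-text-color convention follows. It assumes the font color of each cell is already available as a hex code. The color codes, the WeightCell class, and the weights are assumptions made for illustration only.

```python
from dataclasses import dataclass

LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)

# Assumed mapping from font color (ARGB hex) to unit: black = pounds, blue = kilograms.
UNIT_BY_FONT_COLOR = {"FF000000": "lb", "FF0000FF": "kg"}


@dataclass(frozen=True)
class WeightCell:
    """A weight as typed, plus the font color that silently carries its unit."""
    value: float
    font_color: str


def to_kilograms(cell: WeightCell) -> float:
    """Convert a weight to kilograms using the unit implied by its font color."""
    unit = UNIT_BY_FONT_COLOR[cell.font_color]  # KeyError flags an unexpected color
    return cell.value * LB_TO_KG if unit == "lb" else cell.value


weights = [WeightCell(10.0, "FF000000"), WeightCell(4.5, "FF0000FF")]
kilograms = [round(to_kilograms(w), 3) for w in weights]
print(kilograms)  # [4.536, 4.5]
```

Raising an error on an unknown color is a deliberate choice here: a color that is not in the mapping probably means another undocumented convention.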

So, now we can start to wonder, who does that? And the answer is, a lot of people, because there are over a billion spreadsheet users out there in industry and research and government. And anytime anyone has looked at public collections of spreadsheets, myself included, they have found that between 25 and 60% of public spreadsheets use either cell or text formatting or both, and it won't always be used just for emphasis or decoration.

So, a lot of the time, with so many spreadsheets out there, data will be encoded as formatting. It's just hard to know, or it's hard to identify that programmatically, so I can't have that figure for you.

And this is happening partly because we go into data analysis with the tools we know or the tools we know about, not necessarily the tools we need, and a lot of the time, this happens to be spreadsheet software.

And for once, I was able to tell this team, I actually know how to do it. I can start working straight away. And I'll email you the answer in 20 minutes.

So I'll show you, with a simplified example, how this works.

So if we have a spreadsheet that looks like that, we can just read it with read_excel(). And we get the cell contents. That's fine. The dates are mangled because, you know, spreadsheets. But we can fix that.

So if we use the unheadr package I was talking about earlier, we can use this function called annotate_mf, in this case annotate_mf_all, to work across all columns. And mf stands for meaningful formatting, which is another name for this practice.

And we get something like I was showing you earlier. So we get the contents for the cells. And also, this little annotation in parentheses of which of these values had a highlight color. And that's the color code for yellow. So we have a table that looks like this, except way longer. We've identified which are the sampling dates, or at least which columns have the yellow in them.
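The annotation step can be sketched as gluing a short description of each cell's formatting onto its value. This is not the unheadr implementation: the Cell class, the annotate function, and the exact "(highlight-FFFFFF00)" string format are assumptions chosen for illustration, and the real output may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Cell contents plus the formatting we want to preserve as text."""
    value: str
    bold: bool = False
    highlight: Optional[str] = None  # fill color as a hex code, if any


def annotate(cell: Cell) -> str:
    """Prefix the value with its formatting in parentheses; plain cells pass through."""
    tags = []
    if cell.bold:
        tags.append("bold")
    if cell.highlight:
        tags.append(f"highlight-{cell.highlight}")
    return f"({', '.join(tags)}) {cell.value}" if tags else cell.value


row = [Cell("12"), Cell("15", highlight="FFFFFF00"), Cell("9", bold=True)]
annotated = [annotate(c) for c in row]
print(annotated)  # ['12', '(highlight-FFFFFF00) 15', '(bold) 9']
```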

Then we can do a bit of data transformation. So we reshape the data into a long format. Fix the dates, because we can fix the dates. And now we have a long format version of the data where we can actually filter for the tags that indicate the formatting. If we have different colors, we could just filter by color code and arrive at what I was being asked, which is: tell me the sampling dates for each experimental condition and each file.
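The reshape, date fix, and filter can be sketched as follows. The sketch assumes the date headers were read as Excel serial numbers in the default 1900 date system (day zero is 1899-12-30) and that highlighted cells carry a "(highlight-FFFFFF00)" prefix. The table contents, column names, and tag format are hypothetical.

```python
from datetime import date, timedelta

EXCEL_EPOCH = date(1899, 12, 30)  # day zero of Excel's default 1900 date system
YELLOW_TAG = "(highlight-FFFFFF00)"  # assumed annotation for yellow-highlighted cells


def excel_serial_to_date(serial: int) -> date:
    """Turn an Excel serial day number back into a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


# Wide table: one row per condition, one column per date (headers read as serials).
wide = [
    {"condition": "control", "45292": f"{YELLOW_TAG} 10", "45293": "10"},
    {"condition": "treated", "45292": "9", "45293": f"{YELLOW_TAG} 7"},
]

# Long format: one record per (condition, date) pair, with the date repaired.
long_rows = [
    {"condition": row["condition"], "date": excel_serial_to_date(int(col)), "cell": cell}
    for row in wide
    for col, cell in row.items()
    if col != "condition"
]

# Keep only the highlighted cells, which mark the sampling dates.
sampling = [
    (r["condition"], r["date"].isoformat())
    for r in long_rows
    if r["cell"].startswith(YELLOW_TAG)
]
print(sampling)  # [('control', '2024-01-01'), ('treated', '2024-01-02')]
```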

So that was actually a proud moment of mine. And it seemed impressive to the team, but it was something I've been doing for years now, except for the most part, it was my own personal work. But I was happy to show this to other people, and then they incorporated that into their own analysis workflow, because they were doing survival analysis in R anyway. So that was pretty cool.

The forgts package

And it was all based on this simple principle: we import cell values, we import formatting, and we translate the formatting to strings and just glue it to the cell values. That gave me an idea last year: translate that instead to gt specifications, which led to the forgts package. All it does is convert a formatted spreadsheet to a gt table, or a gt object.
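The same principle can point at a different target: instead of turning formatting into strings, turn it into style rules for a rendered table. The sketch below emits inline CSS on plain HTML cells as a stand-in. It does not use the gt API or reproduce how the package works, and the Cell class and function names are hypothetical.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Cell contents plus the formatting to carry over into the rendered table."""
    value: str
    bold: bool = False
    italic: bool = False
    fill: Optional[str] = None  # background color as RGB hex, without '#'


def css_for(cell: Cell) -> str:
    """Translate formatting flags into CSS declarations."""
    rules = []
    if cell.bold:
        rules.append("font-weight: bold")
    if cell.italic:
        rules.append("font-style: italic")
    if cell.fill:
        rules.append(f"background-color: #{cell.fill}")
    return "; ".join(rules)


def to_html_cell(cell: Cell) -> str:
    """Render one table cell, adding a style attribute only when there is formatting."""
    style = css_for(cell)
    attr = f' style="{style}"' if style else ""
    return f"<td{attr}>{escape(cell.value)}</td>"


cells = [Cell("Rodentia", bold=True), Cell("n < 5", fill="FFFF00")]
html_row = "<tr>" + "".join(to_html_cell(c) for c in cells) + "</tr>"
print(html_row)
```

Escaping the cell text keeps characters such as "<" from being read as markup in the output table.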

And that would look like this. The spreadsheet looks like this. I added some format, I added some color, I added underline, italic, bold. Read that file, and that there is an actual HTML gt object in my slide.

This is one piece of feedback I know of about the package from Kelly. It does work halfway. I've been using it for almost a year now to create tables in reports and slides, and for teaching, so it works quite well. And a nice surprise, or update for everyone, is that thanks to the work of Fernanda Aguirre-Ruiz, forgts now also exists in Python. The logo is cooler, I think. We both work on it together. And it works as well, so this is code that actually was evaluated when rendering my slides. We get the same gt object. That's an HTML table, it looks pretty good.

So I'll wrap up by saying that if we keep using and following and promoting good practices, and use existing tools, and spread awareness of the existing tools, we can coexist with formatted spreadsheets, which are kind of like the fossil fuels of data, in that they'll still be around for a while.


And then I'll just leave you with the title and tagline for one of my other blog posts. Formatted spreadsheets can still work in R, it's not too late. Thank you.

Q&A

Or Chli, just clarifying that they missed something: which package do you use to extract formats from the spreadsheet? tidyxl. I don't know if I mentioned that or not. tidyxl does everything for us, we just need to parse the output from tidyxl.

OK, wonderful. Next question: do you have tips on ways to politely educate colleagues on the importance of formatting to record data?

Well, didn't you see the sad face in my blog post? It said, please don't do this, sad face. No, so I think what seems to work is just empathy and being nice rather than being snarky. And maybe identifying good resources like the paper I showed, and just maybe printing them out and leaving them somewhere. I don't know. It's what I've tried in my own work environments and it seems to be working. Just scatter them on their desks. Yeah.

All right, thank you, Luis. Let's all thank Luis again.

Bold indicates negative? (Luis D. Verde Arregoitia, Instituto de Ecología, A.C.) | posit::conf(2025)


Featured software

gt