Resources

What an Early 2000s Reality Show Taught Me about File Management - posit::conf(2023)

Presented by Reiko Okamoto

Clutter, whether it's physical or digital, destroys our ability to focus; home organization ideas can be extended to create a workspace where analysts feel inspired to work with data.

Ideas from home organization shows are surprisingly applicable to file management. Using a room divider to establish dedicated zones for different activities in a studio apartment is analogous to creating self-contained projects in RStudio. Likewise, swapping mismatched hangers with matching ones to tidy a closet resembles the adoption of a file naming convention to make a directory easier to navigate. In this talk, I will share good practices in file management through the lens of home organization. We all know that clutter, whether it is in our physical space or on our machine, destroys our ability to focus. These practices will help R users of all levels create a serene, relaxing environment where they feel inspired to work with data.

https://reikookamoto.github.io/; https://github.com/reikookamoto/posit-conf-2023-neat

Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Getting %$!@ done: productive workflows for data science. Session Code: TALK-1090

Transcript

This transcript was generated automatically and may contain errors.

Alright, so one of the most important parts of my late childhood and adolescence growing up in Canada was reality shows on cable television. And whether it involved 11 teams of two in a race around the world, or an up-and-coming chef fighting against an iron chef based on a secret ingredient, I just could not get enough of this unscripted chaos unfolding in front of me.

And during the early days of the pandemic, when I was spending a questionable amount of time on YouTube, one day I stumbled down a recommendation rabbit hole and reunited with another show that I remembered watching growing up called Neat. Now Neat was this dramatic makeover series on Home and Garden Television in which a professional organizer named Helen would help a person or family sort through cluttered living spaces and devise solutions to organize their lives and possessions more effectively.

Now as someone with experience performing data analysis in both academia and the public sector, I just could not help but notice that many of the principles of home organization are actually applicable to file management. So for example, whether it's physical or digital, clutter in our space is usually clutter in our minds as well. And more importantly, clutter shouldn't be taking away our focus and creativity, nor should it be blocking our way to success.

That being said, unfortunately, tips about file management can often be presented as this mixed bag of ideas without a unifying narrative. Or even worse, file management itself can be entirely overlooked in favor of arguably more exciting topics like machine learning and data visualization. So my goal over the next 15 minutes or so is to draw comparisons between home organization and file management to better illustrate that keeping your space clean and tidy is going to save you a lot of time down the road.

Creating dedicated zones

So as I was watching these clips uploaded onto YouTube, I noticed that the show often featured a client that was struggling to declutter either their living and dining room combo or studio apartment. Now what was common to these spaces was that without distinct rooms separated by walls, boundaries became blurred and as a result, it was really unclear as to where one activity ended and another one began. For example, I'm talking about a dining table potentially turning into a secondary office desk. Or an overflowing box of business receipts somehow sitting on a kitchen counter.

But to address this issue, the host would begin by moving existing furniture or adding these room dividers to create dedicated spaces that were conducive to a specific activity. So I'm talking about a space for eating and another space for working.

And I think the digital analog to this kind of clutter is a documents folder that has just become this dumping ground for ad hoc analysis, plots, meeting minutes, and so much more. And we can begin to address this mess by organizing our work into self-contained projects where each project is a folder that contains all the files relevant to a particular piece of work. So this involves the input data, any R scripts, and their associated results. And furthermore, any R code that's going to be sitting in this folder, you're going to write it as if it will be run from a fresh R session with its working directory pointed to the project's root.

And this method of compartmentalization conveniently allows us to move our project or folder from one location on our computer to another or onto an entirely different machine while ensuring that the code is most likely going to work in its new location.
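As a rough illustration of why root-relative paths make a project portable, here is a minimal sketch in Python (the talk itself concerns R, where the same idea applies). The helper name project_path and the file data/raw/survey.csv are hypothetical and not from the talk.

```python
from pathlib import Path


def project_path(root, *parts):
    """Build a path inside a project from pieces relative to its root."""
    return Path(root).joinpath(*parts)


# Assumption: the script is launched from a fresh session whose working
# directory is the project root, so nothing outside the folder is referenced.
root = Path.cwd()

# Hypothetical input file, for illustration only.
raw_data = project_path(root, "data", "raw", "survey.csv")

# The part of the path that matters is the same wherever the folder lives.
print(raw_data.relative_to(root))
```

Because every path is derived from the root, moving the folder only changes the root, not the code.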

And I think we can further elevate this method of compartmentalization by using RStudio projects. So if you're new to RStudio projects, when a new project is created, either in a brand new or existing directory, RStudio creates a project file within the directory, which is then used to store the settings that are specific to that project. So if we ever want to return to the project after taking a small break, we can just open up that project file.

And in the IDE, we'll see that the current working directory is set to the project directory. The file browser is pointed at the project folder. Any previously edited files are restored into the editor tabs. And other RStudio settings, such as active tabs and splitter positions, are going to be restored to exactly where they were the last time the project was closed. And furthermore, a new R session is started.

And this aligns with our idea of self-contained projects that we discussed earlier. And this setup makes it easier to jump back into a specific analysis after you've taken a break from it, or to juggle multiple analyses at once without disorder.

And to tie this all back to home organization, I think it's similar to how it's a lot easier to go from practicing the piano to cooking dinner when you have established dedicated spaces for each activity, rather than having to move dirty laundry off of the piano or that box of business receipts off of your kitchen counter before doing what it is that you actually want to be doing.

Letting go of clutter and version control

And at some point in each episode on Neat, the host, Helen, often encouraged the clients to let go of duplicates and items that they're just holding onto just in case to reduce clutter. And understandably, on the show, some people found it challenging to let go of these sentimental items. You know, this could have been like a preteen refusing to part ways with their collection of stuffed animals, or a parent going through outgrown baby clothes and really trying to carefully choose an item from each milestone.

And I think as data analysts, we can cling on to files in a similar fashion, especially because data analysis often involves iteration and revision to go from this initial idea to the final deliverable. And without a good system in place for tracking how something evolves over time, projects can easily turn messy, as you see up here.

And to address this mess, one thing that we can do is to adopt version control. And this gives us the option to revert selected files back to a previous state, compare changes over time, see who modified something and when, all without having to keep every version of the file, further reducing the amount of visual clutter that we have on our machines.

Directory structure and README files

And on the show, once the unnecessary clutter was removed from the home, the host would then set up an organizational system that was suited to the client's lifestyle and the possessions that they held. Specifically, when using storage bins, the host, Helen, would make it really easy to find things in the future by leaving no doubt as to what was inside each one of these containers. So we're talking about using transparent, clear containers or using the label printer and printing out a label for each one of these boxes.

And in a similar fashion, I think as analysts, it's crucial for us to adopt a directory structure that matches the type of project that we're working on. So for an ad hoc analysis that I have on the left, that might mean creating a subfolder for raw data, another one for the code used in the analysis, maybe another one to keep the respective outputs. Meanwhile, on the right-hand side, for an R package, the package development conventions may encourage us to create a subdirectory for function definitions, another one for unit tests, another one for vignettes, and so on.
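A skeleton for the ad hoc analysis layout could be set up along these lines. This is a sketch only: the folder names data-raw, code, and output are assumptions, since the talk names the purpose of each subfolder but not what to call it.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names: one for raw data, one for code, one for outputs.
SUBFOLDERS = ("data-raw", "code", "output")


def create_skeleton(root, subfolders=SUBFOLDERS):
    """Create the project folder and its subfolders; return the created paths."""
    root = Path(root)
    created = []
    for name in subfolders:
        folder = root / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_skeleton(Path(tmp) / "ad-hoc-analysis")
    print([folder.name for folder in folders])
```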

But regardless of what kind of project we're working on, whether that's an ad hoc analysis or an R package, going back to this idea of leaving no doubt as to what's inside each container, we should always consider adding a README to the project root, which tells other people why our project is useful, what they can do with it, and how they can use it. And if there's a lot of content to squeeze into one README file, I think that invites us to consider whether it's appropriate to place additional README files in the subdirectories.

And speaking from personal experience, at my previous workplace, we had a folder on a shared drive that contained many, many, many ad hoc analyses, or many, many Rmd files, and each file was for a particular piece of ad hoc analysis. And a colleague of mine suggested creating a lookup table, which summarized the content of each file to make it easier to refer back to the previous analysis when needed.

So I took her suggestion, and I wrote a code snippet to automatically parse the YAML metadata from each Rmd file that was sitting in the analysis subdirectory to generate this lookup table, which was then embedded into the README file of the analysis subdirectory.

So here, I'm sharing a reworked version of the code that I just talked about, this time for Quarto documents. And I'm beginning by defining a function with the help of the quarto package's quarto_inspect() function to retrieve the metadata of interest for a given file. And then I map this function that I just defined, extract metadata, to all the .qmd files found in a given directory, to generate this table that you see on the bottom, which allows me to quickly inspect the file name, the title, and the subtitle sitting inside the YAML metadata component of each Quarto document.

And if you want, you can then add this table to the README.md of a particular subdirectory that you might want to provide more information on.
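The code shown on the slide is not reproduced in this transcript. The sketch below illustrates the same idea in Python under loud assumptions: instead of the R function quarto_inspect(), it swaps in a simplified hand-written reader that only understands top-level key: value lines in the YAML header, and the function names are hypothetical.

```python
from pathlib import Path


def parse_front_matter(text):
    """Read top-level `key: value` pairs from a leading `---` ... `---` header.

    Simplified on purpose: nested keys and multi-line values are ignored,
    so this is not a substitute for a real YAML parser.
    """
    lines = text.splitlines()
    meta = {}
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        # Skip indented (nested) lines and lines without a colon.
        if sep and not line.startswith((" ", "\t")):
            meta[key.strip()] = value.strip().strip("\"'")
    return meta


def extract_metadata(path):
    """Return the file name, title, and subtitle for one document."""
    meta = parse_front_matter(Path(path).read_text(encoding="utf-8"))
    return {
        "file": Path(path).name,
        "title": meta.get("title", ""),
        "subtitle": meta.get("subtitle", ""),
    }


def format_table(rows):
    """Render the rows as a pipe table that can be pasted into a README."""
    lines = ["| file | title | subtitle |", "| --- | --- | --- |"]
    lines += [f"| {r['file']} | {r['title']} | {r['subtitle']} |" for r in rows]
    return "\n".join(lines)


def lookup_table(directory):
    """Build the lookup table for every .qmd file directly inside a directory."""
    paths = sorted(Path(directory).glob("*.qmd"))
    return format_table([extract_metadata(p) for p in paths])
```

Calling lookup_table("analysis") on a hypothetical analysis folder would return the table as text, ready to be written into that folder's README.md.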

File naming conventions

And once again, going back to the show, so once the host helped clear the clutter and figured out what kind of organizational system was best suited to the client's lifestyle and possessions, she further elevated the space by ensuring consistency. And I think a great example of this is using matching hangers to reduce visual clutter and to make a closet easier to navigate. And this can be a little bit more difficult when you have a mix of metal and plastic hangers, and maybe some of those hangers that you get for free from the dry cleaners.

And in the world of file management, I think this messiness resembles not having a proper framework in place for naming our files. But through adopting a file naming convention, we can further reduce the amount of visual clutter on our machines and also make it easier for us to find things in the future. And Jenny Bryan, who is also at this conference, I believe has shared some great resources on file naming conventions. So for the next little bit, I'm going to be drawing from her expertise, and I'll share the three features that constitute a good file name.

First, a good file name should be machine readable. So we're talking about avoiding special characters, avoiding spaces, and also being mindful of case sensitivity. And this is so that our file names can play well with our machine's inherent search functionality. And here I'm focusing on the intentional use of underscores to allow for the easy recovery of metadata using glob patterns or regular expressions down the road.
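To make the recovery of metadata concrete, here is a minimal sketch in Python. It assumes a hypothetical convention of date_project_description.extension, with underscores separating the fields; the pattern and the example file name are illustrative and not taken from the slides.

```python
import re

# Assumed convention (hypothetical): date_project_description.extension
# Underscores separate fields; hyphens may appear inside a field.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<project>[a-z0-9-]+)"
    r"_(?P<description>[a-z0-9-]+)"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_file_name(name):
    """Return the fields encoded in a file name, or None if it does not conform."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


print(parse_file_name("2023-09-19_posit-conf_file-management-slides.pdf"))
print(parse_file_name("Final Report (2).docx"))  # spaces and capitals do not conform
```

A name that fails to parse is a quick signal that it drifted from the convention.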

The second feature of a good file name is that it should be human readable. And here I'm using hyphens to delimit words to make the file names a little bit easier for us to read. And I'm also focusing on using descriptive and non-generic keywords. This is so that I can provide somewhat of a preview in terms of the content and purpose of the file.

The third feature is that the file name should allow for the files to be arranged in a useful way while assuming alphanumeric ordering. So here in the top half of the slide, I'm using the ISO 8601 standard for dates when I want my files to be arranged in chronological order. And in the bottom half, I'm making sure to left pad the numbers with zero when I want my scripts to be in logical order. Or the order in which they're called in an analysis workflow. And left padding with zero prevents situations like the number 10 written as one zero appearing before two in a list.
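The effect of zero padding and ISO 8601 dates on alphanumeric ordering can be checked with a short sketch in Python. The script names and dates are made up for illustration, and pad_prefix is a hypothetical helper.

```python
from datetime import date


def pad_prefix(name, width=2):
    """Left-pad the numeric prefix before the first underscore with zeros."""
    number, rest = name.split("_", 1)
    return f"{int(number):0{width}d}_{rest}"


# Hypothetical script names, numbered in the order they run.
unpadded = ["1_import.R", "2_clean.R", "10_report.R"]
padded = [pad_prefix(name) for name in unpadded]

print(sorted(unpadded))  # "10_report.R" sorts ahead of "1_import.R"
print(sorted(padded))    # the intended workflow order is preserved

# ISO 8601 dates (YYYY-MM-DD) sort chronologically when sorted as plain text.
days = [date(2023, 9, 20), date(2022, 12, 1), date(2023, 1, 5)]
stamps = [day.isoformat() for day in days]
print(sorted(stamps))
```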

Recap and closing

So we covered a lot of ground, so I think it's a good moment to recap what we've discussed. And I hope over the past 15 minutes, I was able to convince you that by applying techniques I shared from a home organization show, such as creating dedicated zones, reducing redundancy, setting up customized storage solutions, and pursuing visual consistency, you can create a serene, relaxing workspace that facilitates reproducible analysis, is inviting to other collaborators, including your future self, and exudes professionalism.

So whether you're an instructor looking to revise their file management lesson plan, or maybe you are someone who wants to inspire their colleagues to reconsider file management practices in the workplace, I hope this narrative provided you with some inspiration today. Thank you for listening, and I hope you enjoy the rest of the conference.

Q&A

How do you encourage ongoing good file management?

One thing that I caught myself doing today, I was trying to upload a copy of these slides to my GitHub, and I realized that the default file name for the PDF wasn't so good. And I was really tired, because it was like 9 o'clock last night, and I was like, I'll just leave it as is. But then I kind of thought to myself, I should be a good role model and probably give it a better file name. So I think it starts with you applying and demonstrating the very practices that you are trying to convince your colleagues to follow.

So in a home, there's a need for a drawer that is messy so that everything else can be un-messy. Is there a similar idea for files?

Yeah, in the past, I think involuntarily, we've ended up in situations where folders like that have existed. But then I think you also then have to be responsible for setting up a reasonable timeline as to when you want to go back and re-sort, delete things from that dumping ground folder to really keep it as a temporary solution as opposed to something that is going to consistently live within your team's workspace.