Resources

Dean Marchiori | A retrospective on a year of commercial data science projects in R | RStudio

Full title: How reproducible am I? A retrospective on a year of commercial data science projects in R

Reproducibility is a critical aspect of science that enables trust and communication. In R, many tools exist to put the best practices of reproducibility into the hands of data scientists. But outside of a research setting, how does reproducibility hold up in commercial data science projects? In this talk I take an honest retrospective of my own commercial R projects over the last year. I look at the various types of analyses completed, and which workflows were selected and why. Through this process we can learn how workflow choices may help in the short term but hinder in the long term. More importantly, what can be done to strike the balance between progress and perfection when doing data science in the wild?

About Dean: Dean Marchiori is a statistician based in Sydney, Australia. He currently works with Endeavour Energy as a Senior Data Scientist, modelling bushfire and vegetation risk on the electricity network. Dean's career started in finance as an equities trader before he moved into advanced analytics, where he has worked with some of Australia's largest organisations. His professional interests are geospatial analysis, time series modelling and the R programming language. In 2019, Dean was named one of the top 10 analytics leaders in Australia by IAPA. He is also recognised as an Accredited Statistician by the Statistical Society of Australia. He holds a Bachelor of Science in Mathematics (awarded with a University Medal) and a master's degree in Applied Finance. Outside of work, Dean enjoys bodysurfing, running and spending time with his wife and two boys.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

G'day, I'm Dean Marchiori. I'm a statistician from Sydney, Australia, and I'm interested in reproducible analysis and workflow choices in R. But in my day-to-day practice in industry as a commercial data scientist, is my work actually reproducible? To answer this, I conducted a reproducibility retrospective, or repro-retro. I went back through the last year of my commercial data science projects and rated each project across a number of dimensions that I value in my practice.

You can pick aspects that suit your own work, but the dimensions I picked cover whether an analysis was replicable, modular, auditable, automated, and collaborative. So what were the results? Out of 55 projects, I scored an average of 3.8 out of 5. This is good, but not great. So if I care so much about this, why wasn't my work rated higher? Here are the lessons I took from my repro-retro.

Lesson one: not everything needs to be reproducible

Controversial opinion number one: not everything needs to be reproducible. While all projects benefited from some aspect of reproducibility, the most important consideration for me was this: how do you tell what needs to be reproducible?

Enter my Fiji test. If you have a task that's sufficiently important, you may want to consider the Fiji test. Aussies like Fiji because it's only a few hours away, but if you're lying on a beach on holiday in Fiji, spare a thought for your colleagues back in the office. If they need to rerun some of your analysis, will they be able to find it, understand your work, run it without too much fuss, and make reasonable enhancements or small changes? The last thing you want while you're sipping your pina colada is to be disturbed by a weird phone call about some model that you trained.

When I looked at the types of analytics tasks I completed, not everything needed to pass the Fiji test. A lot of work in industry is ad hoc and quite simple. In a commercial or high-pressure environment, I think it's perfectly reasonable to have different standards on workflow choices depending on the type of work and the return on investment that you need, so long as a reasonable standard is maintained and your team is consistent with these choices. However, this may mean getting better at anticipating the full scope of a project before rushing into solving a problem, which can be a challenge. In return, it gives you the freedom to pick a more lightweight and flexible workflow if desired.

Picking the right tools for the job

My next lesson was picking the right tools for the job. I'd been searching for the single biggest and best workflow choice, but I found it paid to have options. When I examined my tooling and workflow options, the results of my repro-retro varied. Using a basic directory structure template to organize my projects was okay all round, but quickly got cumbersome. Notebooks like R Markdown were nicely replicable, but tended to become a bit of a dumping ground for code. drake, a more formal workflow system, carried more overhead and took more admin to get used to, but overall it enforced much better standards, particularly around writing functional and modular code.

While any system can be made arbitrarily sophisticated and do a good job, I found it more effective to have a few different workflow choices available rather than try a one size fits all approach. All workflow choices are naturally optimized for certain applications, and I found it better to only take on additional complexity when the job really needed it.
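To make the drake trade-off concrete, here is a minimal sketch of what a drake workflow looks like. The file name, variables, and model are hypothetical; the point is that each step is a named target built from functions, so drake can track dependencies and rebuild only what's out of date.

```r
# Minimal drake workflow sketch (file and column names are made up).
library(drake)

plan <- drake_plan(
  raw    = read.csv(file_in("projects.csv")),   # file_in() lets drake watch the file
  model  = lm(score ~ hours, data = raw),       # depends on `raw`
  report = summary(model)                       # depends on `model`
)

make(plan)      # builds only targets that are missing or outdated
readd(report)   # retrieve a cached target from drake's store
```

This is what enforces the functional, modular style mentioned above: because every target must be an expression over other targets, you can't easily fall back on a script-shaped dumping ground of side effects.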

Replication vs. reproduction

Next, I found in my work I had a tendency to mistake replication for reproduction, and the same goes for automation. Getting your code to rerun is only part of the solution, so consider the factors that matter in your practice and be deliberate about what you optimize for, and use that to guide your workflow choice. And how do you do that? Well, do a repro-retro. The process wasn't and isn't intended to be scientific or unbiased. The true value of a repro-retro is in the self-reflection and improvement. So go and dig through your old code and think about how you might do things better.

If you want to learn more, you can reach out to me or visit my AnalysisFlow GitHub repo. I'd also like to acknowledge and thank my collaborator Miles McBain for many interesting discussions in this space. Thank you and good luck.