
Eric Cronstrom | How we made the switch: a case study on automating a complex report | RStudio
The Center for Charter Schools at Central Michigan University produces annual reports for about 60 schools. The reporting process used to be a cumbersome blend of many technologies: SQL, Excel, InDesign, and VBScript, all culminating in a nice-looking, branded report for each school. Two years ago, staff turnover allowed the data team to rethink the process. Having had experience with R from graduate work, the team at the Center decided to go all in on RStudio and R Markdown for report production a mere one month before the reports were due to be published. This talk is a case study of how we as an organization adopted RStudio tools to streamline a cumbersome process, with fantastic results.

About Eric: Eric is responsible for administering a wide range of day-to-day program functions associated with the performance and accountability of CMU partner schools. He ensures that the data the Center utilizes to evaluate school performance is accurate and stored within a sound data infrastructure. He also leverages his wide range of technical skills to lead the development and production of reports and respond to questions regarding school performance and demographic context. Prior to joining the Center, Eric was a database administrator for Central Michigan University Libraries. He has also served as a lecturer at Central Michigan University, teaching courses in web design, database design, and programming. He earned a master's degree in information systems management and a bachelor's degree in computer science from Central Michigan University. He is also pursuing an additional master's degree in applied statistics at Central Michigan University.
Transcript
This transcript was generated automatically and may contain errors.
My name is Eric Cronstrom with the Center for Charter Schools at Central Michigan University. I'm going to tell you a story about my first project after joining the Center.
Now, this project was a challenge, but we were able to automate a process that had been performed manually for more than a decade. On its own, the project was successful, but we turned it into an opportunity to rethink how the organization approaches data science.
This is a story about the technology and the techniques we used to automate a task that seemed too complex for procedural document generation. But more importantly, this is a story about how to make a difference in an organization, about how to build trust and convince leadership to try the new thing.
The Annual Performance Report
One of the responsibilities we have as a charter school authorizer is to hold our partner schools accountable for the academic performance of their students. One of the ways we do this is with the Annual Performance Report.
The Annual Performance Report, or APR, is a document that summarizes how well students in each of our 58 partner schools are doing academically. In the spirit of transparency, these reports are made publicly available on the Center's website, and they're also distributed to school boards and other stakeholders as print booklets. The organization demands high production values for these documents, and they have to be absolutely correct. And to make no bones about it, they look good. And these publications are the primary responsibility of my new job.
So try to picture yourself at a job interview. Across the table is your new potential boss. He's an impressive guy with a PhD in psychometrics. He's explaining how important these reports are and some of what goes into making them. But there's a catch. These reports that usually take months to produce have to go to print in seven weeks. Well, to quote Richard Branson, if someone offers you an amazing opportunity but you're not sure if you can do it, say yes and learn how to do it later.
The old process
So how do you make an APR? First things first, you have to get your data. There are a variety of different sources. Individual test scores from assessment vendors, student information from the Michigan Department of Ed, demographic and comparison data from the state and other public entities, and so on. Once you download everything, you load it into the database. Sometimes. Some of it just stays as flat files.
After sourcing all your information, you can start producing the visuals with Excel. A school can have up to 23 different visualizations depending on the grade levels offered. And there are nine different Excel templates that produce all of the charts and graphs and tables required by the APRs. At this point, any data that actually made it into the database gets pulled back out and copied into an Excel template. Then you manually output the correct visuals for each school, validating the numbers as you go. One at a time for 23 visuals per school for 58 schools.
Now the last link in this chain was Adobe InDesign, where the actual documents get produced. There's a master report template that gets updated with fresh content every year. And finally, one last script pulls everything together to produce one PDF for each school.
Deciding to try something new
This is where being a new person helps a bit. Hiring a new team member is an opportunity to have a fresh set of eyes, who, like the proverbial boiled frog, knows right away when they're dropped into hot water. Our tight deadline was also a blessing in disguise, since we knew there wasn't enough time to do things the way they had done before.
So with only six weeks left at this point, we sat down in the conference room and hashed out our plan. This is a picture I took of the whiteboard after that meeting. We figured we had three potential paths forward: a fully automated solution built on R and rendering documents with R Markdown; a hybrid solution where we render the visuals with R but still use InDesign to pull the documents together; or falling back to the old process.
I should note that up until this point, nobody on the team, myself included, had used much R. Excel was the tool of choice for data visualization, and most of the real analysis was performed in SAS. So we gave ourselves only a short amount of time to experiment with R. If we couldn't have a proof of concept for production-ready visuals in 10 days, we would use Excel. And if within a month we couldn't have an R Markdown report that met our publication standards, we would use the InDesign templates.
Like I said, I hadn't used much R up until this point. I was comfortable with programming; my background is in software development and automation. I had some exposure to R through my graduate coursework, but the only R Markdown reports I had ever created were academic white papers. Looking back, I think it was this discussion about the risks involved, being open and honest about what I knew and what I didn't, and having a backup plan that convinced the team to try something new.
Building the new report
Now that we had a plan and we convinced the organization to let us experiment, we set out to recreate this report from the ground up. And the first thing we needed to do was prove to ourselves that we could create a publication that adheres to our identity standards using R Markdown. And there was one issue that almost stopped us right away.
So at this point in my career, I had been writing code professionally for 10 years. I've taught web development courses. I've taught database design at a graduate level. The single issue I spent the most time on for this report was getting a box with our CMU maroon background color with white text using a custom font in LaTeX.
Maybe that just stands out as the most difficult technical challenge because almost everything else was done with ggplot and ggplot is amazing.
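For anyone facing the same fight, one common way to get a colored box with white text in LaTeX is the tcolorbox package. This is a minimal sketch, not the Center's actual setup; the hex value for the maroon and the font choice are placeholder assumptions.

```latex
% Sketch of a branded callout box using tcolorbox.
% The color value and font are placeholders, not the Center's actual brand settings.
\documentclass{article}
\usepackage{xcolor}
\usepackage{tcolorbox}
\definecolor{cmumaroon}{HTML}{6A0032} % placeholder hex for the brand maroon

\newtcolorbox{maroonbox}{
  colback=cmumaroon,   % maroon background
  colframe=cmumaroon,  % matching frame so no border shows
  coltext=white,       % white text
  fontupper=\sffamily  % swap in the custom brand font here (e.g., via fontspec + XeLaTeX)
}

\begin{document}
\begin{maroonbox}
Key findings for this school appear here.
\end{maroonbox}
\end{document}
```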
Regardless, it's not my proudest moment. More seriously though, the reason this process had been performed manually for so long is that each school is a little different. Different schools offer different grade levels. Some of our schools are K through 12. Some are just high schools and some are kindergarten through third grade. And beyond that, we evaluate the academic performance of our schools based on the terms laid out in their individual contracts. The educational goals of many of our schools are similar, but some are operating under unique circumstances and all of those differences need to be represented in the annual performance report.
When you're manually producing visuals in Excel, you can tweak things on the fly to accommodate all those variations. Trying to automate around all that variability presents an interesting design challenge. There are two main strategies we employ to accommodate this, depending on the prevalence of the variation and the impact that it has on the report. You can always write flow control logic into your R code so your program will behave differently depending on what school's report is currently being rendered. This is okay if it's a unique situation that only impacts one or two schools, otherwise it can get out of hand quickly.
What we try to do first is to generalize an issue and develop a data structure that represents all possible variations. It effectively turns part of a school's contract into data. A good example of this is you can have two schools that both offer instruction at grades K through 8. School A's contract stipulates they're to be evaluated on the academic achievement of their students in grades 3 through 8. School B's contract says they are to be evaluated based on their students in grades 3 through 7, excluding grade 8. Instead of a giant if-else chain in your code to keep track of all the contract grades, you can develop a data structure that represents that information.
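A minimal sketch of that idea in R, using the two hypothetical schools from the example (the table layout and function names here are illustrative, not the Center's actual schema):

```r
# Represent each school's contract-evaluated grades as data,
# instead of a growing if-else chain in the rendering code.
contract_grades <- data.frame(
  school = c(rep("School A", 6), rep("School B", 5)),
  grade  = c(3:8, 3:7)
)

# Look up the grades a given school is evaluated on.
grades_for <- function(school_name, tbl = contract_grades) {
  tbl$grade[tbl$school == school_name]
}
```

With this in place, the report code asks the data which grades to render, and a new contract variation means a new row, not new logic.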
Meeting the deadline
Going through this process, we identified more problems than we were able to fix right away. When you have an aggressive timeline, you do what you have to do to get the job done. To be fair, we only gave ourselves six weeks to learn ggplot, markdown, and LaTeX. Our code was never going to be pretty, but I'm proud of what we managed to accomplish.
We met our deadline with a couple of days to spare. The new fully automated reports look great. With the push of a button, an R driver program renders one report per school from a single R Markdown template. And the best part is it's completely reproducible: next year, we can push the same button again and get the same reports with updated data.
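A driver like this typically comes down to a loop over `rmarkdown::render()` with a parameterized template. This is a sketch under assumptions; the template filename, parameter name, and school IDs are illustrative, not the Center's actual code.

```r
# Sketch of a driver that renders one report per school from a single
# parameterized R Markdown template (filenames and IDs are placeholders).
make_output_file <- function(school_id) {
  paste0("APR_", school_id, ".pdf")
}

render_all <- function(school_ids, template = "apr_template.Rmd") {
  for (id in school_ids) {
    rmarkdown::render(
      input       = template,
      params      = list(school_id = id),  # template declares `params: school_id`
      output_file = make_output_file(id)
    )
  }
}

# render_all(c("A001", "B002"))  # one PDF per school
```

The template's YAML header would declare `params: school_id: ""`, and all school-specific filtering keys off that value.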
That is, of course, until your communications department updates their brand identity standards with new fonts and new logos and new rules. Such is life.
Building on success
So we sent our reports out to print on time, but we knew we weren't done. We used our success with R in this project to convince the organization to invest more in our data infrastructure. We just wrapped up the implementation of our RStudio Connect server, and we have time carved out for research and development of dashboards and other analytics projects. My next big project is a Plumber API that will feed data to a web app hosted on our website. Gone are the days of juggling flat files between projects. Now I access data in our SQL Server directly from R with dbplyr, which I'm convinced runs on black magic.
Showing results is what convinced the organization to give us the time to invest in ourselves. We've started adopting best practices around data management and software development. Now all of the data that we use makes it into our data warehouse, and each data pipeline has an automated Integration Services package. Because as it turns out, Excel was our ETL tool as well as our data viz tool. All of the code and assets needed to build each solution are stored in one spot in source control. And we're able to spend more time building and doing work that we enjoy, because exporting hundreds of charts from Excel one at a time is super boring, and more time contributing to research at a time when research in education is critical.
I hope that our story helps inspire you to take on your next big project. Anyone, regardless of tenure or position or job title, can be a leader. You just need to take the first step.
