Resources

George Kastrinakis | Building a data science pipeline for the FT with RStudio Connect | RStudio

Talk from rstudio::conf(2020). We have recently implemented a new Data Science workflow and pipeline, using RStudio Connect and Google Cloud Services. This has vastly decreased our pipeline complexity, allowing us to bring our models and products into scheduled production more quickly. In addition, our workflow, working closely together as a team on all projects on a regular two-week sprint cycle, has increased the range of projects we have been able to take on and complete. To detail some of the key lessons we’ve learned (and some of the difficulties!), we’ll walk you through one of our recent sprints, where we productionalised the generation of a suite of behavioural and demographic features so that they can be more easily plugged in to a range of models and used across the business by the FT’s platform and product teams.

Transcript

This transcript was generated automatically and may contain errors.

Hello everyone, my name is George. I'm a data scientist at the Financial Times, based in London, and today I'm going to walk you through the current pipeline we developed using RStudio Connect, compare it with the previous one we had, and show what we have changed and how we have improved.

So we're going to start with a brief introduction to the Financial Times and the team. We'll then move on to describing what data science means for the Financial Times and what types of problems we're trying to solve. After that we'll have an overview of our new pipeline and the previous one, all of the technologies involved, and a point-by-point comparison between the two. Finally, we'll go through our agile practices and see how Connect and our new pipeline have helped.

The Financial Times and the team

So the Financial Times, as you might know, is a newspaper. It was first published in 1888 in London, where we are still based, with the main offices in London. The Financial Times operates on a subscription model, which means all of our users need to have an active subscription, either as a company or as an individual.

So in terms of data, we have all of the subscription data, both for individuals and companies, which is really nice if you want to aggregate things or track events and users over time. We have behavioural and demographic data, as you would expect, and we track events both from our site and our app. So all of that data allows us to do a lot of analysis and develop different data science models that we then provide to different teams within the company, so we have many different stakeholders from across the business.

So moving on to the team, currently there are six of us. Back in 2017 there were just two of us, so we have managed to triple in size over the past two and a half years, and we are now planning to double again over the next couple of months. The team is Adam, Cloudy, Grace, Ollie, Simon and me. It's really interesting how wide-ranging our backgrounds are, though we all have one thing in common: none of us is a software developer. As you will see, this has driven some of our decisions on what tools we use, how we use them, and what the processes within the team are.

Data science at the FT

So, data science at the FT: if we wanted to summarise it in just a few points, we would end up with three. The first is that we try to create models that help users become more engaged with the products we offer, either the on-site products or the app, usually by adding more personalisation to the user journey, promoting specific types of content or products the user might not have interacted with in the past. The second is that we constantly try to optimise the different parts of our acquisition funnel, so we get better at acquiring new subscriptions and retaining them. And finally, we develop a set of metrics to monitor the health of the business.

These range from RFV, our key engagement metric, which is based on recency, frequency and volume of reader activity, to LTV, the lifetime-value calculation we use across our user base.
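To make the recency/frequency/volume idea concrete, here is a toy score in R. The weights, caps and time windows below are invented for illustration; the talk does not describe the FT's actual RFV formula.

```r
# Toy illustration of a recency/frequency/volume style engagement score.
# All weights and thresholds are made up for this example; the FT's real
# RFV calculation is not described in the talk.
rfv_score <- function(days_since_last_visit, visits_last_90d, articles_last_90d) {
  recency   <- pmax(0, 1 - days_since_last_visit / 90)  # 1 = visited today
  frequency <- pmin(1, visits_last_90d / 60)            # capped near daily use
  volume    <- pmin(1, articles_last_90d / 120)         # capped reading volume
  round(100 * (recency + frequency + volume) / 3)       # 0-100 scale
}

# An engaged reader scores high; a lapsed one scores low.
rfv_score(days_since_last_visit = 2,  visits_last_90d = 40, articles_last_90d = 80)
rfv_score(days_since_last_visit = 60, visits_last_90d = 2,  articles_last_90d = 3)
```

Being vectorised over its inputs, a function like this can score an entire user table in one call.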

On top of those three goals we have developed a few different models, as you would expect. There's next best action, where we try to identify products or features the user hasn't used on the site or in the app recently, and send them a reminder or a promotion for that particular product. In terms of recommendations, we provide personalised content and personalised topics for users to interact with. We make predictions for the volume of new subscriptions based on past business performance. We have attribution modelling, where we try to identify the marketing campaigns that are the real drivers of acquisition and growth for the business, and then use that knowledge to make better decisions about where to spend our marketing budget. And we segment users at many different levels, and use those segments for analysis or for data science models.

Finally, we have a set of models around identifying potential leads for B2B contracts. If we identify a user who is a decision maker, or who comes from a company that we know used to have a contract with us, or has one we want to grow, we can then ask them to move to a B2B account.

Sprint cycle and agile practices

So, on to our sprint cycle. As I said at the beginning, we use agile principles to some extent, so we split our time into two-week sprints. We hold our planning and prioritisation meetings a few months before the actual start of the sprint, where we define requirements, objectives, and dependencies on other teams or on specific data points, and we involve all of the relevant stakeholders in that discussion to make sure we are doing things that are relevant for the business. Then we have a kick-off meeting just before we start working on the sprint, where we have a more detailed discussion of exactly how we are going to meet the objectives and fulfil the requirements; we create cards for our board and make sure that all of the existing dependencies have been met. During the sprint we hold daily stand-ups, where we again meet with all of the relevant stakeholders, show results from early analysis, ask for feedback, and make sure we are going in the right direction. Then, at the end of the sprint, we have a retro where we take note of things that went well, things that didn't go as well, and next steps for future iterations of the same project.

The old pipeline

So now let's actually go and see what the current and the old pipelines look like. Back in 2017 we decided we wanted to create our own custom pipeline, so we found a framework called Luigi, which is written in Python and is a really nice tool for creating a custom pipeline for any language, really. Before that we had a big dependency on our data platforms team: anything we published would go through their platform. We wanted to move away from that, which is why we decided to go with Luigi.

At that time the pipeline was hosted on AWS. All of the data was in Redshift, and we had to create a separate SQL script for every query we ran against Redshift. We then had to create Docker containers holding all of the R scripts, so we would load the data from Redshift, do any data wrangling, load the model, execute the model within the Docker container, and then send the data back to Redshift and back it up in S3. On top of that we had Jenkins to schedule the jobs to run whenever we needed them to. And since Luigi is written in Python, the way it works is that you provide the dependencies and the order in which the steps need to run, and we did that by providing a Python script.

So, as you can see, there were quite a few technologies involved: we had to write SQL, R and Python, which made the pipeline hard to extend, hard to maintain, and really hard for new joiners to get up to speed with. Still, it served us well until we started to grow as a team.

The new pipeline with RStudio Connect

So then we found out about RStudio Connect, which is now at the core of our pipeline. We have also moved all of our infrastructure to Google, which was not a decision of the data science team but of the data platforms team; again, it proved successful and we're quite happy with it as well. Now the data comes from BigQuery, and we go straight into RStudio Connect, where we run our R scripts, load a model, execute it, and then store the outputs in BigQuery and in GCS buckets.

So, as you can see, all of the code required is just R: we interact with BigQuery and the buckets using the relevant R packages. It's really easy to maintain, really easy for someone who knows R to get up to speed with, and we've been quite happy with it so far.
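A single scheduled step in a pipeline like this might look roughly like the sketch below, assuming the bigrquery and googleCloudStorageR packages (both real CRAN packages for these services). The project, dataset, table and bucket names are placeholders, not the FT's real ones; on Connect this would typically be the body of a scheduled R Markdown document.

```r
# Hypothetical sketch of one scheduled pipeline step on RStudio Connect:
# read from BigQuery, run a model, write results to BigQuery and GCS.
library(bigrquery)
library(googleCloudStorageR)

# 1. Load the input data from BigQuery (names are placeholders).
tb  <- bq_project_query("my-gcp-project", "SELECT * FROM analytics.features")
dat <- bq_table_download(tb)

# 2. Load and run the model.
model  <- readRDS("model.rds")
scores <- data.frame(user_id = dat$user_id,
                     score   = predict(model, newdata = dat))

# 3. Write the outputs back to BigQuery and to a GCS bucket.
out <- bq_table("my-gcp-project", "analytics", "scores")
bq_table_upload(out, scores, write_disposition = "WRITE_TRUNCATE")
gcs_upload(scores, bucket = "my-model-outputs", name = "scores.csv")
```

Because everything is one R script, publishing it to Connect and putting it on a schedule is the few-minutes job mentioned below, rather than a new Docker container plus a Jenkins job.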

Comparing the two pipelines

So if we compare the two pipelines: we now only have R, whereas in the past we had three different languages, so that's definitely a benefit. We removed a lot of complexity that, as it turns out, was not really necessary. We can now publish and schedule things in just a few minutes, whereas in the past it would take around a day, and for a new joiner maybe a week. We can make the most of Shiny dashboards and apps and deploy anything we want really easily. The new setup is also a great fit for our team's capabilities: as I said, we're not software developers, and maintaining a pipeline involving Docker, Luigi and AWS was much, much harder than what we have now. And, as you'll see at the end, the current pipeline fits neatly with all of our agile processes.

Setting up RStudio Connect

So now we can briefly go through what setting up RStudio Connect involves. As I said, we are using Google, but it would be more or less the same process on Amazon or any other cloud provider. For us it was just a matter of going to the console and creating a new instance, which can be a very small Linux server to begin with. You can then SSH directly onto the server and install the basic requirements: a version of R, a version of Java, and RStudio Connect itself. You can also create a server-wide data store if you want data to be shared between different jobs or different runs of the same job. Finally, you just need to configure the buckets and BigQuery, create the relevant credentials, and use those inside RStudio Connect, and that's it.
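The steps just described might look something like the following on Google Cloud. The instance name, machine type and image are placeholders, and the exact RStudio Connect install command depends on the distribution and the version you download from the vendor's site, so treat this as a sketch rather than a recipe.

```shell
# 1. Create a small Linux VM (names and machine type are placeholders).
gcloud compute instances create rsconnect-server \
    --machine-type=e2-standard-2 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud

# 2. SSH onto the server and install the basics: R and Java.
gcloud compute ssh rsconnect-server
sudo apt-get update
sudo apt-get install -y r-base default-jre gdebi-core

# 3. Download the RStudio Connect .deb from the vendor's download page,
#    then install it (the filename here is a placeholder).
sudo gdebi ./rstudio-connect.deb
```

After that, the remaining work is in Connect's configuration file and the service-account credentials for BigQuery and GCS.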

So in the beginning it might have taken us a couple of days to get everything configured and create the proper documentation. If we were to replicate the server now it would probably take just a couple of hours.

How RStudio Connect supports agile working

So finally, moving back to our sprint cycle and how RStudio Connect and the new pipeline have made us better at working in an agile setting. Creating R Markdown reports and sharing results from any early analysis is really easy now. We can move from research and development to deployment whenever we need to. We can deploy multiple versions of the same model; again, that's really, really easy. And on top of that, we can create dashboards and alerts to make sure all of our models look fine and that there is no performance drop or failure in the scheduled runs.

So, as a quick overview of what we went through: we saw how RStudio Connect is now part of our pipeline and how it has improved the way we do data science; we briefly described how we configured and set up our new pipeline on Google infrastructure; and we saw how deploying and scheduling R Markdown reports and dashboards has made us better at working in an agile setting, and how it allows us to keep stakeholders engaged and welcome feedback throughout the sprint cycle. So that was it for me.

Just a reminder that we're also hiring and, as I said, trying to double in size. I'm not sure how relevant this is here, because most positions are going to be in London, but feel free to follow the link and contact either the team or me directly. Thank you very much.

Q&A

So we have a couple of questions that came through. The first one is: what parts of the pipeline are not replaced with RStudio Connect?

All of the R logic stayed, along with the interactions with the database and the buckets, and how we execute models. It might actually be easier to answer what was replaced: we replaced Jenkins, and our monitoring and dashboarding, with the capabilities of RStudio Connect.

Another question, from Glenn: how do you handle Shiny dashboards that require querying large amounts of data, which takes considerable time? So, we do cache the data. If there has been a more recent run within the same day, or within whatever period we want to update the dashboard in, it's faster because we use the cached data rather than querying again. If it's the first run and the query time is quite long, then the dashboard will take quite a long time to load.
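The talk doesn't say how the caching is implemented, but one simple approach that matches the behaviour described is to reuse a saved result if it is recent enough and only hit BigQuery otherwise. Here `run_big_query()` is a stand-in for the real query function.

```r
# One possible caching strategy for an expensive dashboard query:
# reuse a saved result while it is fresh, re-run the query otherwise.
# run_big_query() is a placeholder for the real BigQuery call.
cached_query <- function(run_big_query,
                         cache_file = "dashboard_data.rds",
                         max_age = 24 * 60 * 60) {  # refresh daily
  fresh <- file.exists(cache_file) &&
    difftime(Sys.time(), file.mtime(cache_file), units = "secs") < max_age
  if (fresh) {
    readRDS(cache_file)      # fast path: someone already ran it recently
  } else {
    dat <- run_big_query()   # slow path: first load of the period
    saveRDS(dat, cache_file)
    dat
  }
}
```

With this shape, the first visitor of the day pays the query cost and everyone after them loads from disk, which is exactly the first-run slowness mentioned in the answer.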

And one last question: is there something intrinsic to RStudio Connect that motivated the switch from S3 to Google Cloud, well, GCS? No, the move from Amazon to Google came entirely from our data platforms team; there are R packages to interact with both AWS and GCS.

We should have time for one last one: what type of challenges have you had with scheduling jobs in Connect? I don't think we've had any issues with scheduling on a time schedule. One thing we would really, really like to have is trigger-based runs of models: rules under which a model would run, instead of just specifying that a job needs to run every hour or every day. For example, another job could go into BigQuery, check a field, and, if the field had changed, run the model. Does that make sense? Thank you.
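The workaround the speaker hints at, checking a field in BigQuery from a frequently scheduled job and exiting early when nothing has changed, could be sketched like this. The table, field and `run_model()` function are hypothetical, and this assumes the bigrquery package.

```r
# Possible workaround for trigger-based runs: schedule this script
# frequently on Connect, but only run the model when a watermark field
# in BigQuery has changed. All names here are hypothetical.
library(bigrquery)

last_seen <- if (file.exists("last_update.rds")) readRDS("last_update.rds") else NA

latest <- bq_table_download(bq_project_query(
  "my-gcp-project",
  "SELECT MAX(updated_at) AS updated_at FROM analytics.source_table"
))$updated_at

if (!is.na(last_seen) && latest == last_seen) {
  message("No new data; skipping model run.")  # cheap, frequent check
} else {
  saveRDS(latest, "last_update.rds")           # record the new watermark
  run_model()                                  # placeholder for the model step
}
```

This polls rather than truly triggers, so the schedule frequency bounds the latency, but it avoids re-running an expensive model when the inputs have not moved.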