Tom Schenk & Bejan Sadeghian | Making Microservices Part of Your Data Team
Making microservices a part of your data science team, led by Tom Schenk & Bejan Sadeghian at KPMG

Timestamps:
• 1:09 - Start of presentation
• 4:00 - Challenges and trade-offs of a growing team (how I stopped worrying about hiring)
• 8:14 - What are microservices? (help separate out the different layers of an app)
• 9:36 - Hosting other web technologies on RStudio Connect (e.g., React)
• 12:25 - Simple Hello World example of microservices
• 16:00 - Reason to separate out logging
• 17:15 - How to design & plan microservices (moving from a monolithic Shiny app)
• 17:51 - Challenges to getting started with microservices
• 21:02 - How do you address getting started? (domain-driven design)
• 23:17 - Applying cloud design patterns
• 25:37 - Separation of development duties
• 27:22 - Addressing any risks that come with microservices
• 29:17 - Considering costs and benefits
• 30:11 - Microservices in action: demo of the KODA app (making changes to the organization)
• 36:21 - PowerBI interacting with the same microservice from Connect
• 38:30 - Growing teams face a trade-off of complexity and simplicity (KPMG's path)

Questions:
• 43:28 - Can you use a Shiny front end together with a microservice backend?
• 44:09 - Do you hire separately for back-end data science development and front-end Shiny UI development?
• 46:00 - Are all microservices managed by a centralized unit?
• 47:02 - Who can access RStudio Connect in your organization?
• 48:29 - When you decided to go the microservice route, what was your first step?
• 50:47 - What roles are you hiring for?
• 52:20 - Might you suggest some web service servers that host R-based or Python services?
• 53:33 - Are apps built with microservices as responsive as those that adopt a monolithic architecture, or do microservices introduce a lag?
• 55:09 - Can you show the back-end response data through developer tools?
• 56:45 - Can you speak more about the logging microservice? Did you build it from the ground up or did you adopt an off-the-shelf package or app?
Abstract: Whether or not you’ve heard of microservices architecture, you may want to know how microservices can help you scale R-based applications across an enterprise. As data science teams—and their applications—grow larger, teams can experience growing pains that make applications complex, difficult to customize, or challenging to collaborate on across large teams. This meetup will discuss what microservices are, how they compare to Shiny, how they can help a data science team, and how you can deploy microservices using your RStudio Connect environment.

This meetup will help you understand several key items:
• The basic concept of microservices and their benefits, such as making your code modular, enabling domain-driven design, reducing the complexity of application development, and facilitating larger development teams.
• How to use the Plumber package to deploy APIs as part of a microservices architecture.
• How you can work with front-end development teams using their preferred framework (e.g., React, Angular, Vue) on RStudio Connect.

We will show a widely used application built using a microservices architecture and hosted in RStudio Connect, including before-and-after comparisons to show how a microservices framework leads to a better-looking and better-functioning application. Our team will discuss the journey and growth that led to this new approach, which makes development easier within a quickly growing group.

Speaker Bios:
Tom Schenk Jr. is a researcher and author on applying technology, data, and analytics to make better decisions. He is currently a managing director at KPMG. He previously served as Chief Data Officer for the City of Chicago.
Bejan Sadeghian is a director of analytics at KPMG and leads data science development, which spans from advanced analytics to machine learning engineering.

For upcoming events: rstd.io/community-events-calendar
Info on RStudio Connect: https://www.rstudio.com/products/connect/
To chat with RStudio: rstd.io/chat-with-rstudio
Transcript
This transcript was generated automatically and may contain errors.
Hi, everybody. Thank you so much for joining. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel Dempsey. I'm sure I've met many of you before at meetups like this. Thank you for joining again. But for today's meetup, we will learn how the team at KPMG is scaling their data science applications across the enterprise and working with front-end development teams using microservices and RStudio Connect. So Tom will be kicking things off first. Tom is a researcher and author on applying technology, data, and analytics to make better decisions. He's currently a managing director at KPMG and previously served as the chief data officer for the city of Chicago. Tom will get us going here and then turn it over to Bejan. Bejan Sadeghian is a director of analytics at KPMG and leads data science development, which spans from advanced analytics to machine learning engineering. I'm so excited to turn it over to both of you. Thank you so much for joining us today.
Thank you very much. Thank you for having us here. It is exciting to talk about R in production. I'm a longtime R user and developer. It goes back to my time in grad school when Hadley Wickham and I were both at Iowa State in grad school. I was using early versions of Reshape and ggplot, well before ggplot2. And so it goes all the way back to then where it was a statistical language and now seeing the language mature to talk about being used in production and going to talk about how we use R in production at KPMG.
So I just want to be clear. KPMG is a consulting firm, but this isn't about what we've done for clients or a use case that we're advocating for. This is about how we are using R within our own environment, particularly about using microservices to be able to scale out data science across the enterprise, and the reasons why we opted to go down the microservices route. There have been meetups here, and conversations elsewhere, about using Shiny and how to scale Shiny across the enterprise, and the tactics and techniques to do that, which is absolutely a potential way of doing it. We've taken a different path by making microservices part of the data science team.
And we're going to talk about why we went down that path, the benefits it's provided us, and some of the challenges that you should anticipate when using microservices, and absolutely give you at least a couple of demos today to show how this actually works in production. So I'll hop right into it, and then Bejan will take over for a good chunk of the conversation, and hopefully we'll have a good opportunity for Q&A at the end of this presentation.
Challenges and trade-offs of a growing team
So as I mentioned, we're going to talk about a few things, six things in particular. What are the challenges of growing a data science team? I've built a number of data science teams at this point, in the commercial sector as I do today, and in the public sector when I served as chief data officer for the city of Chicago. So we'll talk about the challenges and trade-offs of what it looks like to grow a data science team and why microservices start to enter the equation. Second, we'll talk about microservices. What are they exactly? There's a lot written on the topic of microservices; I've got a couple of my favorite textbooks here that I have read around the concept. And then we're going to dive into a demo, a hello world, a very simple example of a microservice. Then we'll talk about how to plan and design for microservices, take a look at an application demo, and then finally have time for Q&A and a recap.
So, the challenges and trade-offs of growing a data science team, or what I call "how I stopped worrying about hiring," because as the team is growing, you start running into these concerns about how you scale your applications and how you scale your team to work across one or many different applications. As with any good presentation, we're going to summarize this as a graph. There's always this trade-off in our work as data scientists or software developers between complexity and what I just call being hackish in terms of trying to implement something. And we've probably all felt it. We're working on something more complex, but we're trying to do it without hacking around, without being too clever in the implementation, without reverse-engineering something just to make it work, because that creates long-term liability.
You might be the only person who knows how to deal with a particular solution that you've built. So in that upper left-hand quadrant, you see that sort of programmer's lure, where you handle a lot of complexity without doing it in a very hackish way. In the lower right-hand corner is the Rube Goldberg zone: doing something that's not very complex, but doing it in an absolutely hackish way. And that 45-degree line is really the balance: recognizing, okay, how do we do more complex things without being too hackish in the solution? This is something that Bejan and I and our team think about quite a bit. How do we do our work well is essentially what this summarizes to. So we're going to talk about how microservices help you handle greater complexity, or more complex needs, without going too far into the hackish zone.
So this is something that we often see in growing an analytics tool set. As you see in those big bubbles, we talk about the progression of user needs over time. When you build a data science application or some sort of solution, immediately you make folks happy. You make your data scientists happy, you make your developers happy, because you say, hey, somebody needed something, I got a version one out there, and everybody's happy. Then the user needs increase. Sometimes that's just, okay, we need a few more graphs. Then there's more complexity, more interactivity that's needed; other things that need to be polished, nicer-looking buttons. Then there are things that start to kind of grate at you. Like, "I really like these graphs. Can you have them export to PowerPoint?" Okay, we can. We don't want to do that, but sure, if that's what you need.
And then after you get all that done, a user might come back to you and say, actually, no, my priorities were different; I need something completely different. This is the progression where you're trying to balance your data scientists' happiness and your developer team's happiness, so they can stay engaged and continue to get satisfaction working on projects, while trying to avoid some of those grating aspects of progressing and changing your applications and solutions over time. And when you do this in Shiny, a number of different issues pop up. One of which is that it becomes more and more difficult to have multiple developers working on the same piece of code, because of the way Shiny applications and their application structure tend to work: you tend to work in one or two individual files. Now, there are workarounds to this. Certainly you can source other files, you can do other tricks, but it gets complicated over time.
So you might have two coders who are trying to work on the same segments of code, or somebody working on a very large segment of code while other people are waiting on that segment to be done. And oftentimes conflicts start to arise. We're going to talk about a project today that was getting to 15,000 lines of code in Shiny, and it was creating a large number of conflicts as the development team was trying to work on new features.
What are microservices?
So that has led us to microservices. If we take a look at those different challenges and those balancing needs, we want to start talking about microservices. We're going to describe this at a high level and then dive into it quite a bit. And I will say there will be a forthcoming blog post that dives into the technical details a lot more, so we're going to address this at a cursory level for now. In short, microservices help separate out the different layers of an application. There can be a web or user interface that allows somebody to navigate information, but that's separate from the underlying logic of the application. In the case of Shiny, by contrast, those things get mashed together into the same, or what we call a monolithic, code structure. Microservices separate that out: underneath the interface, you have a series of APIs or web services, RESTful APIs and the like, that control that interactivity.
So when somebody is clicking on something in a web browser, it's communicating via APIs behind it. And behind those APIs is a data storage or database technology that allows you to query and bring in data. And that separation between the user interface and APIs and web services helps simplify your code structure and also allows different people to work on the same bits of code. And the reason why we're talking about this here today is because that web interface and those APIs and web services can be completely hosted on RStudio Connect. RStudio Connect basically is a web server and you can host web files on there. If you upload index.html, it will render that in addition to Shiny and everything else. So it allows us to host things such as a React or Angular or other web technology within the RStudio Connect environment.
And for us, we use the React technology. Again, that's hosted on top of APIs and web services, which could be done either in Plumber, which we have done in our team, or using Flask in the Python language. So what does a simple example of a microservice look like? Let's say we have a bunch of time series data and we want to forecast it. You're actually going to see a demo of this here in a moment. But to explain the architecture: you have data holding historical time series information, all the historical entries for the time series. Then you have a model, let's say an ARIMA model, that can do the forecasting and is written by data scientists. And then on top of that, you deploy APIs and web services that communicate back to that analytical model, which in turn communicates back to that database and actually does the forecast. The forecasted data is then presented up to the website or dashboard that allows the user to see it in a visual format.
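To make that architecture concrete, here is a minimal sketch of such a forecast service, written in Flask (the Python option mentioned above; the demo itself uses Plumber). The `/forecast` route, the naive "repeat the mean" model, and the in-memory sample data are hypothetical stand-ins for a real trained ARIMA model and a real database layer.

```python
# Minimal sketch of a forecast microservice (assumptions: Flask instead of
# Plumber, a toy model instead of ARIMA, a list instead of a database).
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the database layer: historical time series values.
HISTORY = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0]

def naive_forecast(history, periods):
    """Toy model: forecast every future period as the historical mean."""
    mean = sum(history) / len(history)
    return [mean] * periods

@app.route("/forecast")
def forecast():
    # The front end asks for a horizon, e.g. GET /forecast?periods=48
    periods = int(request.args.get("periods", 48))
    return jsonify({"periods": periods,
                    "forecast": naive_forecast(HISTORY, periods)})
```

A React front end, or any HTTP client, would call `GET /forecast?periods=48` and render the JSON it gets back; the model and data layers never have to know what sits in front of them.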
And the reason we separate that apart is that the front end, that user interface, can then be developed by an independent web developer or a full-stack developer, and you can have web designers working on it. So you don't need to hire that Shiny developer anymore and focus on that very particular skill set; you can bring in web development technologies that are more broadly used in other environments as well. Meanwhile, the web services can be programmed by your data scientists, because they can be done in R or Python, as well as by a data engineer, backend developer, or other full-stack developers. Again, this is in contrast to a Shiny application architecture, where the front end and all that logic are often buried in the Shiny piece of the application, which restricts the number of individuals who can really work on that piece. And the takeaway of why this is so beneficial is that it allows the front-end development technology to flourish, because you can tap into the entire ecosystem of development technologies that your web developers can bring to bear.
Hello world demo of microservices
So I talked about a time series example. It'd be great to share an example and actually take a look at a code base: what do microservices look like within an application, using a very simple hello world example? And for that, Bejan, is it okay if I turn it over to you?
Absolutely. Yeah, I will take it on from here. Like Tom mentioned, this is a very simple demonstration of a microservice that would do some prediction. I'll get into a little bit of the details of the back end in a second. But this took not more than a day to put together. The front end is in a React application and the back end is hosted on RStudio Connect using Plumber. This demonstration is pretty simple. A user would come in, they would say, I want to see a forecast of some time series data over the next 48 periods. We'd hit submit. What's happening now is the client side is making a request to RStudio Connect to our API. It's making that prediction with our pre-trained model, and then it's responding with all of that information.
The benefit of having those two separate, as opposed to having Shiny kind of handle it all at once, is that Shiny has to render the front end and do the computation on the back end. That's fine for one person. But when you try to scale things, that puts a lot of load on the server. Having this separation lets the client side do some light calculations and rendering, and then the heavier calculations on the back end. Again, very simple demonstration of what microservices can do. One thing that I want to also kind of call out here is that one of the benefits of having them separate like this is that you can actually touch the back end without the front end.
I'm not sure if I can zoom in on this one, so I'll just speak to it briefly. I'll get into a little bit more on the testing, which uses Postman, which is what I'm showing here. But one benefit of a microservice back end is that you can touch those back-end points without having to go through a web interface. So one thing that I just did is I made a prediction, but that prediction service is actually calling another service that does our logging. So I can also make a request and see, OK, I just made a request two minutes ago for a prediction.
Just being able to touch the back end from different systems is a huge benefit for microservices because you don't have everything encapsulated into one code base.
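That idea, exercising an endpoint with no front end at all, can be sketched with a plain script. This is not the real Connect-hosted API: the toy server below stands in for the Plumber service, and the `/predict` path and JSON shape are invented for illustration. The point is only that any HTTP client, Postman, a script, or another system, can hit the same endpoint the UI uses.

```python
# Sketch: calling a (stand-in) prediction service directly, Postman-style.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PredictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Toy endpoint standing in for the real service on Connect.
        body = json.dumps({"forecast": [1.0, 2.0, 3.0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# Start the stand-in service on a free local port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Touch the back end directly, no browser or front end involved.
url = f"http://127.0.0.1:{server.server_port}/predict"
with urllib.request.urlopen(url) as resp:
    payload = json.loads(resp.read())
print(payload["forecast"])  # the same JSON body a Postman request would show
server.shutdown()
```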
So what I'm showing here is the architecture of that relatively basic microservice app. So we saw the web interface, which is a React based application, and I made a request to one microservice, one service that does our forecast. That service also made a request to a separate service that handles all of our logging.
Reason to separate out logging
It may not make sense why you'd have these two separate until you start thinking about building other products; at that point, you probably want to reuse the same logging service. You get the benefit of not having to build logging into every single app, and can instead call the one instance that you have from every single one of your apps.
One example I can give you is that about a year ago, we had a fundamental change to our logging code, where we were using a package at the time. Every single one of our Shiny apps, which was roughly about 20 of them, had that code built in when they were published to RStudio Connect. When we had to make that change, we actually had to have all of our developers pause their work, spend about two weeks making that change in their code, testing their code, and then republishing their applications, every single one of them. Had we had a logging microservice, it would have been as simple as changing the one service, still doing the testing, but it would be substantially less work.
So that's the reason for the separation, and a good reason why you'd want to separate those duties instead of having it all in one code base. And the database is very similar to Shiny: you would have some sort of storage, so that your app isn't saving data to a file system or something like that.
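As a hedged sketch of that design, every app could share one small client helper that POSTs log events to the central logging service, rather than embedding a logging package in each app. The service URL and the event schema below are hypothetical; the shape of the real logging service isn't shown in the talk.

```python
# Sketch: a thin client for a shared logging microservice. The URL and
# payload fields are assumptions for illustration, not the real service.
import json
import urllib.request

LOGGING_SERVICE_URL = "https://connect.example.com/logging/event"  # hypothetical

def build_log_event(app_name, message):
    """Shape the record the logging service expects (assumed schema)."""
    return {"app": app_name, "message": message}

def log_event(app_name, message, url=LOGGING_SERVICE_URL):
    """POST one log record to the central logging microservice."""
    body = json.dumps(build_log_event(app_name, message)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    return urllib.request.urlopen(req)
```

With this split, a fundamental change to how logs are stored means changing the one service behind that URL, not republishing twenty apps.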
How to design and plan microservices
So now I'll get into a bit of the lessons learned that we came across while transitioning over to a microservice architecture. About a year, year and a half ago, we started the first application on this type of architecture. We were moving from a monolithic Shiny application; as Tom mentioned before, it was the application that grew to about 15,000 lines of code. There were some challenges with the monolithic architecture that more or less pushed us into making the decision: do we make the switch now? So a year and a half ago, we started that switch. The intent of this part of the presentation is to talk through some challenges that we ran into and how we overcame them.
Challenges to getting started with microservices
So there are four pretty overarching challenges with getting started with microservices. The first is that getting started with microservices can be very daunting. There's a lot of material out there, and microservices are defined very similarly, but slightly differently, depending on which source you look at. On top of that, planning your application is a very open-ended problem, because you're thinking of a service that could potentially be reused in a future use case that you have no idea about. So what you want to do is plan for a service that's general enough that it could be applied in the future without you having to change the interface at that point in time, but not so general that you're trying to build for the world.
On top of that, because these services can be used with different user interfaces, it becomes a lot more important to define what that interface is, because if you do have to change it in the future, you have to go back and you have to test all the older versions or handle version control very carefully. So getting started is quite a challenge and we'll talk to that a bit here.
Separation of developer duties comes with a lot of benefits because you can hire for specialized skill sets, but at the same time, coordination amongst your team is even more important in that case. The application that we built over the last year and a half was actually a worldwide team from literally all parts of the world. We had maybe a two hour window of everybody being online at the same time. And so we'll get into a little bit of how we overcame that challenge in a second as well.
The third thing here is that microservices are great for scaling things out, but they do come with their own set of risks that monolithic apps don't necessarily have. I'll talk a little bit more about that in a second. And then finally, with the whole microservice architecture, for us as data scientists there was a big knowledge gap. The way to address that is to attend meetups like this, later take a look at our blog posts, and do your own research as well. Data scientists traditionally may or may not know a lot about design patterns or HTTP requests, so there are generally some knowledge gaps. On top of that, smaller teams and smaller projects may not get as much of a benefit out of switching to a microservice architecture versus a monolithic app, purely based on the fact that microservices may take a bit more upfront cost. And if you're not planning on reusing those services, or have no intent to ever reuse them, it may be cost that you don't necessarily want to take on. So that's the fourth challenge: the cost-benefit of it is something to always consider.
So how do you address getting started? So one thing that we started with is using domain driven design. If you search microservices, this is very commonly what pops up first. Domain driven design is using the business domain. So knowledge about the business to conceptualize what you actually need to be building. And then you take that and you build those as entities in your system. So I've listed some steps here. The first thing is that you want to identify your entities in your business domain. In this graphic at the bottom left here, I've just put up a simple kind of e-commerce sort of thing. So you have customers who are their own entity, you have products that are their own entity, and you have orders that are also an entity. Those are three entities in a very simple system or a very simple business.
The second step is to understand how those entities relate to each other. Customers will search for products, they'll make orders, and orders will have products. A relatively simple relationship. But the point of these two parts is to get to the third part: what services do you need? What actions do you need to take on these entities? On the right-hand side here, for each one of the entities I have, I have used HTTP API verbs. So you get a sense of: OK, for a customer, I need to be able to get the list of customers, create customers, edit customers, and maybe remove customers. That defines the endpoints you would individually need to create, and those endpoints make up the service for customers. Similarly, for products, you may not need to be able to edit, which is the PUT request, or delete, so you may have a little less work for products. But again, you create that service for products, and then similarly for orders.
The key point for this slide is that following these three steps gets you from this kind of wide open field to these are what I need to be focusing on. There may be things around them, but these are the things that you need to be focusing on.
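Following those three steps, the customer entity from the example could be sketched as one small service whose routes mirror the HTTP verbs on the slide. This is an illustrative Flask sketch with an in-memory store, not the actual KPMG code (which uses Plumber), and the field names are assumptions.

```python
# Sketch: the "customers" entity service from the domain-driven design
# example, one route per HTTP verb identified in step three.
from flask import Flask, jsonify, request, abort

app = Flask(__name__)
CUSTOMERS = {1: {"id": 1, "name": "Ada"}}  # stand-in for a real database
NEXT_ID = 2

@app.route("/customers", methods=["GET"])
def list_customers():
    return jsonify(list(CUSTOMERS.values()))

@app.route("/customers", methods=["POST"])
def create_customer():
    global NEXT_ID
    customer = {"id": NEXT_ID, "name": request.get_json()["name"]}
    CUSTOMERS[NEXT_ID] = customer
    NEXT_ID += 1
    return jsonify(customer), 201

@app.route("/customers/<int:cid>", methods=["PUT"])
def edit_customer(cid):
    if cid not in CUSTOMERS:
        abort(404)
    CUSTOMERS[cid]["name"] = request.get_json()["name"]
    return jsonify(CUSTOMERS[cid])

@app.route("/customers/<int:cid>", methods=["DELETE"])
def remove_customer(cid):
    CUSTOMERS.pop(cid, None)
    return "", 204
```

A products service would look the same minus the PUT and DELETE routes, exactly as the slide's verb table suggests.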
Now, going into the second part of planning, which is applying design patterns, specifically cloud design patterns. I mentioned that these entities and relationships define the areas you need to focus on, but there are things around them that may pop up as well. That's what design patterns help you identify. There's plenty of material out there on design patterns, but three that we've used are known as sidecar patterns, anti-corruption layers, and backends for frontends. What these do is categorize services that may not necessarily apply to, say, the products entity, but may serve a very particular frontend that needs the list of products, or needs products formatted in a certain way. You can write a service that serves that frontend specifically.
Very commonly, we have websites that are designed for web and mobile, and they unfortunately have some differences. Being able to write a service that supports each one, or handles one versus the other, helps you abstract that out instead of having one service handle too many things. Similarly, sidecars are things like the logging service that I pointed to earlier: things that your application or your client will never touch, but that your backend architecture will touch, things used behind the scenes that are never seen by the user.
The third one here is anti-corruption layers. I bring this up because while we were switching over from a monolithic app to a microservices app, there was a transition we had to take. We had a legacy system that we had to interface with our newer application. Anti-corruption layers are designed for that. The intent here is that you serve from the legacy system to your new application, but you don't have to include that kind of information inside of your app services. The idea of microservices, starting with micro, is that you have them very specialized, doing one thing very well, but have logical separation to where one service is not stepping over another service.
The next part here is separation of development duties, and I'll just go into a bit of what we did. We defined a very structured documentation format for every one of our endpoints, every one of our services, and we also utilized Swagger pretty heavily. A lot of these details will be coming in the blog in the future, but at a high level, every single endpoint we built that comprises a service started with documentation, this exact documentation on the right here. The intent was to have the front end, back end, every developer on the team, on the same page, knowing exactly what the interface is going to be. Like I mentioned before, if you change the interface, everybody else who's touching it has to change the work they're doing. So establishing that up front was crucial for us to be effective.
We found that this kind of format helps in a lot of ways because we define what's required, both in a request and a response, or what comes back in your response, as well as types. And an example generally helps go a long way. On top of that, one of the great things about Plumber is that it builds your Swagger documentation automatically. And so while you're building your microservice, you can kind of validate what you're building, or anybody can validate on the team what you're building, against the documentation. They just simply have to go to the URL on RStudio Connect and click on the right endpoint.
Okay, so the third part here is talking about how to manage the newer risk that comes with microservices. So one of the big risks of microservices is that they can be distributed. A monolithic architecture is generally one virtual machine or one machine that's running everything. So all the calculations for the back end and for the front end are all from the single node. A microservice could be on a Lambda function, it could be on another virtual machine, it could even be external to your company. It could be using a third-party API that serves for a certain reason. All of those come with risks because every degree of freedom you give to your system will introduce a potential failure point.
And so being able to do functional testing and regression testing easily is crucial. With microservices and APIs being so fundamental to the web, there's a lot of supporting documentation and software that will help you do this. I showed Postman earlier today; Postman is what we used for every single one of our endpoints. For every endpoint that we developed, during our pull requests we reviewed the tests as well, made sure we had adequate testing, adequate coverage, and made sure that it worked when it was supposed to work and failed when we forced failures. Postman is also great for performing schema tests. In a loosely typed language like R or Python, it's not very common to check your variable types, or if you do, it kind of clutters up the code. Postman can do schema tests as well, so you can test whether there's a number where you expect a number.
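The kind of schema test described here can be approximated in a few lines: assert that a service response carries the fields and types the contract promises. The forecast response shape below is an assumed example, not the real API contract; Postman's built-in schema validation does the equivalent against a JSON Schema.

```python
# Sketch: a hand-rolled, Postman-style schema check for API responses.
def check_schema(payload, schema):
    """Return a list of violations: missing keys or wrong types."""
    problems = []
    for key, expected_type in schema.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}, "
                            f"got {type(payload[key]).__name__}")
    return problems

# Contract for a hypothetical forecast endpoint's response.
FORECAST_SCHEMA = {"periods": int, "forecast": list}

good = {"periods": 48, "forecast": [1.2, 3.4]}
bad = {"periods": "48"}  # wrong type, and the forecast field is missing

assert check_schema(good, FORECAST_SCHEMA) == []
assert len(check_schema(bad, FORECAST_SCHEMA)) == 2
```

Running a check like this in every pull request catches a service that silently changed its response shape before any front end breaks.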
And finally, the last point here, which I don't have a slide on, is that there's a lot of research out there, and even attending this meetup today is a step in the direction of filling in the knowledge gaps and considering the costs and benefits. I just want to call out that there is some overhead with microservices, but not a lot, and there are multiplicative benefits in the future if you start reusing these services. But for each team, it depends on the size of the team, the amount of repeatable work that you do, and the knowledge gaps that you want to overcome.
Application demo: Coda
So now I'll switch over to a demonstration of the application that we built using microservices. This application is called Coda, the KPMG organization design analyzer. Coda is a tool that lets users analyze a company's census, the entire organization structure. We provide a couple of views of the organization and some statistics as well. The intent is that you can do a quick analysis if you need to. You can reorganize the organization: if you see people in finance reporting to people in sales, that may not be so efficient for your organization. The intent of Coda is to let you assess, organize, redesign, do some quick analysis or comparisons of your design, and then finally ship to your customer.
What I'm showing here is actually the reason we switched to a React-based front end. This is what we call a radial plot. In this organization, the CEO is in the center of the org, their direct reports are the next layer out, and then the reports beyond that. Doing this in Shiny became quite cumbersome, and we'll talk about that in a little bit, because you're working with two languages that have to speak to each other within the one application.
But Coda allows you to do things like, for example, if I wanted to select this person and reassign them here, I could reassign their entire organization, everybody underneath them, as well. I can do simple things like reorder these dots. And, though I won't run them here, we can create cuts of the same visual and produce any number of charts, because generally our clients do ask for PowerPoints.
Again, this is an analysis tool. So R&D is colored over here. We see that this person seems to be the lead of R&D, but there are also some people over here that maybe they shouldn't be managing, or maybe we need to separate that into another person, because it's such a wide span of people.
I won't go into the details of every one of the features. There's some comparison abilities as well. And then also just a relatively simple dashboard that shows your organization at different levels, broken up by the departments that are at each level.
Now if I go over to the implementation side. This is where users start to make changes to the organization itself. Again, same organization, same CEO. Let's say I want to search for a certain person, let's do Ferguson Page. I would find her over here. If I wanted to reassign her, I could move her to this person here, and now I've modified the organization. You can also add new people to the org by clicking here. And one of the features that I personally love is that you can track all the history of what you're doing here and roll it back. If I made a mistake, I can remove that change.
The Coda application started as a Shiny app. We were embedding D3 visuals, especially the radial chart, inside of the application. We actually had an instance of this that I think was plain JavaScript, not necessarily D3. But we started getting into the habit of writing JavaScript messages from the back end to the front end to trigger a callback: if a user clicked on a node here, what was the piece of information they were looking for? So we were data scientists writing in R and writing in JavaScript to get this all to work. That's why we transitioned to having React developers write this in JavaScript and D3, while we communicate with them using the language that we know, which is R.
That was the genesis of Coda and why we made this big switch. On top of that, the styling and everything was much easier to do in a framework like that. Of the four screens that I'm showing here, I touched on assess and design, the radial chart and the design of an organization. I didn't go into compare and present just for time's sake, but the intent of showing this is that for any of the visuals we have, we have people with the right skill sets building them, and the data scientists working within their own professional areas as well. That kind of separation of duties makes for a lot faster development, because you don't have a data scientist learning how to write JavaScript.
And then just a very basic application-level diagram of Coda, the application I just showed. Very similar to the simple demo I shared earlier, we have a React web app, but we also have a Power BI dashboard that interacts with our microservice. I mentioned earlier that microservices can be reused, and this is a perfect example of that. Power BI is interacting with the same microservice that our React app is. So now we have two interfaces, and we have the opportunity to say, okay, we want to build this in React because we need high interactivity, and we want to build this in Power BI because maybe we want to ship that out to the client as a dashboard. That gives us the freedom to use the best tool for the problem.
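Because the service speaks plain HTTP and JSON, any consumer, whether React, Power BI, or a script, uses the same contract. Here is a hedged sketch of that client-side logic; the base URL, endpoint path, and payload fields are all hypothetical, and the HTTP transport is injected as a callable so the same logic can be exercised without a live server:

```python
import json
from urllib.parse import urlencode

# Hypothetical URL of a microservice hosted behind RStudio Connect.
BASE_URL = "https://connect.example.com/org-api"

def get_org_stats(department, transport):
    """Fetch summary stats for a department from the shared microservice.

    `transport` is any callable taking a URL and returning a JSON string,
    so a React app, Power BI, a script, or a test harness can each supply
    its own HTTP layer while reusing the same endpoint contract.
    """
    url = f"{BASE_URL}/stats?{urlencode({'department': department})}"
    return json.loads(transport(url))

# A fake transport standing in for the live service in this sketch.
def fake_transport(url):
    return '{"department": "R&D", "head_count": 42, "span_of_control": 6.5}'

stats = get_org_stats("R&D", fake_transport)
```

In production, `transport` would be a thin wrapper over a real HTTP client; the point is that every consumer hits the identical endpoint.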
I mentioned sidecars earlier. Similarly, in our application, we use them for logging, some user authentication, and monitoring. On top of that, one additional major benefit is being able to test the endpoints either in a production server or in a staging server, and we do that regularly. For every single release, we test the entire app, the entire set of services, and we test periodically as well. Netflix actually has this idea of a Chaos Monkey, a testing node designed to try to find weaknesses in your system. That's something on our horizon as well.
Recap and benefits
So going back to what we showed earlier: what is that tradeoff between complexity and simplicity? Users needed interactive graphs, which we were trying to do in D3, and that was creating a lot of complications just in debugging. Then came exporting things into images and into PowerPoint. These new requirements kept coming along, and it was creating some discord within the data science team because they kept needing to learn more, and that's very taxing, very tiring. So this allows us to have people focus on the front-end technologies versus the data science piece versus the back-end piece.
So this is our path. Where we first deployed Coda and the other applications that we have, Coda being the example we use here, was as that monolithic Shiny application. It let us handle a fair amount of complexity without trying to hack around things. But then we got into JavaScript messaging, and then into trying to get D3 to work within Shiny, and it felt like we were doing too much hacking around. Over time, we knew that was going to be a risk to the maintainability of the applications themselves. By switching over to a microservices back end and a React front end, we could take on the added complexity the applications were asking for but get out of that danger zone of hacking things together to make them work.
So we talked about some of the benefits. Those benefits are: we enabled division of labor and specialization, letting people focus on what they do well, and the ability to program more advanced user interfaces. What you saw running here today was all running off of RStudio Connect, and I think, subjectively, qualitatively, it looks very different from what people typically expect to see running in a largely R-based application. There are multiple consumption points: developers can hit those APIs, Power BI and other BI applications can hit them, custom front-end applications can, other programs can. People can consume them; they're reusable. And you're relying on a wider universe of tools and libraries.
The R ecosystem is amazing. There are hexagons everywhere showing its interconnectivity, but a lot of those packages are really wrapping some other technology. This approach lets us reach out to those other technologies directly, so again, it avoids that sort of hackish territory. And as Bijan mentioned, there's a cost-benefit question: does this make sense, or can I rely exclusively on the R ecosystem, and likewise the Python ecosystem? It becomes more code-agnostic and really lets you adopt formal DevOps procedures: continuous integration, continuous deployment, code versioning with Git. All of this underlies everything we've talked about today, because you need to be able to do that very well.
But as Bijan and the team mentioned, there are a number of challenges. It does require more planning. There's a cost-benefit question for every project: do we do pure Shiny, or do we do this microservices approach? It requires strong team coordination, because it's not all in one thing, it's in different pieces, and you need to be coordinated across those pieces. I describe it as building a bridge starting from both banks of a river: you start on both sides and you need to meet in the middle, and if you don't coordinate, that bridge isn't going to meet. That's what this boils down to. There are unique risks within distributed services, and additional skills and resources may be required. But if you're a growing team, this is kind of the point: you can make hires in front-end development and other specializations and allow them to really flourish and focus on that. And again, you're working with industry-standard technologies.
I know we've got one more slide, but I think we can wrap it up; I think we're two and a half minutes over what we were going for. Thank you for having us here today. We're really glad to contribute to the ecosystem of knowledge that is the R community. Hopefully this is helpful for you; we had to do a lot of research when we were trying to implement this ourselves. As we mentioned, we'll do a technical write-up as well to get into the nitty-gritty of what makes this work versus not work. But Rachel, I'll turn it back to you if we have any time for Q&A.
Q&A
Yeah, there are a lot of great questions here. But I just want to say thank you so much, Tom and Bijan, for an awesome presentation. It's great to see APIs and microservices in action. I always say we're all clapping, you just can't hear us since we're not all in the same room clapping for you. Thank you so much. I'll go over to Slido. And just a reminder, if you want to ask questions, you can use the Slido link and put your name in so I can call on you, or you can ask anonymously too. But one of the questions was: can you use an R Shiny frontend together with a microservice backend?
Yeah, I can take that if you want, Tom. The answer is yes. You could use any frontend technology, and Shiny is a perfectly fine frontend technology as well. You would simply be making the same calls that a Python script or a React app would make.
Awesome, thank you. I see Rahul asked a question on Slido, and Rahul, feel free to jump in if you want to add any other context. It was: do you hire separately for backend data science development and frontend Shiny UI development? We do. We have a dedicated frontend development team now. As the team grew, we separated out the responsibilities of the data scientists, who originally did some pieces of frontend development, so now there's a specialized frontend development team. That team consists of React-focused developers but also UI/UX designers. There's an entire suite of design tools used to mock up how an application should look, which is fantastic, because instead of having to program something to show a user, you can show them wireframes or pretty advanced mockups, and those can be immediately exported into the structure of a webpage. That lets the React folks just start populating everything, so they don't have to take a sketch and ask, what color did you use? All of that automatically imports into the HTML and CSS that lets them do their work.
And our data scientists focus right now on two things: one is API and backend development, and the other is data science. So they still have development components, but they're programming in R, and soon Python. We've been focusing on R, and we're going to be doing more Python coming up. They create those APIs, but also do the pure data science work as well. So we've now separated those responsibilities. When we began, we were a team of five individuals; we're now 45. As our portfolio has gotten very large, Coda and two dozen other applications, that has allowed us to separate those functions.
Excuse me if I missed this, but are all microservices managed by a centralized unit? They don't have to be. One of the big benefits of microservices is that you can launch them on whatever service makes the most sense. Right now we do a lot on RStudio Connect because it does a fantastic job of not only hosting but load balancing too, and that's a big reason we were able to move so quickly. But you're able to use Azure Functions and, again, any sort of third-party service as well. If there is a service your firm decides is more beneficial to purchase and it's cheap enough, you can integrate it into the same suite of microservices and save yourself the headache of developing it. So they don't have to be on the same compute node; they can be as spread out as makes sense.
Someone else asked, who can access RStudio Connect in your organization? Right now we protect it with two-factor authentication in the Azure cloud, so only KPMG practitioners can access it. Eventually we're looking at getting to the point of a managed service or something like that, but right now it's KPMG employees.
Another question was: how difficult or easy is testing, given the distributed nature of microservices? Relatively easy. Even in a distributed environment, there's still a URL assigned to each one of your services. If it's a RESTful API, all you have to know is that URL and the verb you would use, a GET request or a PUT request. Postman does a pretty fantastic job of letting you set up those tests and also create a collection and share it with the rest of your team. You just need those two pieces of information. And from the user perspective, you don't need to know where the service lives; it's just there. So, relatively straightforward.
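That "URL plus verb" idea can be sketched as a tiny smoke-test collection, a stand-in for the Postman collections described above. The endpoint paths are hypothetical, and the `send` callable is injected so the same checks can run against staging, production, or (as here) a fake responder:

```python
# A minimal stand-in for a Postman collection: each check is just a
# verb, a URL, and the status code we expect back.
SMOKE_TESTS = [
    ("GET",  "/org/stats",    200),
    ("POST", "/org/reassign", 200),
    ("GET",  "/org/missing",  404),
]

def run_smoke_tests(send):
    """Run each (verb, url, expected-status) check.

    `send` is any callable (verb, url) -> status code, typically a thin
    wrapper over an HTTP client pointed at a staging or production server.
    """
    failures = []
    for verb, url, expected in SMOKE_TESTS:
        status = send(verb, url)
        if status != expected:
            failures.append(f"{verb} {url}: got {status}, wanted {expected}")
    return failures

# Fake responses standing in for a live environment in this sketch.
def fake_send(verb, url):
    return 404 if url.endswith("/missing") else 200

failures = run_smoke_tests(fake_send)
```

Sharing the list of checks, rather than the test runner, is what makes this easy to hand around a team, which is essentially what a shared Postman collection gives you.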
But when you decided to go the microservice route, what was your first step in building that architecture? So I think, Bijan, you and I, I flew down to Austin and we got into a whiteboarding session together to talk about the approach. How do we implement microservices within our application architecture? How does it overlap with our code and version control management and our continuous integration and continuous deployment? And what does that look like as a portfolio? The first bit was understanding where our challenges were. In the application development, in this case on Coda, we were trying to do a lot more, and we noticed there were more bugs happening and development timelines were getting harder to plan for. As somebody who has programmed throughout my career, you could tell it was getting very difficult for the team to navigate forward, even though the team was entirely dedicated to the project.
Understanding where those challenges were really allowed us to have a very productive conversation about how to implement microservices in a way designed to benefit the team, not just to do something a little bit different. From there, in terms of implementing the microservices themselves, Bijan, this is what you touched on in your area, domain-driven design: designing the microservices within the application in a way that made sense. What becomes interesting, and what we're now starting to tap into within our team, is that within a given application there are domains you can reason about. But now we have several related applications that might reuse some of the microservices originally built for other applications. So there's this sort of metadomain we're now considering: this application does something interesting that we want to consume, so how does it overlap with that application sitting over there? Now we're having that design conversation around the entire portfolio.
Okay. I had a question I wanted to ask the team, because I heard you were hiring, and I'm just curious to know a little bit more about what roles you're hiring for. So as I mentioned, we're growing. We're at 45 individuals now across four different countries, predominantly in the United States. And right now we're hiring for a machine learning engineer. We're looking for somebody who has a good understanding of the kind of technology we talked about today, how to implement it, and how to continue