
Enabling Remote Data Science Teams | RStudio (2020)
Whatever happens in the coming months, remote work is here to stay. The goal of this webinar is to provide data scientists and data science team leaders with the knowledge and tools to succeed as a distributed team. Some of the topics we will cover include: - Setting up a remote and collaborative R environment - Version control and Scrum - Using Shiny and RStudio Connect to share apps within and across teams - How to improve the UI and appearance of Shiny dashboards - How to scale Shiny dashboards to hundreds of users - How to build and grow a remote data science team About Alex: Alex is a Solutions Engineer at RStudio, where he helps organizations succeed using R and RStudio products. Before coming to RStudio, Alex was a data scientist and worked on economic policy research, political campaigns, and federal consulting. About Olga: Olga is experienced in production applications of analytical solutions, especially for FMCG companies. Recently she developed a price elasticity model for Unilever. About Damian: Damian is one of the four co-founders of Appsilon. Before founding Appsilon he worked at Accenture, UBS, Microsoft and Domino Data Lab. About Pedro: Pedro has nearly a decade of experience combining frontend and backend technologies, and is an expert on augmenting R Shiny dashboards with CSS and JavaScript
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, I'm really excited to be here. So let's start. I personally believe that the future of knowledge work will eventually be remote. So in some sense, we all got to experience the future in the last few months. However, this might not have been what you hoped for, because setting up a collaborative environment for your data science team can be challenging, even when working side by side in an office. At Appsilon, we spent years developing efficient systems for remote collaboration. And in this presentation, I will cover the best practices that work for us.
So I'm Olga Mierzwa-Sulima, and I'm a senior data scientist and an experienced project leader at Appsilon. I have a background in econometrics, and my main technological stack at work is R. I have been building our community here in Warsaw since 2014. I plan to cover three areas in my talk: project management, doing effective code reviews, and setting up a painless development environment in R.
Project management with Scrum
So when it comes to project management, almost all of the projects that we run at Appsilon are managed with the Scrum approach, with small variations. At the beginning of the project, we sit down together with our client and collect the requirements, which are translated into high-level tasks in the project backlog. At this stage, we also provide a high-level estimate of how much time we think we need. For example, we contract work for eight weeks, split into one-week sprints.
Every sprint starts with sprint planning, where the dev team and the client select the tasks to work on in a given week, and we also set the priorities. During the week, we meet daily inside the dev team to give each other updates. We also use those meetings as an opportunity to catch up on the small things that are happening inside the team, because we don't have those random meetings that happen next to the coffee machine in an office. The rest of the communication during the week we try to do async, using a dedicated project channel on Slack.
We also use Slack to give each other updates such as that we start or finish work or are in the do not disturb mode. We also hold one on ones between team members to discuss specific problems related to the project. But coming back to our scrum process at the end of the week, we do the sprint review where we show to our client the increment. Usually this means doing a demo of a Shiny app with new features or going to a notebook that validates new data science hypothesis. We also hold retrospective meetings to celebrate things that went well or to discuss potential improvements.
We use a project board to help us manage the backlog. We usually use tools available from GitHub, Asana or Jira. So agile development is a standard and common practice, but making it work for your team is the key. So we make sure to always have a well-organized project board that reflects the current status of work. Basically, if I look at the project board, it should tell me who is working on what tasks and what the priorities are.
Also, at Appsilon, we have this rule that all work needs to be documented. So whatever you are doing, there should be a corresponding task describing it on the project board. The task needs to have a clear and informative description, which is understandable not only to the dev team, but also to our clients, as they have access to the project board as well. We introduced an implementation plan, which is something that is non-standard compared to a classical Kanban view. Basically, the person who is doing the task has to provide a description of how they will approach the task and get acceptance from another team member before starting the actual implementation. This way, we make sure that any proposed change makes sense and we verify it early enough. Finally, the reviewer is responsible for checking whether the solution corresponds to the task requirements.
Effective code reviews
As you probably noticed, our project management flow is tightly related to code review. I will assume that the majority of you are familiar with that concept, and I will only talk about what makes it effective. We have a checklist that is part of our definition of done for a task, and it includes the following points. All the code needs to be peer reviewed, and there are no exceptions to this rule. So even fixing a typo in the documentation needs to be peer reviewed.
We also make sure that continuous integration checks, such as linting, unit tests and integration tests, are configured and passing. Next, any added or modified code has to follow our style guide. This is especially important because it allows us to ensure quality, do more effective code reviews, learn from each other and spot bugs in the code more easily. Next, all the changes need to be tested, manually or with automated tests. Further, we need to make sure that no new errors or warnings have been introduced, and that the readme, documentation and code comments have been updated according to the change.
We try to automate as many things as possible, and we use continuous integration with GitHub Actions to perform automatic checks and let the machines do the job. For example, we automatically run unit tests after every code push. We also use project templates to initialize the structure for typical project types, and use pull request templates that contain the definition-of-done checklist from the previous slide to help the reviewer do the review.
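As an illustration, a minimal GitHub Actions workflow along these lines might look like the sketch below. The file name, trigger, and individual steps are assumptions for illustration, not Appsilon's actual configuration:

```yaml
# .github/workflows/checks.yaml -- runs the CI checks after every push
name: CI checks
on: push

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - name: Install check dependencies
        run: Rscript -e 'install.packages(c("lintr", "testthat"))'
      - name: Lint the code
        run: Rscript -e 'lintr::lint_dir(".")'
      - name: Run unit tests
        run: Rscript -e 'testthat::test_dir("tests/testthat")'
```

In practice the dependency-installation step would usually restore the project library (for example with renv) rather than install packages ad hoc.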
Setting up the development environment
So finally, we can talk about the development environment. It's the most crucial part when working in R, otherwise the same code can give different results on different machines. Other projects can be affected, as in a global environment, packages can change and crash unrelated projects. Deployment can give you a real headache because establishing infrastructure can be challenging if it's not being tracked. And last but not least, your team members will waste time setting up an environment rather than jumping straight to work. This might not be a problem when starting a new project, but imagine adding a new team member in the middle or at the end of the project.
To solve all of those issues, we do the development in RStudio running inside a Docker container, which is fixed and dedicated per project. We leverage Docker and renv to control the underlying system, system dependencies and R packages. Docker and the renv package together make team collaboration easy. All the changes to the Dockerfile or the renv lock file, which records the packages used in the project, are committed and pushed to Docker Hub or GitHub, from where they can be accessed by other team members.
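A sketch of the renv side of this workflow, using standard renv commands (the talk doesn't show Appsilon's exact setup):

```r
# One-time setup inside the project's R session:
install.packages("renv")
renv::init()       # creates renv.lock and a project-local package library

# After adding or upgrading packages:
renv::snapshot()   # records the exact package versions in renv.lock
# ...then commit renv.lock (and the Dockerfile) to version control...

# A teammate -- or the Docker image build -- reproduces the environment:
renv::restore()    # installs exactly the versions recorded in renv.lock
```

Because `renv.lock` lives in the repository alongside the Dockerfile, checking out the project is enough to rebuild the same environment on any machine.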
The previous solution, however, still requires some level of DevOps skills to set everything up. So some of our clients choose RStudio Server, because it allows their data science team to purely focus on delivering value using their core competencies in a highly secure and flexible development environment.
There is no secret recipe to make your data science team work efficiently. In fact, we found that using Scrum with well-organized project board, doing code reviews and taking care of the development environment is essential for project success, no matter how your team is structured. Thank you very much.
And next, my colleague Pedro will tell you how to build a dashboard that can speak for itself. When it comes to user adoption, you cannot always be next to your users to explain how your app works. It's even more difficult when working remotely.
Building dashboards that speak for themselves
So I think one very important thing when it comes to remote teams is that you're not always building dashboards for external users. You're sometimes also building dashboards for your own teams to work in. And like Olga said, when you're in the office and you're next to someone, it's very easy to just push them in the right direction of how to use it. But when you don't have this direct contact with your users, a good solution is to put a bit of extra work into your dashboards and actually make them speak for themselves, especially when there are alternative dashboards. Users typically have a choice, and they will usually go for the one that feels better.
So a small background from me: I used to work as a web developer, so I used to do pages, I used to do applications, and eventually in the last couple of years I landed in R and in Shiny. One of the great things that I ended up learning about Shiny is that deep down there's a lot of JavaScript, a lot of CSS, a lot of HTML involved. This actually allows me to reuse a lot of what I learned as a web developer when working on Shiny dashboards.
So why does UI actually matter? The question is, if you have a dashboard that works, why should you actually care about making it look nicer? And the truth is that UI, the way your dashboard looks and feels, is very important for your users. This is the first experience that your users are going to get using your application. And especially now that everyone is so used to the way the Internet works, the way the applications on your phones and tablets work, users expect things to behave in a specific way.
It's also very important to notice that users don't usually care what's in the background. And this is valid even for some Shiny apps. If your users use the application to get a specific insight or to get some data, they don't really need to know how complicated the background process is. They want to have a nice experience. They want to be quick and efficient at using it. And in general, you end up with this concept that if users don't feel engaged and comfortable when they use your application, they might get frustrated. They might stop using it. Maybe they'll pick a different alternative if it's possible.
Six key UI tips
So instead of going over the whole UI and what you can do, I decided to give you a couple of things you can look at, a couple of parts of the application that you can keep in mind, either when you're using an application, when you're giving feedback on one, or when you're actually building the application. So I picked six very important topics, very important points when it comes to UI.
The first one is: keep your interface simple. This means that any element that isn't actually doing anything should probably be removed. Cleaning up your dashboards also means that you'll pass a very clear message of what you want your users to achieve. You end up with something that is very straight to the point, because all the elements there actually serve some functionality.
Another important tip is to be consistent. So if you have a text input, reuse it. If you have a button, reuse it. Users learn to navigate your application, so they expect elements to have the same kind of behavior; they expect flows to work the same way. And you can define patterns of how your application flows when you have very complex applications.
Another thing to keep in mind is to be purposeful in your layout. This basically means that just because all the information is there doesn't mean that it's being shown in a good way. So adding white space where there's no information is important, as is using the UI elements to actually create some kind of structure. In the case of dashboards, this is the sidebar, the sidebar navigation, the top navigation: all of these can help you create a bit of structure. It also means that you can draw attention to important information. You can use colors, you can use the size of the elements to actually change the way that your users perceive parts of it.
Another one: there are a few really good things that you can manipulate when it comes to any kind of web application. These are color, texture and typography. So work with these. Don't overdo it when it comes to color, when it comes to contrast, when it comes to shadows. Use them to attract your users to specific points of the application, or to make things less evident and less important. Remember that different fonts send different messages: bold, large fonts send a different message from small, less prominent fonts, for example. And also remember that the size of the fonts and the way the text is arranged really help people go through your application and be more efficient at their work.
Also, remember to provide feedback. You want your users to always know where they are, what they're doing, and whether the application is still doing something or not. This usually means using messages, progress bars and notifications, anything that gives some kind of feedback on what's happening. This really helps reduce frustration, especially when you have scripts that run for a very long time or users are uploading a lot of data; you don't want them to think that the app just crashed or that something is wrong. If you have something that you know is going to take more than a few seconds, add a progress bar. If you have something that can be done in the background while your user is using the application, maybe show a notification when it's over. All of this constant feedback really helps when it comes to the user experience.
And finally, remember to think about the default values. Shiny dashboards live off user inputs, and a lot of the time these inputs can be defaulted to values that are very common. If you have a drop-down and you know which value is most commonly used, make it the default or preselect it. If you have a checkbox that most of the time is selected, select it by default. Anything that avoids unnecessary actions by the user can make the experience much faster and much smoother.
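As a small sketch of this tip, the hypothetical inputs below show where Shiny's `selected` and `value` arguments set sensible defaults; the input names and choices are invented for illustration:

```r
library(shiny)

ui <- fluidPage(
  selectInput(
    "region", "Region",
    choices  = c("EMEA", "Americas", "APAC"),
    selected = "EMEA"        # preselect the most commonly used choice
  ),
  checkboxInput(
    "exclude_outliers", "Exclude outliers",
    value = TRUE             # checked by default, since most users want it
  )
)

server <- function(input, output, session) {}

# shinyApp(ui, server)
```

With defaults like these, a typical user can get a useful view without touching any input at all.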
How to achieve better UI
So how do you actually achieve all of these things? There are two ways of going at it. The easy and quick way is to use UI packages. We're all familiar with Shiny and its default UI, but you can, for example, use shinydashboard to completely change the way the application looks, and some of these UI tips are already implemented in the actual elements provided by shinydashboard. Another good package is shiny.semantic. This is based on a different CSS framework, but it also comes with a lot of elements where you don't really need to think about some of these tips, because the elements already include them.
Another option is to create your own layout, your own styles. I am giving you here three very common examples that I use. Maybe you found some HTML templates online that you would like to use: use htmlTemplate. This is a function that just lets you import a full HTML file into your Shiny application. Do you want to style something? Use CSS. If you have a lot of styles and want to go very deep into the changes, you can even try Sass, which is basically a preprocessor for CSS; it's CSS for developers, let's say. Do you want to add custom behavior that isn't there out of the box? If you're building something very complex, you can try htmlwidgets, but for very simple behaviors you can even use just vanilla JavaScript with a couple of actions.
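A small sketch of two of these options in a Shiny UI; the file names (`www/template.html`, `www/styles.css`, `custom.js`) are illustrative assumptions:

```r
library(shiny)

# Option 1: import a full HTML file as the UI. Named arguments fill
# {{ placeholder }} slots inside the template file.
ui_from_template <- htmlTemplate(
  "www/template.html",
  plot_area = plotOutput("plot")
)

# Option 2: keep a standard Shiny layout but attach custom CSS and
# a vanilla-JavaScript file served from the www/ directory.
ui_with_assets <- fluidPage(
  tags$head(
    includeCSS("www/styles.css"),      # custom styling
    tags$script(src = "custom.js")     # small custom behaviors
  ),
  plotOutput("plot")
)
```

Either UI object can then be passed to `shinyApp()` with a normal server function; the rest of the app does not change.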
And there are some examples of very popular packages that were built using htmlwidgets, for example. So you can see that these are standards for what we currently use as developers, and they're actually very easy to use. So I'll just quickly finish with a couple of examples of what you can actually do. This is an example of a dashboard that started as a simple shinydashboard. We then added shiny.semantic and a bit of custom styling, and we got a completely different feel for the actual dashboard. Another example, with a link at the bottom, is one I actually did a couple of months ago: the whole idea of going from just the base shinydashboard into a fully complete custom solution.
So the dashboard started like this. I did some testing with shinydashboard, where the feeling was completely different with just a few changes. semantic.dashboard also gives a completely different feeling. And then we ended up with a completely custom-built solution with a lot of CSS, a lot of custom JavaScript and even a theme switcher. So in general, the limit is your imagination. And just as a final example, this was my entry for this year's Shiny contest. There's a blog link at the bottom that you can follow if you want to read a bit more about all the technologies, everything that actually went into this. This is actually a Shiny dashboard, I promise you; if you don't believe me, you can check the GitHub repo.
Scaling Shiny applications
Thank you, Pedro. So, yeah, it's amazing, we are going through a journey together when it comes to working remotely. In my part, I would like to share with you how we can scale a Shiny dashboard. First, let me introduce myself. My name is Damian Rodziewicz, and I'm one of the founders of Appsilon. I have worked with many different languages like Scala, C#, Python, C++ and JavaScript, and I always like to combine the knowledge from different technologies and apply it to R and Shiny. I must say that throughout my experience, I have never seen a language with libraries that make it so fast to build great-looking applications that are then used in production.
So, a short agenda: I'm going to start with an introduction, then I will give you an overview of scaling and some details on how you can make your applications faster with just a few simple rules. So this is you and your team. Thanks to Olga's talk and your knowledge, you already built the workflow for your team. And knowing Shiny and how to make your application beautiful, you deploy an application that is available in production for your users. And then you start having people using it.
First, it's one person. Suddenly, there are a few more of them, and you get feedback. And I really like this slide because it sums up our experience when we talk to different clients. There is always someone who created this new, fresh change in the company. They introduced Shiny. They built an application that everyone likes, and people just love the application. Features are added fast because Shiny is so robust, and the application does exactly what people need in their daily job. And probably a lot of you know this.
However, suddenly you have so many people using your application that the application starts getting slow. And this is usually when we jump in and help companies organize their code and to make sure that they can scale to a lot of people. Because the fact is that Shiny is fast, RStudio Connect is fast and allows you to scale even up to 10,000 users. But it needs to be implemented in a given way with a given architecture. And I'm going to share with you what has to be done in order to achieve such scaling.
So first of all, let's look at Shiny. There are two ways that you can scale your application. The first one is vertical scaling. This is increasing the number of users that one machine can handle. The other is horizontal scaling, which means adding as many machines as you need to make sure that the application covers as many people as you want. One thing to know is that you should first work on vertical scaling, so that your application is fast and very robust, and then you can add as many machines as you want. Otherwise, you're going to use a lot of resources.
So for the first one, there are basically three key things that you can do. The first is to leverage the front end, which means using JavaScript to handle fast user interactions that do not change the data, so that you don't have to communicate with the server that often. The second is to extract computations, handling the resource-intensive operations away from the application so that the server itself isn't overloaded with the computations. And the third one is to set the right architecture, which I'm going to show you as well.
One thing that you should keep in mind: if you want your Shiny application to be fast and you want it to work for many users, you should make sure that the Shiny layer is thin. It should be a communication interface between the data and the UI. This is really what we see when we join different companies and work with them: there is a server, there is a UI, and there is plenty of communication happening between the front end and the back end. What you should be doing to make your application fast is, first, make sure that operations that do not need any communication with the server happen on the front end, and that the server actually calls external services to fetch the data and to filter the data there, not inside the memory of your machine. And then you can add as many machines as you want.
So let's take a look at the first one, leveraging the front end. There are a few very simple things that you can do to already scale your application, and I'm going to go through them very briefly. First one: render inputs in the UI and only update them in server.R. This is a very simple thing that you can already apply to your applications. Very often people create a uiOutput in the UI and then put all of the rendering in the back end. This requires re-rendering the whole widget whenever the other inputs change, and this makes the application slower. What you can do instead is render the numeric input on the front end and only update the inputs on the server. A very simple change that already gives you a lot of leverage.
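A minimal sketch of this pattern, with invented input names: the widget is created once in the UI, and the server only updates its settings with `updateNumericInput()`, instead of re-rendering the whole widget through `uiOutput()`/`renderUI()`:

```r
library(shiny)

ui <- fluidPage(
  # The input is rendered once here, on the front end...
  numericInput("threshold", "Threshold", value = 10),
  selectInput("dataset", "Dataset", choices = c("small", "large"))
)

server <- function(input, output, session) {
  observeEvent(input$dataset, {
    # ...and the server only updates its settings. The widget itself
    # is never regenerated, so the browser does far less work.
    new_max <- if (input$dataset == "small") 100 else 10000
    updateNumericInput(session, "threshold", max = new_max)
  })
}

# shinyApp(ui, server)
```

The anti-pattern this replaces would be `uiOutput("threshold_ui")` in the UI plus a `renderUI()` in the server that rebuilds the entire `numericInput` on every change.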
Second, you can inline JavaScript code with the shinyjs package by Dean Attali. This very simple thing allows you to call JavaScript right from the server, so you don't have to generate the whole HTML on the back end; you can just ask JavaScript to toggle some changes in the UI without actually going back and forth. And the third one is to set up actions in JavaScript without a server.R part at all. So when you create, for example, an action button, you can right away create an onclick hook for the JavaScript call. And what is important here is that if you have skilled people on your team, you have created a great application that is used by a lot of people, and you want your team members to grow, then JavaScript is a natural next step for them to learn to make the applications robust, look better and move faster.
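A sketch combining both ideas, with illustrative element IDs: an `onclick` handler that runs entirely in the browser with no server round trip, and a server-side `shinyjs::toggle()` that flips visibility without regenerating any HTML:

```r
library(shiny)
library(shinyjs)  # Dean Attali's package for driving JavaScript from R

ui <- fluidPage(
  useShinyjs(),
  # Pure front-end action: the onclick JavaScript toggles the panel
  # directly in the browser; server.R is never involved.
  actionButton(
    "toggle_help", "Help",
    onclick = "document.getElementById('help_panel')
                       .classList.toggle('shiny-html-output');"
  ),
  div(id = "help_panel", "How to use this app"),

  div(id = "details", "Extra details"),
  actionButton("show_hide", "Show/hide details")
)

server <- function(input, output, session) {
  observeEvent(input$show_hide, {
    # shinyjs sends a small JS instruction to the browser;
    # no HTML is re-rendered on the server side.
    toggle("details")
  })
}

# shinyApp(ui, server)
```

In a real app the onclick handler would usually toggle a CSS class defined in your own stylesheet; the class name above is only a placeholder.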
Let's talk about extracting computations. The first idea is to use a remote API, and Plumber from RStudio is a great package to do that. When you think about your application, you shouldn't load the entire data set into your application. Instead, you can ask external services to filter the data for you, and you just show the user the excerpt of the data that is useful. Because very often, as Pedro mentioned, you don't need to put too much information in front of the user; they just want to see what they need. You can very easily wrap data extraction logic into a simple API with Plumber. This is done just by adding special comments. There is great documentation about it, so make sure that you check it out if you don't know Plumber. And what is really important is that you can easily deploy this with RStudio Connect. So just as you have your applications running on RStudio Connect, with a click of a button you can also deploy an API. And many applications can actually use the same API, which allows you to reuse your code across many different projects.
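A hedged sketch of what such an API could look like; the data set, endpoint and parameter names are invented. The `#*` annotations are the special comments mentioned above:

```r
# plumber.R -- wraps data-extraction logic in a small HTTP API
library(plumber)

sales <- read.csv("sales.csv")   # illustrative data set

#* Return only the rows the dashboard actually needs
#* @param region Region to filter on (empty string means all regions)
#* @get /sales
function(region = "") {
  if (region != "") {
    sales <- sales[sales$region == region, ]
  }
  head(sales, 100)   # send the app an excerpt, not the whole table
}
```

Locally this can be served with `plumber::plumb("plumber.R")$run(port = 8000)`; the same file can be published to RStudio Connect, where several Shiny apps can share it.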
Side note: make sure to use efficient data libraries, because very often when you have an API, you want to share some data, and you often read it from disk. There are many different packages for reading data, and with them you can read a huge amount of data without affecting the memory of the system much. So make sure you check out those packages and see how they actually improve the app performance. Sometimes just changing the library that you use for manipulating data can unblock a lot of potential in your application.
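The talk doesn't name specific packages; as one common example, `data.table::fread()` is a fast reader that can also load only the columns you need (the file and column names below are invented):

```r
library(data.table)

# Base R reader: single-threaded, parses everything
system.time(df <- read.csv("big_file.csv"))

# data.table's fread: multi-threaded, usually much faster on large files
system.time(dt <- fread("big_file.csv"))

# fread can also read a subset of columns, which reduces memory use --
# useful when an API only serves a few fields of a wide table
dt_small <- fread("big_file.csv", select = c("id", "value"))
```

Other packages in the same spirit (not named in the talk) take similar approaches to fast, memory-conscious reads.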
For extracting computations, the second part that is very often crucial for applications is to start using a database. The story usually starts with a simple application where the data is just loaded into memory. Looks fine: I have a UI, I have data, and then I have a server that filters the data and shows the results to the user. Just a simple application. However, if the file is big, for example one gigabyte, it's going to cause you some trouble. First, when you have just five users, RStudio Connect is going to distribute them evenly between processes, and they are not going to use, for example, three gigabytes of RAM. If you have 13 users, it can get a little bit more, but the machine is going to handle it well. However, the more users use your application, the more processes you have to create, and each process is going to keep the same object in memory. So instead of putting all of the data into memory, it is crucial to move it out to a database and just use filter methods and different selections to get the data that you need. Then your machine is going to be very light, very thin, you can have a huge number of users using your application, and the database is going to handle all of them. Also, an important thing to know is that you can use dbplyr to do the filtering in the database just as you manipulate in-memory data with dplyr.
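A sketch of the dbplyr approach described above; the connection details, table and column names are illustrative. The dplyr verbs are translated to SQL and run inside the database, so only the small result enters R's memory:

```r
library(DBI)
library(dplyr)
# With dbplyr installed, the dplyr verbs below are translated to SQL

# Illustrative connection; any DBI backend works the same way
con <- dbConnect(RPostgres::Postgres(), dbname = "sales_db")

sales <- tbl(con, "sales")     # a lazy reference; nothing is loaded yet

monthly <- sales %>%
  filter(region == "EMEA") %>% # becomes a SQL WHERE clause
  group_by(month) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()                    # only the aggregated result enters R

dbDisconnect(con)
```

Every Shiny process holding only a connection and a small result, instead of a one-gigabyte data frame, is exactly what keeps the machine "light and thin" as users multiply.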
Architecture for scaling
So let's talk about the architecture. We all know Shiny Server open source; it allows us to very quickly deploy the application. Right away after you deploy the application, you will see that RStudio Connect is a great choice for having multiple processes and better control over how users use your application. It also has the great functionality of deploying applications with a single click of a button. What we like to use on our side is Ansible to provision the whole infrastructure. So imagine you have a bare metal machine. With Ansible, you can install all of the packages that are needed for RStudio Connect, then you can install RStudio Connect, and then you can even deploy the application, because there is a great package, rsconnect, which has a whole API to deploy applications from the command line. So really, I think that for everyone who does development at a big scale, using Ansible is a really good choice.
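As an illustrative sketch (the server URL, account name and paths are invented), the rsconnect API can script a deployment, for example as the last step of an Ansible playbook:

```r
# Scripted deployment to RStudio Connect -- runnable from the command
# line, e.g. Rscript deploy.R, so automation tools can drive it.
library(rsconnect)

addConnectServer("https://connect.example.com", name = "production")

connectApiUser(
  account = "deploy-bot",
  server  = "production",
  apiKey  = Sys.getenv("CONNECT_API_KEY")  # secret injected by the CI/CD tool
)

deployApp(
  appDir  = "app/",
  appName = "sales-dashboard",
  server  = "production"
)
```

Keeping the deployment as code means the same playbook that provisions a bare-metal machine can also publish the application on it, end to end.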
Now, this is the architecture that you would have for your applications. You set up multiple servers, and they all have to be behind a load balancer with sticky sessions. They all have to connect to the same database, and they share the authentication mechanism and a shared disk. As a bonus, there is also a great option you can use: for example, you can have an offline mobile application that uses endpoints deployed to RStudio Connect with Plumber. This is also something that we have done successfully. So there are plenty of things that you can do, and I encourage you to try them.
You may think that this is all. Of course, this is all from me. But when you think about it, you know how to create the application, you know how to make it beautiful, you know how to collaborate with your team, and then you can even scale it. But now comes the big thing, because when you have such success, then you want to share this success with others. And that is why Alex is going to tell you what you can do to actually go beyond one single project and actually make this collaboration fruitful for everyone.
Growing a remote data science team
Thank you. Awesome. Thanks, Damian. I will share my screen here now and start presenting. Great. Hey, everybody. Thanks so much to all of my co-presenters, Damian, Pedro and Olga, for leading us through the better part of this hour. I'm super excited to close out here and talk, let me make sure I have the right click, there we go, about growing a remote data science team. My name is Alex Gold. I'm a solutions engineer at RStudio. And if you feel like tweeting at me, my Twitter is down there on the bottom.
So let's jump right in: growing a remote data science team. Just to tell you a little about me: this is my face, since you can't see it on the video right now. More importantly, this is my puppy; she's much cuter than I am. My background is that before I came to RStudio, I was a data scientist and a data science team lead. Now, here at RStudio, I work as a solutions engineer, and what that means is that I spend my time helping teams figure out how to make the most of open source data science tools like R and Python, and how to make the most of RStudio professional tools.
So what I'm talking about here is a combination of things that I learned as a data science team lead and things I've seen out in the wild throughout my time here at RStudio. Let's start by talking a little about what it means to grow a data science team. There are really four ways I think about growing a data science team. The first is just straight-up size, right? The number of people on the team is one way to grow a data science team. You can also grow a data science team in terms of velocity: what is the output per person? How much are you churning out? How much work are you getting done? How many models are you creating, that sort of thing. Another way you can scale a data science team is in terms of reliability: are you confident that the outputs you're putting out are correct? Are you confident that they are what they need to be? And then, of course, sophistication. That could mean graduating to using more advanced machine learning models or things like that, or maybe it just means going from using mostly Excel to using more code tools. That can be a really great way to grow a team in particular ways.
R packages as a tool for team growth
So what I'm going to spend the balance of my time on today is a tool that I think can really help with all of these things, and that tool, of course, is the R package. You might be thinking, wait, what? You're going to talk about R packages? But I want to take a little step back. I'm not going to talk about the mechanics of R packages, like how you write one; there's lots of great material on that, and I really recommend Hadley Wickham's R Packages book if you're curious. What I want to talk about is the R package as a tool to grow a data science team. How does the R package work for us? Not what is the R package?
But let's start off by describing, for anybody who's not super familiar, what an R package is. An R package is a way to package up a bunch of functions that you use commonly in R. Much like you could put a dachshund in a cardboard box and it would be nicely contained, a package is a way to contain a bunch of R functions that hang together for some reason. And I want to talk about two different ways to use R packages on your data science team to help it grow.
And so the first way is the team-wide R package. This is when your team writes one, or sometimes a few, R packages for specific things that are used across the team. What kinds of things might go in those packages? Maybe you have particular analytical functions, like certain machine learning models, or maybe you always weight your regressions the same way; those functions can go in the package. Maybe you have plotting defaults you use a lot, or the same types of plots you create over and over; putting those plotting functions in a package can really help. You can also put code templates in a package, and even R Markdown templates, which I think is a really cool use of an R package. If you have a report template that you use the same way every time, put it in a package, and now everybody across the team has it.
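As a minimal sketch of the plotting-defaults idea: a team package might export a shared theme and a small plot wrapper. All the names here (`theme_team`, `plot_team_scatter`) are illustrative, not from the talk.

```r
# Hypothetical team-wide plotting helpers built on ggplot2.
library(ggplot2)

# The team's standard styling, used everywhere for consistent plots.
theme_team <- function() {
  theme_minimal(base_size = 12) +
    theme(legend.position = "bottom")
}

# A common plot the team makes often, with the house style baked in.
plot_team_scatter <- function(data, x, y) {
  ggplot(data, aes({{ x }}, {{ y }})) +
    geom_point() +
    theme_team()
}

# Usage: plot_team_scatter(mtcars, wt, mpg)
```

With functions like these in a package, every analyst's plots look the same without anyone copy-pasting theme code.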
And then the last category of things that are really great in team-wide R packages is data-access functions. Maybe you have particular ways you use the DBI or odbc packages, and wrapping those into an R package means that not everybody has to remember how the connection works; they can just use the function in the package. That's really handy. So that's what goes in the team-wide R package. But as I said, we're not going to focus so much on the mechanics. What, then, is this package for? To me, a team-wide R package really serves three purposes.
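A data-access wrapper of the kind Alex describes might look like the sketch below. The function name, DSN, and table are all assumptions for illustration; the point is that connection details live in one place.

```r
# Hypothetical team-wide data-access helpers using DBI + odbc.
library(DBI)

# One function knows how to connect, so teammates don't have to.
team_db_connect <- function() {
  DBI::dbConnect(
    odbc::odbc(),
    dsn = "warehouse"  # assumed DSN, configured once by IT
  )
}

# A convenience wrapper around a query the team runs often.
get_recent_sales <- function(n = 100) {
  con <- team_db_connect()
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, paste("SELECT * FROM sales LIMIT", n))
}
```

If the database moves or credentials change, only `team_db_connect()` needs updating, not every analyst's scripts.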
The first is size: it's for scaling your team, particularly now that we're remote. You can't sit next to somebody and talk them through how we do an analysis, so the documentation and functions in an R package are a really great tool for onboarding new people to your team. A package is also a really great tool for increasing velocity. If it holds functions that make your work much easier, and if with every project you go back and improve the package a little bit, then every project gets easier and easier as the package gets developed further. And of course, sophistication: if you have, say, one person on your team who's just a really great R coder, you can take advantage of their skills across the entire team. Or if you have one really great machine learning engineer who writes specialized machine learning functions, all of a sudden everybody can use them. So to me, the team-wide R package is really a tool for onboarding and upskilling.
That's the purpose of the team-wide R package. I do want to give you two keys to making this work for you, though. The first key is that everyone contributes to the R package. By having everyone on the team contribute, you get much more buy-in; people are excited to use their own work and the team's work, and the team-wide package becomes a really valuable resource that actually gets used. The second key, which is way more important, is that your team-wide R package must have a clever name using the capital letter R. This is not optional. It is required. But I will give you a hint: blank-blank-helpR is always available. So you have no excuse for not having a clever name using the capital letter R.
OK, so that's the pattern of the team-wide R package, or packages. Another really popular pattern is the package per project: every analytical project gets its own R package. This is a really great pattern as well. But to me, where the team-wide R package is about scaling and upskilling, the package per project is really about reliability. It allows you to document what you're doing in a much more thorough way, and to test what you're doing in a much more thorough way, when you put those functions into a package as opposed to just defining them inline in your analysis.
I do want to go through this in just a little more detail, because this is how I would structure a project using the package-per-project pattern. At the top level, I have my project directory, which encapsulates the whole project. Incidentally, if I'm using a Git repo, I put it at the level of the whole project; you can have subdirectories inside one Git repo. Inside that, I have two subdirectories, the package and the project, kept separate from each other. I'll come back to that in a second, but they sit parallel to each other: the package and the project get developed in parallel, and something like devtools::load_all() is your friend while you're working on this. Your project functions go inside the package's R directory, and that's where you can put your roxygen2 documentation. You can actually document all of the functions you're using in your analysis, which is a really great thing to have when you come back to it later. And of course, you can also write tests; being able to unit test your code is really important.
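The layout Alex describes might look something like this; the directory and file names are illustrative, not from his slides.

```
my-project/                # one Git repo at the level of the whole project
├── mypkg/                 # the package, developed in parallel
│   ├── DESCRIPTION
│   ├── R/                 # project functions, documented with roxygen2
│   └── tests/testthat/    # unit tests for those functions
└── analysis/              # the actual project outputs
    ├── report.Rmd
    └── app.R
```

While iterating, `devtools::load_all("mypkg")` reloads the package functions into the session without reinstalling.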
And this is actually important beyond just general reliability. Maybe you just want to make your products more reliable across the board, and that's a really great thing to do. But in some places there are requirements that everything be tested before it goes to production, and this is a really great way to pull those tests out of an interactive mode and put them somewhere you can run as a script. That's one of the really great things about the package-per-project model. Then, of course, the actual work of your project sits separately from the package. You might have an R Markdown document, an app.R, some .R files, or a plumber API like Damian talked about. All kinds of things can go there, but they're separate from the package.
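A unit test in this structure might look like the sketch below, assuming a hypothetical package function `clean_sales()` defined in the package's R directory; the function and file names are made up for illustration.

```r
# File: mypkg/tests/testthat/test-clean_sales.R (illustrative path)
library(testthat)

test_that("clean_sales drops rows with missing revenue", {
  raw <- data.frame(id = 1:3, revenue = c(10, NA, 30))
  out <- clean_sales(raw)
  expect_equal(nrow(out), 2)
  expect_false(anyNA(out$revenue))
})
```

Because the tests live in the package, `devtools::test()` (or `R CMD check`) can run them as a script, which is exactly what production gates typically require.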
And I want to stick on this point for a minute. If any of you are on RStudio or RStats Twitter, there's been a really great discussion over the last few days about this. To be precise: the package-per-project pattern says each project should have a package, and I say yes to that. There's a separate debate over whether the project itself should be a package, and my argument is no on that. I want to be clear, because I've spent the last eight minutes saying packages are amazing as a tool for developing your team, but I don't think every project should be a package. Mainly, you end up with a lot of weird metadata on your project that doesn't work quite right. You can end up with weird circular dependencies, where the package depends on itself to do its own work, and that's really messy. And you can end up in a state where your package isn't actually installable, which is not a great place to be. So the package can be a great tool team-wide for sharing functions across people, and a great tool for increasing the reliability of an individual project, but it's not meant to be a whole project; that's not the intent of an R package. At least, that's my argument, and there are others who will argue very intelligently to the contrary. But I would argue that you should not have your entire project be an R package.
OK, thank you so much for tuning in. A couple of high-level takeaways: R packages are a tool for scaling and upskilling your remote data science team. A team-wide package is a great tool for increasing size, velocity and sophistication, and a per-project package is a great tool for increasing the reliability of your work. With that, I'm going to pull back up the slide from Sam and hand it over to her.
Q&A
Thank you so much, Alex, and thank you so much to all of the presenters. I thought that was a really fantastic story, from how to build a team down to the actual tools you need to be successful. So let's go ahead and jump into some questions; I've got a few queued up. The first one, which I think is for Olga: could you elaborate on how people technically work together using renv, Docker and GitLab? What's the operational pipeline to use here as a best practice?
So usually, when a project starts, we build a dedicated Docker container for it, and inside the Docker container we set up RStudio. We do the development inside the Docker container, and to keep track of all of the packages and their versions, and to make collaboration easy, we use the renv package, which is available from RStudio. Whenever we install a new package, we install it through renv, so the renv.lock file is updated. Later, when we finish the work, we commit all the changes: both the changes to the Dockerfile and the changes to the code. We commit them to GitHub or GitLab; it doesn't really matter which service you use. And for the Docker side, you either snapshot the new Dockerfile or rebuild the image and push it to a Docker Hub repository. So basically, this is our whole workflow. For how exactly to set up renv with Docker, there is a vignette in the renv package that more or less describes how to do this.
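The per-project image Olga describes might start from something like the Dockerfile sketch below. The base image, R version, and paths are assumptions for illustration; the renv vignette she mentions covers the real details.

```dockerfile
# Illustrative per-project image: pin R, then restore the exact
# package versions recorded in the project's renv.lock.
FROM rocker/r-ver:4.0.2

# Copy only the lockfile first so this layer is cached between builds.
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# Then bring in the project code itself.
COPY . /project
WORKDIR /project
```

Rebuilding and pushing this image alongside each code commit keeps the environment and the analysis versioned together.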
Just wanted to quickly add something: very often we also have a base image that already contains cached libraries that you use a lot, and multiple projects can use this image as a base so that you don't have to install everything from scratch. This is something we recommend if you want reproducibility across your team. That's great, thank you so much. And on that note, Olga, are you using RStudio Desktop, RStudio Server, or RStudio Server Pro? What does your tooling look like there?
Well, this really depends. Either we use RStudio Desktop, or we can simplify this workflow and get rid of the Docker part, but then we're usually working with RStudio Server Pro. In that case we still use renv to keep control over the packages. But it really depends on what our clients actually want to achieve.
Let's see. That's great, thank you. And then, just making sure we have questions for everyone, I see one here that I believe is for Damian. What recommendations can you make when the dashboard requires permissions and content that depend on the user viewing the dashboard? As a follow-up: is LDAP a good fit here, or what other authentication sources might you use?
So I think there are two parts to handling access to the data. One part, of course, is authentication: who is allowed to access the application? With RStudio Connect, you can hook up to any service that you have internally in your company. Very often companies use LDAP so users keep the same usernames and passwords, and you can also have single sign-on and all the other good things with RStudio Connect. The other part is authorization, which means two users can be logged into the same application but see different parts of the app. Usually that is something you have to implement yourself within the application, by assigning roles within the app based on the business context that you have.
One quick thing I'll add to what Damian said, which is exactly right: one of the ways you can do that inside Connect is that in the session data you get the user and the groups. So you can take that data in a running session and use it to do authorization. Exactly, that's a very good point.
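A minimal sketch of what Alex means in Shiny: RStudio Connect populates `session$user` and `session$groups` for the authenticated viewer, and the app can branch on them. The group name here is illustrative.

```r
# Sketch: role-based content in Shiny using the session data
# that RStudio Connect provides for the logged-in user.
library(shiny)

ui <- fluidPage(uiOutput("content"))

server <- function(input, output, session) {
  output$content <- renderUI({
    user   <- session$user    # NULL when running outside Connect
    groups <- session$groups
    if (is.null(user)) {
      return(p("Not logged in (local development)."))
    }
    if ("finance" %in% groups) {    # assumed group name
      p("Showing the finance view for ", user)
    } else {
      p("Showing the default view for ", user)
    }
  })
}

shinyApp(ui, server)
```

Note that authentication still happens at the Connect layer (LDAP, SSO, and so on); the app only consumes the resulting identity to decide what to render.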
That's great, thank you. So now let's quickly jump to some package talk; I'm sure everyone loves that. So, Alex, what are your recommendations for when functionality should be extracted from a project-specific package into a team-wide package? Oh, that's a great question. My favorite reference here is David Robinson, a data scientist who's big in the RStats community. He says, and repeats, that you should do a lot of work in public, and that if you say something three times to somebody, you should put it in a blog post. My corollary is that if you find yourself using the same function three times, you should put it in a package. So that would be my recommendation: certainly if you have three projects using the same function, you should extract it into a package, and sometimes even two is worth it. To me, the flag is just: do I want to use this function more than once or twice? Then it belongs in a team-wide package. For the per-project package pattern, you don't necessarily need to reuse a function; sometimes the testing and the documentation are worth extracting it into the package all by themselves. But you obviously want to keep the namespace of your team-wide package somewhat compact, so waiting until you need something two or three times before pulling it in is often worthwhile.
If I may add to this: a great tool you can use as a product is RStudio Package Manager, which is basically a copy of CRAN within your own organization that you control, and you can install all of your packages there. This allows you to share a package that is private to your organization with everyone within the organization. So be sure to check it out. That's great.
Thank you, Damian. I think we've got enough time for one last question, and it's one anybody can answer, so I'd love to hear what different folks think. What are some strategies to maintain consistency in file locations on the server for databases, R scripts, R Markdown projects and dashboards? We have issues with files everywhere but no structure. What tips might you have on maintaining that consistency?
I'll go ahead real quick. I think a really great solution for this is to use a project-based workflow. If you look up the Tidyverse blog, Jenny Bryan has a great article on project-oriented workflows. What this allows you to do is make projects more self-contained, and I think that's a really great place to start. Obviously, there are other tools for consolidating data and database access and things like that, but starting with a project-based workflow is where I would begin.
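One common way to make a project self-contained, in the spirit of the workflow Alex points to, is to build every path relative to the project root with the here package. The file paths below are illustrative.

```r
# Sketch: project-relative paths, so scripts work no matter where
# the project folder lives on the server or a teammate's machine.
library(here)

# here() resolves against the project root (e.g. the .Rproj location),
# not the current working directory.
sales <- read.csv(here("data", "raw", "sales.csv"))  # assumed layout

# ... analysis ...

saveRDS(sales, here("output", "sales.rds"))
```

Because nothing depends on `setwd()` or absolute server paths, the whole project directory can be moved or cloned and everything still runs.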
