Andrew Mangano | Growth Hacking - Product Analytics at Scale using R and RStudio | RStudio (2020)

Salesforce is not only a cloud software solution out of the box, but also a highly customizable platform that can be modified for a wide range of use cases. In addition to that complexity, customer trust is our #1 company value, and customer data privacy is abstracted from everyone outside of the customer. Product and Growth Analytics is an emerging field, separate from business analytics and data science, that focuses on building software products that improve user retention and engagement. Companies like Facebook and Airbnb have robust data science teams focused on product analytics. At Salesforce, however, given the scale, customization, and privacy values, product data science is not so straightforward. Utilizing R and RStudio tools for collaboration and reproducible analytics, the Data Intelligence team is able to solve complex problems at enterprise scale. This talk will preview anonymized predictive and growth analytics work while also highlighting how we work and collaborate across platforms and languages (Python via reticulate).

Transcript

This transcript was generated automatically and may contain errors.

Hello, and welcome to San Francisco. I'm so excited to be here speaking at RStudio conference. I'll explain a little bit about my background. But this is one of my favorite conferences, definitely top two in San Francisco.

But I'm excited to be here, excited to talk to you about growth hacking with R and different analytics. So with that, if you've ever heard a Salesforce presentation before, you know what's coming next. And that's the forward looking statement. So I'm not really going to talk about Salesforce software today. However, if you are considering buying Salesforce, please make sure that you consider only features that are regularly available in the market.

Personal background

Okay, now that's out of the way, I'm going to give you a little bit of an introduction to myself, a little bit of personal background. I studied undergrad at UPenn, economics and film studies, so left brain, right brain. And when I graduated, I went into a career that was hiring; I graduated in the recession. And I spent the first 10 years of my career at Macy's doing merchandising. And you know, it was a great company to learn business. It was in transition; there was a data transition happening. But I didn't quite feel the right fit professionally.

And I was looking for somewhere to go looking how to transition. And it was very difficult. You know, we had a statistics department, but they were all PhDs. And fortunately, the world changed. And right at this point, this little red line, I discovered R. And it was this community that's helped each other, that's educated, that's helped me transform not only my career, but my life, my family.

It took me to a master's in analytics from NYU, I relocated from the East Coast to the West Coast, and worked in e-commerce grocery. And then most recently, now at Salesforce doing software analytics, product analytics and data science. And so really, the thing that I need to say, more than anything, is thank you. Thank you to this community; you've helped me in ways that you can't really know. And I really, truly appreciate that, which is why I'm so excited to be here to talk to you about my topic, and hopefully give back a little bit.

What is Salesforce?

So before we get started, I'll just talk real quickly: what is Salesforce? If you're in San Francisco and you ask different people what Salesforce is, most people will say it's a tower. Some people will say it's a transit center. But really, what is it? Salesforce brings customers and companies together. And that's really what our mission is. And how we do that is through a suite of cloud software products.

If you work in sales, you're very familiar with Sales Cloud. But Service Cloud powers customer call centers and customer service. Marketing Cloud touches almost everybody here through targeted marketing, emails, push notifications. Commerce Cloud runs e-commerce websites. So there's a whole suite of cloud-based applications. Most recently, in the analytics cloud for us is Tableau, which is a recent acquisition. And along with that, we have a platform of technologies and emerging technologies that are built right into Salesforce: mobile, blockchain, artificial intelligence with our Einstein. So this is what Salesforce is. It is a big company with petabytes of data.

And who deals with the data? And that would be the data intelligence team at Salesforce. And that's the team that I work on. Our team is divided into five pillars of the modern challenges of data at a company. But what I'll talk to for myself, most important, is strategy and growth. And that leads into this talk.

Growth analytics and retention

So the customer journey, particularly for SaaS companies, deals with acquisition, retention (keeping the customer), and then cross-selling and upselling. And so this customer journey has a lot of support behind it in steps one and three. Primarily, acquiring new customers is very robust, with marketing attribution analysis and a lot of marketing analytics that is very well codified. And the same thing with the last step, with cross-selling. Market basket analysis, the Apriori algorithm: there are books about it, they teach it in school. And so it's something that's fairly common.

But retention analytics, I found, and particularly when we're doing hiring, is a lot less structured and a lot less organized in an academic sense. And so what is that middle piece? And so what I'm calling it today, growth hacking, but also growth analytics and retention analytics, is focusing on the customers that you have. So not worrying about cross selling or acquisition, but focusing on how to keep users, how to keep customers.

And we have some primary metrics that are very important to growth analytics and retention analytics. The big one is MAU, monthly active users. And what's different about software from other businesses, and you can see this when you read about it in the news, is that they don't talk about sales. They don't talk about clicks or purchases or items. If you read about Slack versus Microsoft Teams or things like that, they talk about users and active users. And that's the primary metric that we focus on, really targeting and zeroing in on that.
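The talk does not spell out how MAU is computed, so the following is only a minimal sketch under the usual definition: the number of distinct users with at least one logged event in a calendar month. It is written in Python rather than the talk's R, and the event records and the function name `monthly_active_users` are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event log: (user_id, event_date). Invented sample data.
events = [
    ("u1", date(2018, 1, 3)),
    ("u1", date(2018, 1, 20)),  # same user, same month: counted once
    ("u2", date(2018, 1, 9)),
    ("u1", date(2018, 2, 2)),
    ("u3", date(2018, 2, 14)),
    ("u2", date(2018, 3, 1)),
]

def monthly_active_users(events):
    """Return {(year, month): number of distinct users active in that month}."""
    users_by_month = defaultdict(set)
    for user_id, day in events:
        users_by_month[(day.year, day.month)].add(user_id)
    return {month: len(users) for month, users in sorted(users_by_month.items())}

mau = monthly_active_users(events)
for (year, month), count in mau.items():
    print(f"{year}-{month:02d}: {count} active users")
```

Using a set per month is what makes repeated log lines from the same user count only once.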

Growth analytics is related to LTV and LTV analysis, but it differs in some key ways. And I'll show kind of an example of that. And really, the interesting thing about this analysis is that it's fairly simple. But the key is finding different segments and cohorts within the data and finding the differences between them. And I'll show how to do that in this talk.

So we're going to focus today mostly on MAU and retention. But there are a lot of challenges with retention analytics. And the challenge is that we tend to use log data. So we use instrumented data on logs. It could be logs from a website, logs within software. But it's machine logs. This is mostly for engineers. This isn't optimized for analytics. But they're abundant. There's, like I said, petabytes of data that you can use. It's just really hard to use because it could be redundant. There could be many challenges with it. And it requires a lot of transformation. And there are commercial tools available. However, they are very expensive. And this is RStudio. This is our conference. We want it to be free and open source.

And the challenge that I find and what I see is that there's a skills gap in taking those raw logs and converting it to an action and an insight. And so what I'm going to show here is using the tidyverse syntax, generating product analytics KPIs with very little code. So if you were looking for something using PCA and something really complex, I apologize. This is going to be very, very concise and hopefully very usable.

Retention analytics in R

Retention analytics in R. All right. So we're going to use four main packages today. Obviously, the tidyverse. We're going to use plotly, formattable, which is based off of knitr, and then htmlwidgets. The data that we're going to use is all generated, is random, and is not any user data from Salesforce. And I will release how to generate that data set so you, too, can do this.

And so when we look at the data, what we see is we see it's at the month grain. We have a user ID. So we have user logs per month. And then we have some, let's call it demographic data behind it. So maybe what industry this user is in, what region. And we'll show you how to slice that on cohorts.

Now, what's really interesting, when you think about retention analytics and the key KPI of MAU, is that R does not have monthly units in its base difftime function. It stops at weeks. And this was really interesting because you could be kind of working through this and say, why isn't it working with months? What's wrong with my code? Why am I calculating months in millisecond intervals? And so to get around that, it was actually the R community that posted a custom function on Stack Overflow to do monthly intervals. And so we're going to rely on that with some other functions.
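The Stack Overflow function itself is not reproduced in the transcript. As a hedged sketch of the idea, the common approach is plain calendar arithmetic on the year and month fields, ignoring the day of the month. The version below is in Python, not R, and the name `elapsed_months` is an assumption borrowed from how the speaker refers to the function later.

```python
from datetime import date

def elapsed_months(end, start):
    """Whole calendar months from `start` to `end`, ignoring the day of month."""
    return 12 * (end.year - start.year) + (end.month - start.month)

# Same month gives 0, even at opposite ends of the month.
print(elapsed_months(date(2018, 1, 31), date(2018, 1, 1)))
# Crossing a year boundary: November 2018 to January 2019.
print(elapsed_months(date(2019, 1, 1), date(2018, 11, 30)))
```

Because only the year and month are compared, two dates one day apart can be one month apart (January 31 to February 1), which is usually what is wanted for monthly-grain activity data.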

We take that raw data set of logs and using really two lines, we mutate it very easily to get the most important piece of information for retention analytics. So we're going to group by the user ID, and then we're going to mutate and we're going to add an age and the first month, the start month. And this start month is when we first see that user. That's time zero. You'll see why that's important. And because we use mutate, we use custom functions, it's very easy. We're just going to use this elapsed month function.
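The slide code for this step is not in the transcript. A minimal sketch of the same logic, in Python rather than the tidyverse and with invented log rows, might look like this: find each user's earliest month (time zero), then attach that start month and the age in months to every row. The helper `elapsed_months` and the function `add_start_and_age` are hypothetical names.

```python
from datetime import date

def elapsed_months(end, start):
    """Whole calendar months from `start` to `end`, ignoring the day of month."""
    return 12 * (end.year - start.year) + (end.month - start.month)

# Hypothetical monthly log rows: (user_id, activity_month). Invented sample data.
logs = [
    ("u1", date(2018, 1, 1)),
    ("u1", date(2018, 2, 1)),
    ("u1", date(2018, 4, 1)),  # u1 skipped March
    ("u2", date(2018, 2, 1)),
    ("u2", date(2018, 3, 1)),
]

def add_start_and_age(logs):
    """Annotate each row with the user's first month and months since then."""
    # First pass: earliest month seen per user (the "group by user" step).
    start_month = {}
    for user_id, month in logs:
        if user_id not in start_month or month < start_month[user_id]:
            start_month[user_id] = month
    # Second pass: add the two derived columns (the "mutate" step).
    return [
        {
            "user_id": user_id,
            "month": month,
            "start_month": start_month[user_id],
            "age": elapsed_months(month, start_month[user_id]),
        }
        for user_id, month in logs
    ]

rows = add_start_and_age(logs)
for row in rows:
    print(row["user_id"], row["month"], row["start_month"], row["age"])
```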

This is the retention code. This is what will generate all of our visualizations, all of our analysis. And I apologize if you were looking for PCA or hyperplanes or multi-dimension. It is very easy. It is only three lines. Once we have that age data, we're going to group it by that age, we're going to summarize it by looking at a total users over time, and then we're going to mutate what this retention percentage is.
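The three lines of R are not captured in the transcript. The logic described (group by age, count users, divide by the starting count) can be sketched as follows in Python, with invented input data. It assumes every user has a row at age 0, which holds by construction when age is measured from the user's first observed month.

```python
from collections import defaultdict

# Hypothetical (user_id, age_in_months) pairs. Invented sample data.
user_ages = [
    ("u1", 0), ("u1", 1), ("u1", 2),
    ("u2", 0), ("u2", 1),
    ("u3", 0),
    ("u4", 0), ("u4", 2),
]

def retention_by_age(user_ages):
    """Return [(age, active_users, retention_fraction)] ordered by age."""
    users_at_age = defaultdict(set)
    for user_id, age in user_ages:
        users_at_age[age].add(user_id)
    # Assumption: age 0 contains every user, so it is the denominator.
    base = len(users_at_age[0])
    return [
        (age, len(users_at_age[age]), len(users_at_age[age]) / base)
        for age in sorted(users_at_age)
    ]

retention = retention_by_age(user_ages)
for age, users, pct in retention:
    print(f"age {age}: {users} users, {pct:.0%} retained")
```

Note that u4 is counted at age 2 despite being absent at age 1: this measures who is active at each age, not who has been continuously active.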

So this, what we see, is that we have 13,000 users to start with in our total data set, 13,000 users that initially start, and then over time, you see them falling off. And this is very good for somebody to just sort of see in a table, but it doesn't really make sense. And so what we do then is we show it in a ggplot. And what this table shows, what this chart shows on the x-axis, is sort of that age, that life. What is that life of the user, and how many users do you retain over time?

And what's important for software is that if your user base goes to zero after a certain amount of time, you probably have some problems with your product, or you have a certain product that needs a certain fit, you need to keep releasing new things. But for this, if you're in software, if you're in SaaS business like this, and you go towards zero, something's wrong, you need to see that. And this is the base function that we look at.

But what's more important is the analysis and finding those different cohorts. And so just by changing that one line, by changing that group by function, and by adding a color, all of a sudden you have something much more useful and much more insightful. So what are the differences between my users? Where are they coming from? Why is the retention different between the two? And you can see it. Somebody can understand, in this case, maybe how the banking industry is outperforming the tech industry for user retention. And that leads to business insights and then strategies.
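To illustrate what adding a variable to the grouping does, here is a hedged Python sketch (not the speaker's R code) that computes one retention curve per segment. The industry labels and rows are invented, and each segment is normalized by its own age-0 user count.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, industry, age_in_months). Invented sample data.
rows = [
    ("u1", "banking", 0), ("u1", "banking", 1), ("u1", "banking", 2),
    ("u2", "banking", 0), ("u2", "banking", 1),
    ("u3", "tech", 0), ("u3", "tech", 1),
    ("u4", "tech", 0),
]

def retention_by_group(rows):
    """Return {group: {age: retention_fraction}}, one curve per group."""
    active = defaultdict(set)  # (group, age) -> distinct users
    for user_id, group, age in rows:
        active[(group, age)].add(user_id)
    curves = {}
    for (group, age), users in sorted(active.items()):
        base = len(active[(group, 0)])  # each group's own starting size
        curves.setdefault(group, {})[age] = len(users) / base
    return curves

curves = retention_by_group(rows)
for group, curve in curves.items():
    print(group, {age: f"{pct:.0%}" for age, pct in curve.items()})
```

The only change from an overall retention curve is that the grouping key is (group, age) instead of age alone, which mirrors the one-line change the speaker describes.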

Now this is useful, but it's also hard to read. If you look at it, you can say, well, at 10 months, what's the overall difference between the two groups? You can kind of parse it out, but not so great. And that's where we use Plotly. And again, I apologize how simple it is. You take that ggplot that we did in the last chart that we saved, a ggchart variable, and all you do is you just put it into ggplotly. And because of the community that's coded this, the work is done, and you can see very easily, an end user can see, the differences between the groups in a visual form. And they can really see and understand what's going on with their user base.

Now R, being the fantastic tool that it is, can handle much more complexity than that. And so one of the things that we look at is we can look at change over time. And the change over time can be represented by this starting of the different cohorts. And what we see in this chart is that we have two years of data. And while it looks like spaghetti, based on the coloring schemes that R just did right out of the box, changing that date to a factor, you're able to see what's really going on within your user cohorts. And this could lead to some useful insights for either seasonality or differences between customers and user groups.

Building the cohort retention table

What we're going to move to now from the visual form: those retention curves look great on charts, but sometimes you need to go deeper. Sometimes you need a little bit more insight and granularity about what happened when. So this is where we're going to use the tidyverse to stack different functions to make very clean chart outputs.

And so what we have here is we have that same retention curve that we just saw in a visual form, but we're looking at it as a table. So that the start month on the left, January 2018, you can follow that diagonal where you retain all of your users. That's when they start. So you have 100%. And as you go each subsequent column, each subsequent month, you can see how many of the customers, how many of the users you retain. And so you can see it falls off. But it's hard to read. It's hard to follow.

So what we're going to do is we're going to use the scales package, clean it up a little bit. We're going to replace the NAs, and we're going to make it easier to read. So this is an improvement. This is starting to get something that's human readable. But we can still do better. We're going to add a little bit more information. And using the tidyverse, we're going to stack a join on top of it to show you the size of the cohort. So now what you see, you're packing more and more information in it. You're saying, which months are important? What is my retention over time? And now you're starting to add some understanding to it.
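The table-building code is not in the transcript. As a rough sketch of the result being described, the Python below (not R, with invented data) builds a cohort table with one row per start month and one column per calendar month, formats retention as percentages, blanks out the cells before a cohort exists, and adds the cohort size. The function name `cohort_table` is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, start_month, activity_month) as "YYYY-MM" strings.
rows = [
    ("u1", "2018-01", "2018-01"), ("u1", "2018-01", "2018-02"), ("u1", "2018-01", "2018-03"),
    ("u2", "2018-01", "2018-01"), ("u2", "2018-01", "2018-02"),
    ("u3", "2018-02", "2018-02"), ("u3", "2018-02", "2018-03"),
    ("u4", "2018-02", "2018-02"),
    ("u5", "2018-03", "2018-03"),
]

def cohort_table(rows):
    """Return (header, table) with percent-formatted retention per cohort."""
    active = defaultdict(set)  # (start_month, activity_month) -> distinct users
    for user_id, start, month in rows:
        active[(start, month)].add(user_id)
    starts = sorted({start for start, _ in active})
    months = sorted({month for _, month in active})
    table = []
    for start in starts:
        size = len(active[(start, start)])  # cohort size: users in their first month
        cells = []
        for month in months:
            if month < start:
                cells.append("")  # cohort did not exist yet (the NA cells)
            else:
                cells.append(f"{len(active.get((start, month), ())) / size:.0%}")
        table.append([start, str(size)] + cells)
    return ["start", "cohort_size"] + months, table

header, table = cohort_table(rows)
for line in [header] + table:
    print("  ".join(f"{cell:>11}" for cell in line))
```

The 100% entries fall on the diagonal, where the calendar month equals the cohort's start month.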

But it's still a little bit hard to read. It's a lot of numbers. Maybe your data scientists don't like to read as much. Maybe your business partners don't. So we need to make this a little bit easier to understand. And so what we're going to do is we're going to use the formattable package. We're going to add a lot of gobbledygook code that I'm going to truncate at the bottom. But what we're going to do is we're going to format it into something that's much more readable. And so we create a table like this where you have that retention data, that cohort data, that cohort size, and you've color-coded it. So now you see over time that things are consistent. But you really can see what's going on with your user base so that you can see differences and try to uncover what's really going on.

Again, R can handle a lot of complexity. And the complexity, fortunately for us, is very easy to handle. So by adding, again, just two variables in very simple places, you can add much more granularity by looking at the industry over time and what the retention is. And so compared to the last chart, you see that this group is outperforming in terms of retention, the other group. But it's these slices, these splits within the data that you can look at to identify differences between users and differences within your product. This could be users that take a certain action. And that's where things really start to get a little bit interesting.

Sharing results with HTML widgets

So we've built this. We've built this tool. We've built it, again, very, very simply. But it's not just for us. We need to get it in the hands of business partners and make it a little bit easier. And you could build it into a Shiny application. You could host a Shiny application. But sometimes there's an easier way. And sometimes you might not have the infrastructure. So with HTML widgets, you just save it as a widget. And what you get is you get an interactive chart that an end user doesn't need to know anything about R. They don't need to know anything about what you're doing. They just need to know the language about how to interpret it. And now they can use it.

So you can generate this. You can generate this automatically. You can host it on a server. And then even without going through something like using R and a Shiny package, you get a table. So again, very, very simple code. And what I've sort of shown was basically what retention analytics is. This is a way to kind of get you primed and started on what this field is about in growth hacking and retention.

Survival analysis connection

But I have a confession to make. Because it's not entirely true about the field not really having academic work and being codified as much. And I'm going to make a plug for your R meetup groups. And so I was talking with the Bay Area R meetup group and I was talking about retention analytics and a few other product analytics. And they said to me, you know, actually what you're doing is just applied survival analysis.

And so I was thinking about it and I looked at it and I said, wait a second, that's interesting. I talked to a university professor about growth hacking and survival analysis. And this was new to them. And that's when I realized that this R community, bringing different groups together, bringing different fields, is really where the cutting edge is. And joining all of that together is where there's a lot of creativity.

And so the survival packages and some survival analysis packages, I'm not going to go into detail into this. But if you look at that Kaplan-Meier curve, that is a retention curve that we look at in product analytics. And you can see the differences between it. What you have is you have a hazard function and a regression analysis on the right. So looking at that same data in a different way. And that's really what engages me so much in the R community is that you take something that was over here, you take science that was over there, you bring them closer together and we're all better off for it.
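The speaker does not show which survival functions were used, so the following is only a hand-rolled sketch of the Kaplan-Meier (product-limit) estimator, in Python rather than an R survival package, on invented data. A user who is still active at the end of the observation window is treated as right-censored rather than churned, which is the main difference from a plain retention percentage.

```python
from collections import Counter

# Hypothetical observations: (months_observed, churned).
# churned=False means the user was still active when observation ended (censored).
observations = [(1, True), (2, True), (2, False), (3, True), (4, False), (4, False)]

def kaplan_meier(observations):
    """Return [(time, survival_probability)] at each time with at least one churn."""
    churns = Counter(t for t, churned in observations if churned)
    exits = Counter(t for t, _ in observations)  # churned or censored at time t
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = churns.get(t, 0)
        if d:
            # Product-limit step: multiply by the share of at-risk users who survive t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # everyone who left at t is no longer at risk
    return curve

curve = kaplan_meier(observations)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

Plotted as a step function, this curve has the same shape and reading as the retention curves earlier in the talk.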

So I'm going to make a plug then. If you'd like to learn more about survival analysis and start to think about how you might apply this to your own growth hacking and to your own product analytics, there's a fantastic detailed book about applied survival analysis using R. But if you want to get started, there's a gentleman at the conference. I'd like to think he's a friend of mine and he's helped me very much. Joe Rickert runs the R Views blog, which I highly recommend. And in this blog, he actually talked about survival analysis using R, and it is a great primer to take the simple code that we did and that data and apply survival analysis to it. So go to the R meetups, meet people, interact with everybody, and new ideas will come out of it and make all of us better.

So if you're interested in Salesforce, this is the slide you want to take pictures of, or you can come talk to me. Join the Ohana. Join data intelligence at Salesforce. We are a leader in doing well and doing good. And so a lot of the themes that JJ talked about this morning are really what Salesforce stands for. We are a leader in innovation, a leader in philanthropy, and most importantly, in culture. Trust is our number one value, along with customer success, innovation, and equality. It is one of the best places in the world to work. I absolutely love it and would definitely recommend you, again, check out their careers.

And with that, I definitely want to say thank you. Thank you for having me. I hope this was helpful to you, and I definitely look forward to answering any questions or connecting with you later at the conference. Thank you very much.

Q&A

So we have the retention stats grouped by these variables. What are some actions you can take to increase the retention rate?

That's a great question. So this is like step one. This is understanding what your users are doing and then trying to figure out strategies for what you might do to improve retention. So you might actually do some discovery about why a certain group is falling off. And if you're talking about software analysis, you might have to build new features, build new products. If you're talking about traditional business, like how I've used this before, you might have to work on your product-market fit.

You mentioned that you have petabytes of log data where the analytics are applied. Can you talk about what teams inside Salesforce are in charge of processing that data? How do you get that into an analyzable form? Is that your team, another team? How does that work?

Yeah, I would say it's that larger data intelligence organization. So petabytes of data is very challenging to work with. But there are so many tools now at scale, between Spark, between Hive, that we use. And R really has great connectors to all of those to connect to it. So much so that I do most of this by myself with the support of infrastructure teams.

Is there a good way to account for how many months a user could be retained?

So I'm not sure I understand. So how long someone could be. And that's really where those curves kind of come in. So if somebody falls off early but maybe someone else who looks like them, a lookalike model, maybe hasn't, there's a lot of randomness that you could see. But I would say definitely that's really where the fun is in the analysis and trying to understand that.