Resources

Brandon Farr | 15 Years of R in Quantitative Finance | Posit (2020)

Use of R in the investment industry is established and growing. This talk will discuss changes seen in 15 years of practice within asset management firms. I hope discussion of lessons learned and recommendations will benefit those currently in finance and those interested in hearing how the flexibility of R manifests in the financial world.


Transcript

This transcript was generated automatically and may contain errors.

Good afternoon and welcome. My name is Brandon Farr. I'm the lead quantitative researcher at Copper Rock Capital Partners. We are a Boston-based long-only equity firm serving the institutional marketplace. We have clients from around the world with two-thirds of the money we manage coming from outside the United States.

I started using R in 2005 because it was the most powerful analytical tool I'd ever come across. Each year, R has become more powerful. Fifteen years ago, R use in finance was rare. It's become more common, but what has been the impact?

Let's take a look at this quote from AllianceBernstein: "We estimate that at least 99% of so-called fundamental models for individual companies manifest themselves in Excel." So our industry is still relying on a tool built for the data challenges of the 20th century. People use Excel because it's easy to learn. But as we all know, it then takes on a life of its own: ad hoc solutions end up becoming the basis for important decisions. But better data analysis requires reproducibility and scalability.

So in preparing for today, I decided to focus on what makes R compelling for those cases. What's going to dent that 99%? I'm going to focus on two important use cases. First, using R to do better analysis. By better analysis, I mean you spend more time interpreting your data than manipulating it. Second, using R to build what you cannot buy in the marketplace. I'll use Shiny to make my case.

Working with financial data

We start with data. Good analysis is based on sound data. The financial world has many providers supplying data to help us understand companies and markets; Bloomberg and FactSet are a couple of the providers that I frequently use. Here's Apple's price chart in the Bloomberg terminal. In the terminal, I can change the time period, change the data frequency, and overlay calculations onto this chart. This is great, but it doesn't scale. How do I look at 10 stocks? How do I look at a portfolio of stocks?

To do scalable analysis, I usually need to import and analyze what I'll call reported data. Mine typically comes from FactSet and is formatted for printing, not for processing. You can see the header is messy, dates are formatted as text, and rows are grouped together. But R allows us to do good data analysis even when we start with messy data.

The limits of Excel for scalable analysis

Let's look at a typical scenario. As an investment team, we need to know what is driving markets. Here's data from the fourth quarter of last year. These are sector performance reports for the U.S. and global markets. In the spreadsheet I get from FactSet, this data comes separated into tabs, and the stock data is grouped by sector and hidden because everything has been collapsed.

Now, tables of data are useful, but visualizations are better. With a chart, we can quickly see that healthcare and technology drove both the U.S. and world markets in the fourth quarter. This is hard to pull from tabular data, especially if it's on separate tabs. So how would we visualize this in Excel? Well, I would do something like this. I'd gather the data into a single tab, making sure everything's aligned so that if I'm labeling data, the labels are appropriately placed. I'd create a plot referencing the group totals. And then I'd customize the plot.

I'm sure most of you have done something similar to this and are aware that pitfalls lie ahead. It's my experience that anything that's useful leads to further questions. Is our Excel process up to the task of answering further questions? Okay, fine, but what worked in the emerging markets? Did healthcare and technology lead in the fourth quarter of 2018 as well? How do we answer these questions with our Excel process?

Well, I'll argue that just one additional tab of data is going to require me to rework my entire process. If I need to turn this around quickly, which I often do, I can only hope that I'll be accurate. To ensure that I'm accurate, I probably need to start over. The Excel solution just feels fragile. My sense is that even if you are very good at plumbing in Excel, extending an established process is burdensome, fraught with error, and probably requires a lot of repetition. This is a major hurdle to doing better data analysis.

Doing better analysis in R

So how does R help us do this analysis in a better way? Well, in R, I'm going to complete the same steps, but I'm going to do it in a structured way with code. I'll specify the data file. I will read and prepare the market data. I will collect the group information, and then I'll plot those groups. R allows us to break a process into logical, distinct steps. With proper naming, the logic of the process is explicit and readable. With care, when we need to extend this process, nothing needs to change.

Let's take a minute to look at how easy setting up robust processes has become in R. This is important because we are trying to convince ourselves and others to do the same sort of analysis in R, and to do more of it. Remember, we're trying to put a dent in that 99%. This is the heart of our process. Here we see good examples of names indicating function and a logical separation of tasks.

Let's look at the function prepareMarketData. Our function reads the tab names from the file and puts them into a tbl. Next, we make calls to readGroupedData with the help of mutate and map. After this step, we have a tbl with two rows representing the two markets in our analysis, with all of the underlying data conveniently stored in list columns. This is a very powerful structure and an example of something that wasn't possible 15 years ago, even in R.
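The pattern described here can be sketched as follows. The file contents, tab names, and column layout below are illustrative assumptions, not the speaker's actual code; to keep the sketch self-contained, it fabricates a small two-tab workbook first.

```r
library(writexl)  # used here only to fabricate a demo workbook
library(readxl)
library(dplyr)
library(purrr)

# Stand-in for the FactSet export: one tab per market (assumed layout).
tmp <- tempfile(fileext = ".xlsx")
write_xlsx(list(
  US    = data.frame(sector = c("Health Care", "Technology"), return_pct = c(10.2, 9.7)),
  World = data.frame(sector = c("Health Care", "Technology"), return_pct = c(9.8, 8.4))
), tmp)

# The heart of the pattern: tab names become the rows of a tbl,
# and each tab's full data is read into a list column via map().
markets <- tibble(market = excel_sheets(tmp)) |>
  mutate(data = map(market, \(s) read_excel(tmp, sheet = s)))

markets  # two rows, one per market, underlying data nested in a list column
```

The payoff is that the market name and its data travel together in one object, so nothing has to be matched up by hand later.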

But how did readGroupedData handle all of the messiness of our data? Though reported data is often messy, it is usually consistent. By taking advantage of the patterns we observe in the spreadsheet file, we can systematically handle the collapsed data. Here we see sector and company names occupy separate columns in the spreadsheet, and the sector group row stands on its own. Healthcare, shown here, is not carried down the column. So in this representation, group membership is implied. With R, we make it explicit.
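One way to make that implied membership explicit is with `tidyr::fill()`, which carries each standalone sector name down to the company rows beneath it. The sector names, companies, and numbers below are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Mimics the reported layout: a sector-total row stands alone,
# and the company rows below it carry no sector label.
raw <- tibble::tribble(
  ~sector,       ~company,  ~return_pct,
  "Health Care", NA,        10.2,   # group-total row
  NA,            "Pfizer",   8.1,
  NA,            "Merck",   11.5,
  "Technology",  NA,         9.7,   # group-total row
  NA,            "Apple",   12.3
)

tidied <- raw |>
  mutate(is_group_total = !is.na(sector)) |>  # flag the totals before filling
  fill(sector)                                # carry sector names down the column
```

After this step every row knows its sector, and the group-total rows are explicitly flagged rather than implied by layout.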

This is a good example of how clear it is in code when we are manipulating data. All we need to look for, in this case, is mutate commands. Copy-and-paste manipulation, by contrast, is documented nowhere. Additionally, a function like this is a perfect candidate for a package, so it can be reused across many analyses.

collectGroups is just a simple function that unnests the list columns and then filters down to just the group-total rows. If you haven't seen this pattern, it's a very powerful one. Note that the natural relationships between the market names and the underlying market data are maintained automatically. Finally, we pipe to a plotting function that captures all of our logic.
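A sketch of that unnest-then-filter step on hand-built data (the column names and values are assumptions); note how each market name stays attached to its rows with no joins required:

```r
library(dplyr)
library(tidyr)

# Two markets, with their underlying rows stored in a list column.
markets <- tibble(
  market = c("US", "World"),
  data = list(
    tibble(sector = c("Health Care", "Health Care"),
           company = c(NA, "Pfizer"),
           is_group_total = c(TRUE, FALSE),
           return_pct = c(10.2, 8.1)),
    tibble(sector = c("Health Care", "Health Care"),
           company = c(NA, "Roche"),
           is_group_total = c(TRUE, FALSE),
           return_pct = c(9.8, 7.4))
  )
)

group_totals <- markets |>
  unnest(data) |>         # one row per company/sector per market
  filter(is_group_total)  # keep only the group-total rows
```

Because the market column is replicated during the unnest, the result is immediately ready to pipe into a plotting function faceted or colored by market.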

So, we have encapsulated the entirety of our data process into four functions, none more than seven or eight lines long, and all very readable. This is better analysis with R. We get reliability. Our process is replicable and efficient because we aren't copying and pasting. Errors or future changes can be traced to the impacted steps and corrected or implemented. In my experience, processes set up in this way typically fail not because of coding problems, but because of data changes.

A quick aside. Last week, I was asked the specifics of how I performed a what-if analysis on some of our historic portfolio holdings. It had been months since I had done the work, and yet within two minutes of reviewing the code, I was able to answer the question accurately and confidently. If I had done the work in Excel, I would have had no chance.
Building Cognos: a custom Shiny app

Next, I'd like to show you how we have used R and Shiny at Copper Rock to build something we couldn't buy. Our quantitative model covers over 10,000 stocks. In the past, we delivered this to analysts and portfolio managers in an Excel spreadsheet. Here's an example of one of the weekly files. It only contains the most recent data, and it's frustrating to filter: when you apply filters in Excel, you can't see the logic; the best you can tell is whether a column has a filter applied. It's difficult to see the relationships between groups of stocks, and it's difficult to integrate this data with other data from our vendors.

So we created Cognos. Cognos is a Shiny app built to meet our team's specific analytical needs. Nothing available from Bloomberg or FactSet matches it. Even though we are a small, focused shop, R and Shiny have allowed us to create this powerful custom application.

Here's a snapshot of Microsoft through the lens of our quantitative model. Our model data indicated here comes from a Microsoft Azure database. We pull in data from Bloomberg via their API, and MSCI ESG data is sourced from flat files. All of this data provides the context that was missing from our Excel-based solution. Data is no longer siloed. Analysts don't have to make the connection. Cognos is doing it for them.

Just seeing the data in the middle of the screen moved our conversations from conjecture to data-based reality. We no longer had to rely on the recollection that, oh, the model has always liked Microsoft. Cognos can tell us, and the orange line tells us it hasn't always liked Microsoft.

On this screen, you see two important components of our quantitative model represented in the scatter plot. Higher scores should indicate better stocks. We've found it most productive to reserve our fundamental research for highly ranked stocks. So by focusing on stocks in the first quadrant, analysts can better allocate their research time. With the interactivity of Shiny apps, our analysts quickly navigate their way through a massive universe. Cognos really helps them kick-start their analysis. It is used by our team all day long.
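As a rough illustration of what a screen like this might look like in Shiny: the component names, scores, and stock universe below are invented, and this is a minimal sketch, not Cognos itself.

```r
library(shiny)
library(ggplot2)

# Fabricated universe: two model-component scores per stock (assumption).
set.seed(1)
universe <- data.frame(
  ticker  = paste0("STK", 1:200),
  score_a = rnorm(200),
  score_b = rnorm(200)
)

ui <- fluidPage(
  sliderInput("cutoff", "Minimum score (both components):",
              min = -3, max = 3, value = 0, step = 0.1),
  plotOutput("quadrant"),
  tableOutput("shortlist")
)

server <- function(input, output) {
  # Stocks in the "first quadrant" relative to the chosen cutoff.
  picks <- reactive(subset(universe,
                           score_a >= input$cutoff & score_b >= input$cutoff))

  output$quadrant <- renderPlot({
    ggplot(universe, aes(score_a, score_b)) +
      geom_point(colour = "grey70") +
      geom_point(data = picks(), colour = "steelblue") +
      geom_hline(yintercept = input$cutoff) +
      geom_vline(xintercept = input$cutoff)
  })

  output$shortlist <- renderTable(head(picks(), 10))
}

app <- shinyApp(ui, server)
# runApp(app)  # launch interactively
```

Even this toy version shows the idea: moving one slider instantly re-filters a large universe and redraws the quadrant, which is exactly the kind of interactivity a static spreadsheet can't offer.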

R and Shiny helped us build something we couldn't buy. Cognos has become our hub for data delivery. It is naturally extensible, needs no training, and requires very few IT resources.

Closing thoughts

So we walked through an example of how easy it is to set up a process to do better analysis in R. Then we looked at how Shiny apps can help you build something you can't buy, matched to exactly the way your team works. I think it is a great example of how a relatively simple Shiny app can have a significant impact. Recall, my focus for today was to make a dent in the 99%. I think these two use cases are examples of how we might start. I encourage you all to chip in.

Q&A

What package, new package, are you most excited about?

Well, as I said, I'm just excited about all the changes happening in the tidyverse and dplyr. That pattern of reading either the tabs of an Excel spreadsheet, or the files you'd get from calling dir() on a directory, and using that as the starting point for a table, which you then use map with to pull all the data in. It allows you to naturally keep file locations, names, and data all together, and then it flows naturally through the rest of the process. I literally just started using that a month or two ago and enjoy it quite a bit.

I've used R for 15 years. When I started, I don't think ggplot2 was out; we had plyr and reshape. What can happen as you work is that you get set in one way of doing things, and it's difficult to get off those patterns. So this pattern may be old to you, but to me it's been a great refresher for the work that I do.

What led you to the use of list columns instead of multiple tables at different levels?

Two thoughts. Attending RStudioConf and listening to Jenny Bryan talk about them with the Lego characters. But it's also just, to me, it just makes it easy to think about the data. Because if I need to get into the details, I can go get it. But very often I want to work at that summary level. And if something's spread across multiple tables, then to do that work, I might need to do joins just even for exploratory analysis.

And you mentioned that you're working with FactSet data, and I assume other vendors. When the format of that data changes, is that a challenge, and how do you deal with it? Yeah, so I mentioned that this process as I laid it out is typically robust to coding errors, but not to data file formats changing. For example, depending on when a file is generated, FactSet will have either seven or eight rows in the header. If that number is hard-coded in a bunch of different places, a format change kills the code. But by separating the reading and processing code into a single function and reusing that function, if the format changes I make the change once and use it going forward. It's also an obvious thing to expose as a parameter in the function call.
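The parameterization described here might look something like this sketch; the function name, the default of seven header rows, and the file layout are assumptions for illustration:

```r
library(readxl)

# Centralize the vendor-format assumption in one reusable reader.
# If the vendor moves from 7 to 8 header rows, change the default once
# (or override it per call) instead of hunting through every script.
read_report <- function(path, sheet, header_rows = 7) {
  read_excel(path, sheet = sheet, skip = header_rows)
}

# Hypothetical usage:
# us_data <- read_report("q4_sector_report.xlsx", sheet = "US")
# us_data <- read_report("q4_sector_report.xlsx", sheet = "US", header_rows = 8)
```

Every downstream script calls read_report() rather than read_excel() directly, so the format knowledge lives in exactly one place.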

And have you considered, almost getting political here, democratizing R across the firm outside of your immediate group? For example, replacing Excel usage by analysts. And if you have or have not, where's the friction you're seeing? Sure. So Cognos is the first step there. When I got to Copper Rock, the quantitative model that we delivered every week as a 10,000-row spreadsheet was actually printed out. People would print it front and back, open it up, and use rulers to scan down and find their stocks. That was the level of user I was starting with, so jumping right into coding was not going to happen. But I think Cognos is the first step.

And then I've started more in meetings, because of the readability of dplyr code, actually dynamically answering questions with the code rather than saying, I'll get back to you. And I hope that as people see me do that, the code's up on the screen, they'll start to become more interested.

There's a related question here about delivering data to non-R users in the org. Is Excel the only way that you're doing that, or are there other ways? We do some via Microsoft Outlook emails, but it's very rudimentary: a flat text file, so it's not very polished. I forget the name of the RStudio mailing package... Blastula? Right, blastula. I haven't used that yet. So it's Excel and it's Cognos. Cognos is on 10 desktops, which is roughly half of our firm, so it's a pretty good deployment.

What was your workflow when designing the UI for the Shiny app? How did you make sure the data was presented in a useful way to the bottom-up analysts? That's a good question. I think I tried it about three or four times, and I bounced back and forth between flexdashboard and standard Shiny. I have two levels of tabs, because there's a group of information related to stocks and a group related to our portfolios, and each of those has tabs underneath it. I had a hard time making that work in flexdashboard, so I went back and coded it in Shiny from the beginning.

But I use this tool in our morning meeting every day. It's up there. We bounce back and forth between Bloomberg and this tool. And so I certainly get a lot of live feedback with people who are having issues with it, or as I hear there's other data they'd like to see there, I incorporate that into the design.