
Art Steinmetz | Open Source Data Science in the Enterprise | RStudio

Art Steinmetz is the former Chairman, CEO and President of Oppenheimer Funds. In this interview, Art gives his unique perspective on the value and suitability of open source, code-first data science for the enterprise.

Timecodes
0:00 Intro
0:10 What are some of the advantages to adopting open source software within an organization?
2:23 Is open source software appropriate for enterprise-level data science?
4:29 How do you build support for open source software within an organization?
5:56 Do executives need to know how to code in order to manage data science teams?
7:56 How and why did you get started with R?
9:55 Any other advice for an aspiring Data Science Leader?

For a more in-depth view of Art's perspective on Open Source Data Science in Investment Management, see this blog post: https://blog.rstudio.com/2020/10/13/open-source-data-science-in-investment-management/

To learn more about RStudio's professional products, and how they can help you scale, secure and operationalize your open source data science, see www.rstudio.com


Transcript

This transcript was generated automatically and may contain errors.

Just over the past few years, there have been enormous advances in data analytic capabilities and in the tools, but most of the energy is on the open source side of things. Commercial software development is, of course, constrained by the amount of resources that any individual company can put into it, whereas on the open source side, the entire community contributes to the development of new capabilities and new tools. On top of that, there is a lot of professional support for open source from the big tech companies, which is growing by the day.

So you get access to really cutting-edge analytical tools because you use open source.

The other interesting thing is that you get access in the open source community to tools and capabilities developed in other domains. Whatever domain you are in, maybe it's environmental sciences or health care, you can get access to tools and packages that were developed for financial services or vice versa, and you see that kind of cross-fertilization really power the open source communities. Many of the most widely used and powerful analytical packages work across multiple languages. TensorFlow is a good example. That is a powerful machine learning framework that was developed by Google. It's open source, and it's available in all of the major open source languages, R and Python, to name the two main ones. So people can use the tools they like to achieve the results in the languages they like, knowing that there's consistency in terms of the underlying algorithms and operation of all these packages.

Finally, code-based analytics make it easy to see the flow of the analysis, and it's essentially self-auditing. In highly regulated businesses like health care and financial services, where I came from, this can be a lifesaver.


Open source for enterprise-level data science

Open source is absolutely appropriate for enterprise-level data science. It's a great way to boost productivity as it can empower all the interested parties in the organization. At my company, there were many people that I, as a senior manager, didn't even know about who were working on developing projects using open source tools. So you can let a thousand flowers bloom. Now, importantly, such freedom can come with enterprise-level curation. You can do this in a controlled way. Of course, IT likes that because you don't have a million people launching different things which are unaudited or unvalidated, but you can still have the same creativity on the team.

Tools like RStudio really help provide this full production life cycle within a curated environment that is enterprise-grade. There are so many people contributing to that community. There are so many tremendous packages out there. With commercial software, you can be comfortable that it has been audited and vetted and has gone through a process before it was released into the wild, one that is more rigorous than what might happen in the open source community. But at the same time, in the open source community, the crowdsourcing and the enormous number of people that are contributing means that the quantity of tremendous enterprise-level packages or tools is much, much larger. There are tools that can help companies curate, audit, and validate all of those tools that are out there. Tools like RStudio let you do that and have a walled garden where the appropriateness and quality of all the packages is monitored within the organization.

Building support for open source in an organization

In my personal experience, building support for open source software in an organization means starting small. What is very powerful, I think, are small, gee-whiz demonstration projects. Take this in small bites, projects with quick turnaround that demonstrate value. Then you're going to have people looking over your shoulder saying, how can I do that? So you don't have to start out with a large investment. In my experience, the power of these tools is such that you very quickly have people saying they want in. They want to try to do some of this stuff themselves.

So you obviously don't want the appearance of going rogue in your organization and rejecting IT standards. But very quickly, you will find that you develop IT buy-in. I think as the years have gone by, that's much, much easier to do. I think certainly within our organization, when I first started getting involved in this stuff, there wasn't a lot of IT support. But by the time I left my organization, IT was a major sponsor of open source tools because they knew that they could make sure that anything that was allowed inside the organization did meet our quality standards.

Do executives need to know how to code?

Certainly not. Executives don't need to know how to code to manage a data science team. I was a coder and it did help me be a champion with some credibility in the organization. But it's absolutely not necessary. And you can't really expect that or make that a requirement. You know, the higher you go in any organization, the more you are going to be responsible for supervising people who are experts in their domain and know more about that than you do.

I was a CEO. And of course, as most CEOs do, I was promoted through a particular business vertical, in my case, investments. That's what I had expertise in. But I didn't have any particular expertise in sales or legal, and certainly not IT. So an executive doesn't need to be fluent in all topics and indeed cannot be. But it helps to be conversant. And some of that is learning on the job. But to do that, at a minimum, you need to be curious. So you will have lots of people who work for you who will know things you don't and you have to be respectful of their knowledge. And you need to listen to those people who are coming in and telling you what needs to be done and listen with an open mind. Ultimately, the leader has to make the decision, but you want to make it with the best possible input from your teams and be respectful when someone says, I don't think this is a good idea. But again, leadership is about making the call at the end of the day. But you have to be very mindful of where the expertise lies in your organization. And that applies in all areas, of course, including leading data science teams.

Getting started with R

I got started with R about 2006, so that's 15 years ago. I had become frustrated with the limitations of Excel. I felt I had taken it as far as I could. One of the things that really bugged me was how my current self had no idea what my past self did when I opened a spreadsheet from a year or two prior. So how do I remember where the critical dependencies are across multiple sheets? What does this formula do? All of these things became very opaque in some of the really big, complicated spreadsheets that I did. I liked the idea of open source tools. So I discovered and started working with R. And I started working with one toe in the water. I was moving back and forth with Excel, importing and exporting data, using some macro tools to manage that. But as the months went on, I used Excel less and less because I got more fluent with R and found that I was getting answers faster and getting answers with reusable code much more frequently. So eventually Excel just sort of withered away and I stopped using it.

And when I became a CEO, obviously playing around with spreadsheets or R wasn't part of my day job. But I liked to keep myself sharp in terms of working with R. And this had a good corporate purpose because it helped me be an evangelist for R within my own organization. And again, by the time I left the company, these open source tools were an everyday part of analytics across just about all of the verticals in my company.

The power of these tools is so, so large, and it's so easy to find really cool examples of what you can do with these tools. It doesn't take very much digging. And certainly, if you look at some of the examples in the RStudio gallery, you see all sorts of stuff, all sorts of gee whiz little projects and dashboards and tools that for me immediately made me think, wow, yeah, yeah, that could solve a problem I have, or that would be a great way to present some of the information that's specific to my domain, even if the information or the data set isn't relevant to my job. So it's very easy to find things that are very exciting to help advance data science in any organization.

And it's still early going on the power of data science in many organizations. There are many organizations, even very sophisticated quantitative analytic organizations that haven't gone after the low-hanging fruit that data science has brought to fruition in the last few years. I'm speaking from experience on this. So, you know, once the ball gets rolling, it's very easy to keep it going.
