Dmitri Adler & Merav Yuravlivker | The shift to data: Industry trends in finance

And banks and financial regulators are continuously looking for solutions to combat threats that are not necessarily imminent, but equally are large and systemic.

In terms of identifying consumer behavior, there is a big push towards identifying non-traditional sources of information about population movement and population behavior. So what happened was, in the pandemic, people stopped showing up to places in person, right? This virtual presentation is a case in point. And so banks started to ask themselves the question of, do we need physical branches? And if so, how many and where? And so we've seen use cases where banks are pulling cell phone data to start to understand the new behaviors and patterns of physical movement, juxtaposing that with their general ledger to understand how the local bank locations are being utilized and whether or not there's enough transaction volume to justify actually having a branch open, whether it's from a deposit growth standpoint or a loan growth standpoint.

There's a variety of technology products that are increasingly used to transfer value, right? We've all heard of cryptocurrencies and the latest trend in non-fungible tokens, which is basically pieces of code that serve as a proof of record that somebody indeed created something and somebody else owns it. What's interesting about the utilization of technology as a proof, as an evidence of something occurring, is that if you think about the nature of markets, it's all about supply and demand. And where there's a constricted supply and an excess demand, you start to have price growth. In fact, the largest, the best performing asset class over the past 30 years has actually been fine art paintings, effectively, right?

So it's a constrained supply. You're not going to get any more impressionist paintings. And you're looking at prices of works skyrocket from $2 to $20 million over the past couple of decades. And so the ability for technology to serve as a proof of record that something indeed happened is opening up spaces for artists and other creative use cases to actually leverage technology and then finance to actually grow and create value. That's a fascinating trend that's poised to remake a lot of the financial space.

And so when I saw that sitting back at JP Morgan's desk at the time, it became very obvious that had somebody been doing this analysis on open data all along, they could actually identify choke points in the financial system, identify this type of systemic risk much, much earlier.

What does it mean to be data driven?

So thank you so much for telling us all about that, the data that you're using, and hopefully that's maybe sparked some ideas about additional data sources that you can use. You know, one of the big trends that we're seeing in finance, and I think Dimitri alluded to earlier, is the fact that more and more institutions are focused on becoming data driven. But a lot of times this term is used in a nebulous way. And so one of the pieces that we've found to be most helpful is to be able to define a little bit better, what does it actually mean to be data driven? How do you know if you're data driven and how can you identify the key pieces to work on within the organization?

So the way that we've identified that is across two different axes. We have our data literacy, which is the overall knowledge as well as the governance and oversight within an organization. And then also the data infrastructure, which, especially in finance, is crucial to ensure that data is accessible appropriately and that it's also stored securely.

So thinking about data infrastructure, we've identified these three pillars. Data collection, data storage and data access. So in terms of data collection, you know, I saw a lot of evidence about the data that folks are using with its customer transactions, CRM data, student data. So presumably there's some collection that happens on a continuous basis. If not, that's definitely something to look into. But even more importantly, making sure that the data is collected in a way that is then easy to analyze and to store. Is also pretty crucial. So if you have a bunch of data that's disorganized, maybe there's a lot of missing values that makes it a lot more difficult to pull insights from it.

So ensuring that the data collection that you have is not only collected in a timely manner, but then also stored well and then stored in a way that makes it easy to pull insights from. The second pillar is data storage. I think everyone on this call, especially if you're in the finance industry, understands a lot of the information that you're dealing with is personal information, whether that's credit scores, spending histories, bank account information. Ensuring that your data is stored securely is paramount to that.

But at the same time, the third pillar, data access, is also very impactful because if your teams cannot access the data that they need to make timely decisions, then that makes it difficult to use. And that makes it difficult to use the actual data that you've collected. So, you know, you can start to ask yourself, the data sources that you're using now, is it easy for you to access? Is it easy for your colleagues to access?

And then on the other axis, we have data literacy. And with this, you know, one of the biggest challenges that we see is a communication gap between the data and the non-data professionals within an organization. So becoming a more data-driven organization really starts from the top, and that's in terms of data leadership. Do executives actually champion data utilization? I would say most do, but maybe they don't understand the resources that they need to allocate or the time they need to give people to better understand how to use data. So if you have a data champion in your organization, if you have one person who is helping set that data strategy, then that's a good sign that your leadership is taking this very seriously.

The second pillar that we have under data literacy is data governance. So I think especially in finance, you probably have a lot of guidelines where you can access the data, how you can access the data, who can see the data. Do you have these guidelines somewhere? Does everybody have access to them? Are people using data in a way that's uniform to them? So making sure that across the organization, people are using data in the same way so that it's easily accessible to all and easily transferable and readable. So something simple or something straightforward, if you don't have a data dictionary in place, putting one in place already can help you ensure that all of your variables are defined in the same way, which makes it a lot easier to clean and then to analyze later.

Last piece is data knowledge. Does your staff, do you, do your colleagues know how to ask the right questions about data? Do they understand how to interpret the insights about data? Without that, it makes it really difficult to start to become more data driven and to use data to inform your insights. So, you know, a lot of ways that we address this is specifically about developing programs within organizations to help them become more data driven and to drive that continuous culture of learning.

So as I'm going through this, maybe you have some questions or maybe you've identified, hey, what's a key pain point? Like, maybe there's data that you're unable to access, or maybe you're seeing that a lot of your colleagues are having difficulty asking questions about data. You know, one source of data that we see a lot are a bunch of Excel spreadsheets that are live on somebody's laptop. And then the minute that person leaves, those Excel spreadsheets disappear, right? That's something that we refer to as dark data. So this is something that can be ameliorated with data governance or data storage.

So if you are wondering, where is all of this data? Does all of my data live on Excel on my own laptop? Maybe start thinking about how you can transfer that to a secure space that's maybe that's on a cloud environment so that others can access it besides you, because that type of data can be increasingly more important.

Steps to start with data analytics

So talking a little bit about this, it all starts with us, right? It all starts with everybody in the room. So how can you start with data analytics if you're not already doing this already? Make sure you're asking questions about data, ask for metrics, make your metrics specific, make it measurable so that people understand what you're looking for. And that can help drive the insights that you find and then the decisions that you make behind that.

For inventory, you know, we ask what type of data do you have access to? I'm curious how many of your colleagues know what type of data you have, right? So finding the information that you have access to, maybe you'll discover that you have this whole other database you weren't even aware of. Maybe you find out that you're already collecting data about something you had questions about in the past. So especially in a lot of large organizations, we find that most people don't know what's available to them. And that's not just in terms of data. That's also in terms of tools. You know, maybe your organization already has RStudio enterprise, right? And you weren't aware of that. So asking those types of questions can help you better understand that as well.

And then, you know, for collaborating and talk to your colleagues. What we've seen is when you get a bunch of people in the room to start talking about the challenges that they're facing data or the data they have access to. You'll see that a lot of you are working on the same issues. And to be able to work together on that, you're, you know, instead of doubling the work because you're both solving it individually, solving it together tends to make that work go faster. And you might also see that there are other, you know, aspects and other data sources that you didn't realize you had.

How can you support data literacy? So beyond actions that you're doing yourself, what can you do within an organization? Doing these types of lunch and learns internally, perhaps. Maybe you'll find this useful. Hopefully you do. So bringing in other experts to speak about these trends, to speak about how data is being used, to speak about these different types of use cases to inspire others around you to start to incorporate data. Go to data conferences, setting up training opportunities based on skills gaps. So where do you want to improve your skills in the data spectrum? And even planning events such as data competitions can be a really nice way to identify top talent. And then also encourage others to start to think about how they can apply data to their work.

Ethical use of data

So before we finish up the presentation, this is an area that we always like to emphasize both in our training programs and also in our presentations, which is about the ethical use of data. So this is a case study that maybe some of you know. This was one that was done by Target. So essentially about 20 years ago at this point in time, Target had a lot of information. They're a huge retail store in the United States for those who might not live here. And they had a large customer base of data about purchases, past purchases. And wanted to start to predict which of their customers would be pregnant because they know that when an individual is pregnant, they tend to get more set in their habits. And if they can catch somebody during that stage, they tend to be more loyal customers throughout their life.

So in order to do that, they had their data scientists sit for about two to multiple years of data to better understand, OK, which of their customers became pregnant. And then what were the key factors there that they can use to then predict who will become pregnant? So they did this analysis. And at the end of this analysis, they found some key indicators included buying ginger ale to help with nausea. Stopped buying wine was one of the factors that had a big correlation, as well as buying prenatal vitamins and things like that. And based on their predictions of who would become pregnant, they started to send out targeted flyers. And you'd think this is a great use case of data analytics and it is a demonstration of an effective use case. But what ended up happening is they sent one of these flyers to the parents of a 15 year old who hadn't told them that she was pregnant.

So that became a big case that happened with a few other families as well. And it started to bring up the question of it's one thing to be able to mine customer data for insights. But how can you make sure you're using it in a way that doesn't cause harm, doesn't have ethical implications to it? So it's important, especially given that we're in the finance space, to think about how we're using customer data and how we need to be mindful of that fact.

Now, interestingly enough, what Target ended up doing, it's not that they don't use that data for insights anymore. What happens is that they start to put lawnmowers next to baby cribs. So they just are a little bit more subtle about how they advertise that in the future.

Speaking about ethical considerations, another one that's popped up, especially in the past few years, is about biases that exist in our current data that inform our models. So, again, it's our responsibility to ask these types of questions. There's one where it's, you know, how does the data reflect the society that we're in today? And if it does perpetuate any stereotypes or biases, how can we build models for future prediction or, for example, granting loans that will mitigate some of those biases and risks so that we don't perpetuate that? This is a difficult question. I'm spending a few minutes on it, whereas there's a much larger conversation to be had. But the important point to get across here is, as you are building these models, as you're involved in these models, make sure to ask these questions about, you know, accounting for biases before the data is being put into the model, making sure that the model accounts for that, and understanding how to mitigate that so that you can make the best decisions for your customers without introducing, you know, some biases that might already exist in the data.

Building data literacy in financial institutions

But on the other side, there's that data literacy component of that, and this is an example of how a lot of financial institutions today are starting to realize it's not only the infrastructure they need to have in place, but it's also the appropriate staff with those skills. So we've done a lot of work in the finance space with that. One of the use cases specifically is with Discover Financial Services, where they actually developed a pretty robust infrastructure. And what they realized is they had a lot of new hires coming in that maybe had some of the prerequisite skills with a good foundation, but not all of the skills that they wanted to, you know, that they needed in order to be effective.

So we work with them to build an onboarding program that helped train them in R, as well as in other tools, including SQL and Snowflake, so that by the time that they left that onboarding process, they felt much more comfortable within the infrastructure that they had, so they could be much more effective off the bat. We see a lot of this happening, and it's not just the onboarding process, but because hiring is so difficult right now for data scientists, they're a rare breed, and I'd say especially in the finance industry, it can be really difficult. A lot of financial institutions are now turning to this training aspect, whether they're doing it internally or they're bringing somebody else in, to help develop that skill set across the organization.

So I'll finish up with this question here. I know that there's some other questions in the chat that are more specific, so I can pause here and just say, first of all, thank you for coming to this talk today. Rachel, thank you for organizing it. We love RStudio, especially the Hangouts, and we find that, you know, just the community has been really supportive and lovely, so we want to make sure this is valuable for you, and now is your time to ask us questions.

Q&A

Thank you so much, Merav and Dimitri. That was great. I see a lot of questions coming into the Zoom chat, and then we also have the Slido link if you want to ask questions anonymously there as well.

I think, Rachel, the first question was from you in terms of what kinds of organizations provide employees with technical skill development versus expect you to figure it out on your own. It's tough to paint with a broad brush. We work with a number of financial services organizations that invest heavily in new employee onboarding and continuous employee development. We work a lot with Discover, Inter-American Development Bank, Capital One, all those organizations I know have robust new employee and existing employee training programs. I know we had a very robust one back at JPMorgan, so it's tough for me to say, well, this type of organization likes to do this versus not that. So it's tough to put them in buckets. But I know a lot of organizations do have great programs. And based on our experience, those who do not have something like that in place are increasingly looking to deploy those types of programs. We're seeing a lot of interest.

And I'll just add on to what Dimitri was saying, where I think almost every organization that we speak to or work with has some sort of online training component. And that can be really helpful, I think, especially for a lot of folks on this call who maybe already have a foundation in R or Python and maybe just need to scale up in one or two things. But we found that a lot of people who are new to the space tend to get really overwhelmed with the amount of information that there is. They're not quite sure where to start. And that's where organizations are starting to bring in something that's a little bit more tailored and a little bit more structured to help both increase accountability, also develop that community of sharing and collaboration. And so doing that kind of in tandem with one another helps build that robust learning culture.

So thank you, guys. It was a brilliant presentation. Thanks for that. So my second question was around one of the challenges that we face is figuring out how best to have access into standardized schema documents like XBRL and then pulling that into R for some sort of standardized financial modeling and analysis. Do you have any tips or experience that you could share there?

Yeah. So in our experience, it's been less about the domain and more about the approach. So XBRL is a very useful tool to publish financial records data. But when we are then taking those financial records data and trying to come up with an analysis, it's much more about, for example, a data structure that lends itself to text analysis. So as I mentioned, Neo4j has its own structure for meta-tagging and relating data in non-star schema ways. If you're looking at graph analysis, it's a different approach. If you're looking at classification algorithms, they all have their own sort of data formats that you would need. So unfortunately, I haven't encountered anything specific that sort of is universal for financial modeling. It's more specific to are you looking for portfolio analysis? Are you trying to figure out the Sharpe ratio for your portfolio or are you looking for churn metrics or something else? So it tends to be more analysis style specific.

Neo4j, I've used here and there, but my curiosity is more of, is there some sort of rule of thumb when scaling up Neo4j? Because we all know sort of graph network type tools have a scalability problem. The more vertices, the more edges, everything explodes sort of exponentially. So I was just curious if there's a rule of thumb around number of CPUs, amount of RAM as your network scales for sort of making Neo4j run in a malleable way on a cluster.

Yeah. So I am unfortunately not the right person to ask. I'd need to have our director of engineering on the line for this question. But I know he figured this problem out last year. So I can connect you guys afterwards if you drop me an email.

And just letting everyone know that I did put in both my email and Dimitri's email in case your question doesn't answer or you want to learn more. Or I think it's Eugene's question, you need a more specific answer. We're happy to connect you with the right person and continue that conversation as well.

Gregor, I see you asked a question around R and Python too, if you want to jump in. Yes. Thank you very much for the presentation, Mirav and Dimitri. It was very interesting. And I would have liked to know, because I know Python is also more and more used in the finance industry. And I would have liked to have your impression and feelings about how R and Python evolved or are adopting. And also the second question I asked, what do you think are the two or three other main languages that are entering the finance data driven analysis, please?

So you're right in that the sort of Python user community has grown somewhat faster because it's grown on the backs of software engineers. For example, I started as an R user. And so that's still my first love as far as programming languages go. And I find that, for example, R's syntax tends to be simpler than that of Python. So if you have a software engineering background, picking up Python can be pretty trivial because you're already familiar with the notion of data structures. If you are starting from an Excel or sort of VBA background, then Python might look somewhat foreign and R is going to be a lot easier for you to adopt. Most people who are in finance actually started in some form of spreadsheet software, right? So to them, transitioning into R would actually be easier.

And in my experience, R is going to give you just easier, faster options for prototyping, proving your point, doing the analysis, creating interactive visualization, standing up a light application with something like R Shiny . So I'm a huge fan of R, and that's the main language in which I program in. I don't think that the fact that there's been such a huge uptake in Python, for example, precludes another language like R from being just as widely used and helpful. Ultimately, I've learned it's about your personal preference and what gets the job done faster. Those two, I think, still dominate. There are going to be some languages that come into play when you're talking about scale systems, but I would attribute them less to finance. There's nothing about finance and the structure of a different language that make them better adapted to finance, right? But when you're talking about some scaled applications, especially for transaction processing, Scala is obviously important, right? But that has to do more due to its sort of software engineering and data piping capabilities rather than because it's better suited to finance in some way, shape, or form, in my experience.

Based on your experience, how do you find convincing or working with teams where the business side are very Excel, well-versed, and they might find R a lot more palatable. But the technology groups then come in and say, okay, no, we are not very comfortable with people using R. On the software side, there's a much bigger preference for Python. Working with clients or even within your organization and going through this maze of what should be encouraged or discouraged, or can you even have both of them working? And what kind of relationship should there be between R and Python or between business and technology people? How do you navigate conversations like those?

Sure. So I would push on the word comfortable and who is comfortable with what and who is in the right lane in terms of being comfortable with what, right? Software engineering is going to be responsible for secured scale code. But the finance team is responsible for the analysis being correct, right? And so in my experience, the way that I see it is if you're writing a piece of code, you better be confident that it's working correctly from just a mathematical standpoint. And if the finance team is using R and is comfortable in R, then that has to be the code that actually runs the financial analysis. Now, there are lots of ways to call an R routine from Python and a Python routine from R. So they are interchangeable and lots of hooks around it. So it shouldn't be, you know, everybody must use one language or another language. There are lots of ways to make the two compatible.

And in fact, we do that regularly in our line of work, right? There's a lot of packages that are just better in R than in Python and frankly, vice versa. And you shouldn't limit yourself to one or the other. So then, you know, the conversation is, look, if you're writing most of the workflow in Python, why can't you call this R subroutine? Because our analysis works in there. And if the engineering team wants to reconfigure the ETL in Python, well, that's their prerogative, right? They're thinking about the compute time. But there shouldn't be sort of a debate about, you know, do you need to change the actual mathematical analysis from one language to another? That doesn't feel like a good use of time given the interoperability that's available.

Just scrolling through the questions, I see Davin, you had asked a question. Yeah, I'm just really curious. You talked a lot about the financial sector having trouble finding qualified staff to bring on full time. And what do you see for trends in that type of analysis being available for freelance workers or outside organizations, consultants, things like that? If you just kind of comment on that, I'd appreciate it.

Sure. I don't have any stats to speak about the trend on freelancing specifically. But generally, I try to think about things from first principles. And when a person or a tool does and does not make sense. So I'll answer this question kind of the same way that I answered the R versus Python question. There are great use cases for when to use somebody who is a freelancer. You're usually paying them on a per project or an hourly basis. And usually they would be more expensive than a W2 employee. And you can usually get them onboarded for that very narrow kind of use case much faster than a W2. And so there is a reason to utilize a freelancer, for example, or an outsourced firm when you need to get something accomplished really quickly. But it's an in and out kind of task.

If you're talking about somebody who is a permanent hire. Well, there's a legal question about do you want them to be a 10-9 or W2? It's actually more of a legal and HR question than a substance question. And then finally, if you have a capability that's ultimately going to be core. You probably want that to live with your full-time staff who kind of bet their careers on working with you for the long run. So in my experience, whether you're outsourcing or freelancing. It's great to get initiative going to make sure that the capability is online quickly. But then equally have a plan to kind of have some full-time staff. Having said that, I don't have a command of the trend of freelance use. But I bet if you went to someone like Upwork, they might have some data analysis around that.

Emilio, I see you had asked one in the chat too. I just wanted to know if you have any preference regarding ESG applications, ESG-related applications into credit risk. I know it's a very young field, still working on progress. But I would appreciate if there is any R library I can start learning or any other kind of reference that might be useful.

More specifically, there's a big push to implement climate impact into loan portfolios right now. But it's limited to mortgage. I want to know if there's any other kind of type of loans that might be already being modeled or any effort around that.

Yeah, so I could talk on this topic for hours. I'll try to keep it short. There is a really big push across both regulators, investors, and industry participants around how to measure climate exposure, climate risk. That is true for both energy producing assets. So infrastructure is a big area where this applies. And equally loans to companies that have meaningful climate exposure, not just insurance companies, but businesses based in coastal areas. I know the SEC is doing a lot of work around the appropriate amount of financial disclosures that would be required from regulatory perspective with respect to ESG or climate change risk. We wrote a big white paper for him about that topic. So there's no standard framework that I'm aware of right now. They are all, but there's a lot of research and they're all generally sector dependent. So you mentioned mortgage, certainly, right? Again, for coastal areas. Energy generation assets. That's very much true for them. But I'm not aware of a framework that sort of says, here's exactly how to think about it pervasively.

Just seeing a comment from Brian about the fact that most finance output ends up in Excel and PowerPoint. So using R Markdown simplifies that task. I couldn't agree with you more. We're very heavy users of R Markdown in a lot of stuff that we do from training to the actual sort of solutions, custom solution projects that we build. So I couldn't agree with you more. I'm a huge fan of R for financial services applications. And I think that there needs to be a sort of intellectual acknowledgement that there is software development and there are good tools for software development. And there are tools for finance where you need to get to an analytical result quickly and be able to communicate that easily with other decision makers. And I find that R is much better for that second use case. So from my standpoint, it's not about what's better R or Python, right? They're both great tools, but they have very distinct use cases and they have very underused libraries for calling the subroutines from each other and creating interoperability. I think it's much more about that than what is better.

And there are tools for finance where you need to get to an analytical result quickly and be able to communicate that easily with other decision makers. And I find that R is much better for that second use case.

Well, thank you so much for having Dimitri for an awesome presentation. And I'll give people a few more seconds if you want to jump in and raise your hand with other questions. But that was great. And it's awesome to get the community together.

But thank you all so much for joining today. And we're having Dimitri. I'll work on getting the recording together. And if you have the slides, you could send over to me. That would be awesome, too. Yep. Well, we'll get that to you. And thank you so much for putting together such an awesome group. Super interactive. Lots of questions, which is exactly what we love. Hopefully, everyone feels like they're taking away some new information today. That's really our goal is to just empower people to use data better. And however, we can help do that. We're happy to.

Yeah, I echo that. Thank you, everybody. Marwa, thank you very much for the kind words in the chat. We really appreciate all of you for dialing in and all the really good questions and the great discussion. Thank you very much. And of course, thank you to the R team, to Rachel and Kevin for organizing this and welcoming us and putting this whole production together. Obviously, this wouldn't happen without them. So thank you so much.

Dmitri Adler & Merav Yuravlivker | The shift to data: Industry trends in finance | Posit

Transcript#

Case study: non-traditional data in financial risk

Key use cases across financial services

Venture capital and skills demand

FDIC tech sprints and operational risk

Audience discussion: data sources in use

What does it mean to be data driven?

Steps to start with data analytics

Ethical use of data

Building data literacy in financial institutions

Q&A