Resources

Data Science Hangout | Moody Hadi at S&P Global | Unlocking Business Value with Data Science

We want to help data science leaders become better. The Data Science Hangout is a weekly, free-to-join open conversation for current and aspiring data science leaders. An accomplished leader in the space joins us each week and answers whatever questions the audience may have. We were recently joined by Moody Hadi, Manager of New Product Development and Financial Engineering at S&P Global Market Intelligence.


Transcript

This transcript was generated automatically and may contain errors.

So S&P Global really has five business lines now. There's the ratings agency; there's the S&P 500 and Dow Jones index business; there's Platts, which is the energy-focused side of the company, with a lot of research on fracking and oil; and Market Intelligence, which is essentially a data and analytics vendor. And then the fifth one is relatively new, called Sustainable1. It's an organization focused entirely around ESG data and finance. In that area, I run new product development, so I'm technically under the product management group. I leverage a lot of the technology folks in data science, a lot of folks in what we call content, which are automation teams that do more RPA work, and of course my own team, which is generally more quantitative, but in a more user-oriented fashion, I should say.

So when I'm talking about new products for S&P, it's typically products that take about six months to a year to go to market. They're not feature enhancements or things like that; they're much more transformational. For example, we've launched a financial statement extraction system that takes a document sitting on your desktop, you upload it, and it becomes part of our platform. Our sentiment analytics for Chinese companies works in the native language, simplified Chinese. So things like that, which are not what you might typically associate with S&P's product offerings. And yeah, we have R in production, and we also use a lot of the NLP tools in Python. So we kind of cross all the usual buzzwords in the data science world. Hopefully that gives you an idea.

Excitement in data science

Yeah, sure. I've been on the quantitative side for about two decades now. The exciting part I see in the data science community is the ability for folks to get a lot closer to the domains they're working in. And I think a lot of the work RStudio has been doing, especially on making it easy to build applications in R, actually helps get that point across. In the quantitative world, you have a lot of folks who just sit behind the numbers; they don't know what those numbers mean or how their client will use them. Over at least the last five years, I've seen a lot more transformation there, where upcoming data scientists actually try to think beyond the numbers to how the client will use them.

The second thing is more on the technical side: the ability to visualize information very easily, with very few lines of code. A lot of what I do on the prototyping side, we use Connect, we use Dash. The net result is there are very few lines of code you have to write to get something that looks like a legacy JavaScript-based platform, but is actually more cutting edge, I would say. And that's important, especially when you're trying to get to the last mile, because in bigger corporations, building those types of applications typically goes through a Scrum agile release cycle that takes quite a few months just to show a client something interactive.

And the third thing that excites me is more on the machine learning side. Neural networks are obviously very big black boxes, but they can mine a lot of data and take advantage of it versus, say, structural models. The explainability side of it has become a lot easier to actually do. We use LIME a lot for model-agnostic explanations. Simple things like that go a long way toward answering, here's a nice time series, but why? That question gets answered nicely, and it ties back to the visualization, because we build things that try to simplify that answer without the customer having to reach out to support, or, worst case, having to jump on a call. You want things to be more on-demand and self-service.
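The core idea behind the model-agnostic explanations Moody mentions can be sketched in a few lines: perturb an input, weight the perturbed samples by proximity, and fit a local linear surrogate whose coefficients act as the explanation. This is a hand-rolled illustration of the LIME-style technique using only numpy, not the `lime` package's actual API; every name below is illustrative.

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=500, kernel_width=0.75, seed=0):
    """LIME-style sketch: fit a proximity-weighted linear surrogate around x.

    predict_fn maps an (n, d) array of inputs to an (n,) array of scores;
    the returned vector holds one coefficient per feature -- the local
    explanation of the black box's behavior near x.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Perturb the instance of interest with Gaussian noise
    X = x + rng.normal(scale=0.5, size=(n_samples, d))
    y = predict_fn(X)
    # Weight perturbed samples by their proximity to x (RBF kernel)
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / kernel_width ** 2)
    # Weighted least squares with an intercept column
    A = np.hstack([X, np.ones((n_samples, 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # drop the intercept

# A "black box" that secretly depends almost entirely on feature 0
black_box = lambda X: 3.0 * X[:, 0] + 0.01 * np.sin(X[:, 1])
coefs = explain_locally(black_box, np.array([1.0, 2.0, 3.0]))
# coefs[0] lands close to 3.0 and coefs[2] close to 0: the surrogate
# attributes the prediction to the feature that actually drives it
```

In practice the `lime` package handles the sampling, kernels, and feature selection for tabular, text, and image inputs; the sketch just shows why a surrogate's weights are readable as the "why" behind a single prediction.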

Sharing applications and communicating with clients

So for me, it's two sides. One is internal, and by internal folks I mean our salespeople demoing something to a prospect. It's a new product, and we're still assessing the market appetite, the addressable market, basically. So we build it in house, and it's served over the VPN at a simple URL, on Connect, typically. They go out and talk to clients and showcase, not really the visualizations, but the workflow; these are really workflow-based tools. Then depending on the feedback, that's when we figure out the best channel to put it on, whether it's a feed or a desktop offering.

The other one, which is more interesting, at least to me, is the external piece. We have something called the incubator at marketintelligence.com. That's an external-facing, pre-production environment, if you will; we can sign up clients and prospects on it under some SLA-type deal. Whoever you sign up goes in and does their business using that application. That's serviced in some cases by Connect, depending on how many concurrent users we have, and in other cases by Dash and Plotly. But all of those have some lifetime to them. Eventually, within six months to a year, they go back into the platform offering, which is all in JavaScript, basically.

And as a follow-up to that question, I see there's a question on Slido. Do your clients set standards for which tools and approaches they need you to use? Or do they just care about end results, and you're free to choose?

It depends. They have standards, though not necessarily on the workflow; the workflow just needs to be, quote unquote, intuitive, which unfortunately takes a lot of iterations. But depending on the product, in credit analytics, for example, you can't just arbitrarily decide to build a neural network to do implied ratings. Although it's not an S&P ratings product, it's lowercase letters, it goes through a lot of audit, a lot of stability criteria. So you can't just arbitrarily decide to change the model. But there are NLP things where I don't think they care. We have the text in the transcripts and filings: we take an earnings call, convert it from voice into machine-readable text, and then run a regularized bag of words against it. That could also have been done with a neural network; the regularized bag of words was just easier to explain. The China sentiment analytics, on the other hand, is completely neural network based, with model explanation on top of it.
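The "regularized bag of words" approach is easier to explain precisely because each word ends up with its own signed weight. Here is a deliberately tiny sketch of the idea: a four-document toy corpus (illustrative, not S&P data) and a hand-rolled L2-regularized logistic regression in numpy.

```python
import numpy as np

# Tiny illustrative corpus: (text, label) with 1 = positive sentiment
docs = [("revenue growth beat expectations", 1),
        ("strong earnings beat guidance", 1),
        ("profit warning missed expectations", 0),
        ("earnings missed guidance badly", 0)]

# Bag of words: one column per vocabulary word, counts per document
vocab = sorted({w for text, _ in docs for w in text.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for r, (text, _) in enumerate(docs):
    for w in text.split():
        X[r, idx[w]] += 1
y = np.array([label for _, label in docs], dtype=float)

# L2-regularized logistic regression via plain gradient descent
coef, lam, lr = np.zeros(len(vocab)), 0.1, 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ coef))    # predicted probabilities
    grad = X.T @ (p - y) / len(y) + lam * coef
    coef -= lr * grad

# The explanation is just the fitted weights: one signed number per word
weights = dict(zip(vocab, coef))
```

The explanation falls straight out of the model: words seen in positive documents carry positive weights, negative ones negative, which is far harder to read off a neural network without a tool like LIME on top.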

Biggest challenges

Right now, honestly, I would say the biggest one is hiring. We need to hire more qualified data scientists. We obviously have a team, we have several teams, but I think that's one of my biggest items to address for 2022. On the more technical side, I think explainability. This is a problem, especially as we start dealing with this big data surge, this ability to mine big data, for lack of a better term. It's always nice to see a time series of some input, but the reality is the end user typically won't take that as gospel.

So, trying to explain why, especially with the subjectivity part of it. We have folks who have an opinion about a particular market and other folks who have a completely contrary opinion about the same market, yet you're trying to build something that works for both. Trying to be very transparent and unbiased is important, and that's where the model explanation comes into play. As long as both sides understand where you're coming from, and it's defensible and self-consistent, you don't run into that sort of problem. But that is still evolving. In finance and financial engineering, there's some form of absoluteness to the values you produce: a lot of structural models, a lot of implicit assumptions.

I think the question, there's no such thing as a better model. It's more like, what's the context, and is it self-consistent, right? Those are the two things that you want to keep in mind.

So there's, you know, somebody wins, somebody loses, right? In some of this weak-predictor area, there isn't that. The only thing you can tell is, I've looked at something, this is how it works, and this is why I put it at that level. I think that's the piece people forget about, especially with sentiment analysis. They think there's some absolute measure, but it really depends.

Hiring challenges and the R-Python split

You know, I think it depends on what you're looking for, on both sides, the applicant and the hiring company. Some applicants, I mean, S&P is a pretty big company, just don't want to work for every company, so you run into that problem. On the other side, the teams I deal with need to know both R and Python well enough to at least use them for exploratory data analysis, plus a little bit of distributed programming, and that's very rare. Generally, people coming out of grad school tend to gravitate to one or the other, and it's hard to switch that mentality. In my experience, those are the two major roadblocks behind the mismatch.

I totally agree with everything you just said, that division, that R-Python split. You know both, you can talk to both, because you'll need different tools for different projects. Ultimately, I'm just wondering who on your team is thriving. What makes a team? Because you're talking about all these different projects you're doing in different ways. It feels like your work isn't as streamlined as we'd probably all want it to be at this point; it's evolving, it's developing. Can you describe some of the folks on your team who are phenomenal and doing really well in that environment?

Yeah, I hear you. The ones that are thriving come in with a pretty strong technical background, both in math and programming. The ones that move up faster are the ones that understand what you're trying to do with it, because, as you can imagine, some days you're working on some sustainable development goal thing, another day on sentiment analysis, and another day on a structural model. The ones that actually get what you're trying to do, we put them in front of clients, and they thrive by becoming a lot more product-management oriented. They're able to explain it as well as a product manager, but they also, of course, have that technical background. Those are the ones that move up.

The second side of it is the management piece. Because my team is relatively small, we rely on some of the IT folks, and we rely on what we call content, the data acquisition teams, who are also technical folks. Being able to manage those dotted-line teams and still meet your deadlines without slippage, those two things move somebody up the scale.

Yeah, I typically look for grad school and above. I don't specifically look for PhDs, not in a bad way; it just depends what they want to do. The ability to learn quickly matters, because the languages change and the business changes. Being able to adapt is more important to me than pure technical depth. Obviously there's a bare minimum, but between that and just pure technical skill, I'd rather have someone who can adapt, to be honest.

Non-technical skills

Non-technical, yeah, so a lot of it is presentation skills: being articulate, succinct, plus some writing skills, since we do a lot of thought leadership and things like that. It used to be called techno-marketing: white papers, some form of white-paper-type things. So there's that level you want to have. Honestly, not all of it comes out of school; you're not going to have all that. So typically we bring the junior person on calls, on mute, just shadowing, to see how that articulation works. Lately I've been putting folks on internal calls with business-side colleagues, where if they make a mistake or say something kind of funny, there's no real problem. Eventually they graduate up to a client call.

Again, if you were working at other firms, I do expect some level there, but if you're coming straight out of grad school, I'm not super worried about it so long as you can adapt. A lot of it, actually, is listening.

You rarely jump on a client call where everybody's just glowingly happy, you know?

Job descriptions and the hiring process

Sure, yeah, and that's something I've encountered a lot. Having recently been promoted into a leadership role for clinical data science, and what we do in clinical data science is pretty specific, going across different job boards and seeing different postings, there's always some disconnect between the data science hiring manager, the recruiting partner, and how the job ad is written. One good example, brought up earlier today and in earlier sessions with Julia, and with John Thompson a couple of weeks ago, is the idea of Python and R together, interoperability. When that isn't stressed enough in the job ad, a candidate can get past the recruiter to the data science hiring manager when they really just want Python, but R was just as prominent in the job description. As the primarily-R programmer that I am, since that's primarily what we use on the clinical side, we've now wasted a recruiting interview and a hiring manager interview on something that could have been weeded out through the job description alone.

Yeah, I don't have a solution; we try to do the same, and we have the same problem. The recruiter has a sort of HR-type call, but by the time the resume hits, it's basically almost a buzzword match. Then they offload it to a first line of defense, someone on my team who does an interview, and we ask the kinds of questions you were asking in the beginning to try to fish out where the applicant stands. If they graduate from that, they go into the second and third rounds. Yeah, it's tough, right? What I try to do is reach out to the network, obviously folks at RStudio, folks around that I know, because they're more than likely to have applicants who at least understand the domain, which makes it a bit easier to weed out. But general posting is, unfortunately, brute force; you just go through it. And I don't really blame the applicant or the recruiter. I don't expect the recruiter to know it as well as you do, Ian, for example, right?

Selling the value of data science internally

Yeah, I had to do a lot of selling, both on the language and the value proposition. It was funny, because data science as a term, and the teams we created after that, also came from another acquisition we had. There was a bigger top-down initiative to scale up. Like I said, technically I'm not data science at S&P, I'm product management, but I happen to have data scientists report to me. So selling the value proposition there was a little tricky. And I think it's still tricky, because you end up trying multiple things and failing sometimes, and the business doesn't always like that. They want you to buy data, do something with it, and then make money off it; that's basically what we do. And buying data and then finding out it doesn't work usually reads as, you must have done something wrong, right?

What I did was, we actually presented to someone pretty high up on the food chain. I still remember our R Markdowns when they first came out. We were lucky that our CEO at the time, he's still there at S&P Global, Doug Peterson, was an analyst at Citi back in the day. He looked at what was almost a financial research report, and it was very interesting to him: oh, wow, you can do this now? You can actually pull the data, run the model, and have the analysis all written up in one simple document, all in a markdown. I'm not saying that sold everything, but it at least sold my team.

It's a little harder now, because the threshold has gone up a lot. But my suggestion for selling data science to the business is always: don't just throw ten different R&D efforts out there. Pick the one or two that actually succeed. And then for those, do what I do with the incubator: build a Shiny app, build it the way you think a client would use it, and show that. Because that sells it more than just saying, I can do, you know, a soup of technical jargon. That doesn't help anybody.

Most unique problem solved with data science

Well, I would say the China sentiment. I really knew nothing about China beyond the news, and it took about three or four years to get to the point we're at now. The funny thing is, none of my team members spoke Chinese. We had to build something covering the 20,000 or so companies there. The amazing thing is there's a lot of data collected, and because we're S&P, we can access that data; in some cases, a lot of it is public domain. But actually trying to understand that business and that context, and make it easy for the rest of the world to have some window into that area, especially when I really didn't know much of anything, that was very idiosyncratic. We hired labelers, and we had some folks who are in China and specialize in that market help us. So that was very different from what I'm used to doing.

If you generalize it, it's situations that are very idiosyncratic; that happens to be the most idiosyncratic one I've done in the last 10 years or so, where you actually have no idea going in. Then there are other things that are more like big data problems. I'm working on something with transactional credit card payment data, terabytes of information you've got to mine, and that presents its own difficulties.

But yeah, something complex. A lot of it was spent understanding what we were trying to do and for whom. From that point on, it became almost a linear set of instructions to execute, almost a decision tree, so you can exclude certain options. And I learned the hard way that the prepackaged NLP tools you see out there are mostly built for English and other Latin-script languages. We used fastText, for example, and it was really bad for finance in simplified Chinese, because those models are trained on a lot of Wikipedia data. When you start adding the language plus the context, they fall apart. If you're doing a generic, hello-world kind of NLP problem, it's fine. But once you start trying to use it for something more specific, it falls apart. So we had to do our own vectorization for it.
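One segmentation-free fallback when pretrained vectors fail out of domain is to build representations directly from your own corpus; for Chinese, character n-grams sidestep word segmentation entirely. This is a deliberately minimal standard-library sketch of that idea, not their actual vectorization, and the toy sentences are illustrative.

```python
from collections import Counter
import math

def char_ngrams(text, n=2):
    """Character n-gram counts: a segmentation-free representation,
    handy for Chinese text, where tokens are not space-delimited."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy snippets: two about profit growth, one unrelated (weather)
doc_a = "公司利润大幅增长"  # "the company's profit grew sharply"
doc_b = "利润增长超出预期"  # "profit growth beat expectations"
doc_c = "今天天气非常好"    # "the weather is very nice today"

va, vb, vc = (char_ngrams(d) for d in (doc_a, doc_b, doc_c))
# The two in-domain snippets share bigrams (利润 "profit", 增长 "growth"),
# so they score higher against each other than against the unrelated one
```

Here a representation built from the domain's own text captures the similarity that matters, which is exactly what a generic Wikipedia-trained model can miss once language and context are combined.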

Project ingestion and data readiness

The question is really about how you deal with project ingestion. As data scientists, we get multiple requests for projects, and all of them are urgent for the business. When we get into it, that's when we find out that the data needed to make the decisions is either completely missing, partially missing, or not enough to make the decisions. But that's sometimes several months in, where our data scientists are spinning their wheels trying to put things together: this thing in a Postgres database here, an API there, sometimes PDFs we have to scrape. And it gets frustrating, because a lot of data scientists really want to focus on the modeling aspect. We all recognize the 80-20 split, where you spend 80% of the time on the data. However, there has to be a balance; it can't be all we do, all the time. All our data scientists are very capable, but just because you can do the data cleanup and data munging doesn't mean that's your primary role, and folks do get burnout. So we're trying to develop a better ingestion process: how do you accept projects and evaluate data readiness before we engage the data scientists and launch it formally as a data science project?

Yeah, because I'm running a small team, my opportunity cost of someone failing, not because of them but because of the data, is actually high. I don't have a bulletproof way to prioritize projects. Generally speaking, we try to look at revenue potential. Is the data completely new? And if it's new, can it be linked to something we already have? Generally, if you get something that's independent from the rest of the database, your value proposition is going to be much harder. But if it's related to companies we cover today, which have other measures, then there's a higher chance of success. But yeah, it's happened to me where you work on something for six months plus, and you end up with, well, I guess it didn't work.

It's unfortunate, but it comes back to the business side. Whoever's proposing the project, you try to do your best to have them explain, in very simple terms, what they want out of it. I use that when things fall apart because they weren't very well defined: well, we really didn't know, so we tried, and it took a while because of these complexities, and at least we found out why we failed, so that's not really an opportunity anymore. That way, you can dance a bit around the issue, right?

As for burnout, at any point in time, any member of my team is working on at least a project and a half, maybe two. They take one end to end, and whatever residual capacity is left goes to supporting someone else on the team, so that they don't get burnt out too much.

Handling dashboard maintenance and handoffs

The other question I have is that we sometimes fall victim to our own success and capability, because the initial phases of a project are often very descriptive in nature. We deal with a lot of engineering data that the engineers themselves aren't looking at. We focus on forecasting models and getting those insights, but the very first part is all the descriptive analytics that has to happen. So we end up creating dashboards and showing them, hey, this is your data, these are the visual insights, before we even touch the modeling, which they love. But then we also end up having to maintain those long term and become the dashboard maintainers. I see the value in it, but making a one-time dashboard to show insights and then focusing on the modeling is very different from maintaining a production-grade dashboard, where if anything goes down or the upstream data changes slightly, you suddenly drop everything and fix it, because so many people are looking at it. Do you face similar issues? And do you have teams that take over dashboards, whose job is just to maintain these production widgets?

Yeah, that's funny. This has happened over time; I didn't really think about it up front and then structure it, I just fell into problems and then tried to address them. The client-facing part, the incubator, I own, so I'm tied to that; we write it ourselves, with some IT folks helping. Now, for the internal stuff, generally speaking, once there's some success and there has to be a path to production, I turn it over to our tech team, along with the segment's product manager, and they work together to get it done. Then I shut down the old app; it has a lifetime. So I'm not exactly on the hook to keep maintaining it.

You know, you do two or three releases of something in production, because we wrote some service endpoint that's supporting it, and it just happens that the software engineers don't necessarily have the same background, and I'm talking programming background, not math. That's where I'm focused now: getting that team up to speed so they can do feature enhancements without involving my team.

What I end up doing is allocating folks directly to them and saying, you own it now after this release, so you have to shadow, get your hands dirty, and then take it over eventually. Anything that has a path to production that's already defined, once I've gotten the feedback, I shut down, basically.

Impressive deliverables and the financial extraction tool

Oh, the financial extraction tool I was talking about in the beginning, where you take an unstructured PDF from a small or private business and upload it into our system. It's literally sitting there, like you just got it from an email or someone else. You click through a few points in the desktop, and in a few minutes, the data you care about, those financial line items across the balance sheet, income statement, and cash flow, are already in our platform, with assignments to the mnemonics, the metrics we create, our company hierarchy, our taxonomy. In a few minutes, it's there. That really resonates with the business, and it's an ongoing product that we're expanding further and further.

It's a very simple message: you've got something that came from someone else, and all of a sudden it looks like it's in your desktop. We do this in a systematic fashion for our public company fundamentals data; that's lots of teams validating, checking, NLP extraction, and all that, but it's a much more rigid, organized pipeline. Here, you're talking about something you just emailed me, and I can just upload it, as long as it's a financial statement. That message is simple to understand, and it resonates.

The prototype was in Shiny, on Connect, but now it's in our desktop offering. So you're getting a financial statement of a small business, and you're taking out total revenue, net income, EBITDA, all the measures you care about, but you don't actually have to highlight or do anything. You just upload the document, and it extracts the line items and assigns them to our taxonomy, which S&P has built over decades. So in a few clicks, you take something that never resided in our platform, and all of a sudden it looks like it's part of our platform.
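At its core, that workflow — pull labeled amounts out of an unstructured statement, then map each reported label onto a standard taxonomy — can be sketched with a regex and a lookup table. The mnemonics and label aliases below are hypothetical placeholders, not S&P's actual taxonomy, and a real system would also have to handle OCR, tables, and far messier labels.

```python
import re

# Hypothetical label-to-mnemonic taxonomy; the real one is far larger,
# and these mnemonic strings are made up for illustration
TAXONOMY = {
    "total revenue": "STD_TOTAL_REV",
    "revenue": "STD_TOTAL_REV",       # alias: different filings, same concept
    "net income": "STD_NET_INCOME",
    "ebitda": "STD_EBITDA",
}

# Matches lines shaped like "Some Label   1,234,567"
LINE_ITEM = re.compile(r"^\s*([A-Za-z ]+?)\s+\$?([\d,]+)\s*$", re.M)

def extract_line_items(text):
    """Pull 'label  amount' rows out of a plain-text statement and
    assign each recognized label to its standardized mnemonic."""
    out = {}
    for label, amount in LINE_ITEM.findall(text):
        mnemonic = TAXONOMY.get(label.strip().lower())
        if mnemonic:
            out[mnemonic] = int(amount.replace(",", ""))
    return out

statement = """
Total Revenue    1,250,000
Cost of Sales      700,000
Net Income         180,000
EBITDA             310,000
"""
items = extract_line_items(statement)
# Recognized labels land on mnemonics; "Cost of Sales" has no mapping
# here and is skipped, standing in for the interactive click-through step
```

The lookup table is what makes the "few clicks" possible: once a reported label resolves to a mnemonic, the line item slots into the platform's existing hierarchy without the user highlighting anything.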

Unlimited resources and the data repository

That's a good question, actually. I'd love to dig into our full data repository and identify what can be linked efficiently. We have so much data that we don't mine. The China sentiment data, for example: somebody else bought that a long time ago as part of something, and it was just sitting there in a folder. So I would love to do more projects like that, where I look through our full repository and see what we actually house today, without needing to acquire additional data sets, and then see if we can combine what we have into something more useful.

Yeah. So we have data lakes, a lot of data lakes, a lot of data warehouses. We have a structured way of keeping track of it all. The problem is it loses the business context: one segment buys the data, it gets linked, that segment uses it, and then we forget about it. It just sits there, with technology people and data people supporting it. So I would go on a fishing expedition, basically, go through it and see what can come out of it. Because I would hazard a guess we're sitting on a lot of things that cost us money to maintain, and we probably use a fraction of their value.

Advice for aspiring data science leaders

Yeah, we kind of hit on it earlier, but really understand the context you're trying to solve for. Obviously there's some level of technical capability you need in math, statistics, and programming, but after that, to move up, you really need to understand the business use case you're trying to solve for, in very simple terms. That, I think, is what separates a quant from a leader: understanding that context. And you should question your own numbers a lot. A lot of times it's, oh look, the correlations are like this. Great, but if you see something that doesn't look like the norm, question why, dig into it, and be ready to explain why it's not like the norm. I care more about those outliers than anything else. So really, question your numbers, and understand the context you're applying them to.