
Data platform modernization in insurance | Kshitij Srivastava @ Milliman | Data Science Hangout

We were recently joined by Kshitij Srivastava, Director of Technology at Milliman, to chat about data platform modernization at insurance companies.

Speaker bio: Kshitij leads technology and data operations for the Life and Annuity Predictive Analytics practice in Chicago. His work includes data infrastructure management, data pipeline development, and security operations for Milliman's industry-leading experience studies for VA, FIA, RILA, and life products. His work with Milliman clients also includes leading data operations and actuarial modernization initiatives that have been outsourced to Milliman. He routinely consults with actuarial and data teams within the industry and advises on cloud migration, actuarial modernization, AI, and data governance topics. He also leads a multidisciplinary team focused on data engineering, DevOps, information security, software development, and testing. Prior to joining Milliman, Kshitij worked as a consultant focusing on machine learning and software development for a major analytics services company and as a data scientist for a major technology company.

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
LinkedIn: https://www.linkedin.com/company/posit-software

The Hangout is a gathering place for the whole data science community to chat about data science leadership and the questions you're all facing. It happens every Thursday at 12 ET. To join future Data Science Hangouts, add them to your calendar here: https://pos.it/dsh. We'd love to have you join us in the conversation live! Thanks for hanging out with us!

Aug 21, 2024
53 min


Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, welcome to the Data Science Hangout. If we haven't met yet, I'm Rachel, I lead customer marketing at Posit. Posit is an open source data science company building tools for the individual, team, and enterprise. I'm so happy to have you all hanging out with us today.

The Hangout is our open space to hear what's going on in the world of data across different industries, connect with others facing similar things as you. And we get together here every Thursday at the same time, same place. So if you're watching this as a recording in the future and want to join us live, there will be details to add it to your calendar below.

And I love getting to see all the conversation and connections being made in the chat. So I just want to remind people if you are interested in connecting with others, I want to encourage you to say hello in the chat and introduce yourself, your role, maybe some of the things that you work on or things you do for fun. We're all dedicated to keeping this the friendly and welcoming space that you all have made it. If you're hiring, please feel free to add any open roles in the chat as well.

It's also 100% okay if you just want to listen in here, although we love getting to hear from you live. So there's three ways that you can ask questions or provide your own perspective. So you can raise your hand on Zoom and I'll call on you to jump in. You can put questions in the Zoom chat and put a little star asterisk next to it if it's something you want me to read out loud instead. And then lastly, we have a Slido link where you can ask questions anonymously.

And so with all that, I know you all have heard that spiel a bunch of times so far. Thank you for joining us. I'm so excited to be joined by my co-host today, Kshitij Srivastava, Director of Technology at Milliman. And Kshitij, I would love to have you introduce yourself first here and share a little bit about your role, but also something you do for fun too.

Yeah, thanks, Rachel. Thanks for having me today. Great to meet everyone. My background is in data science. I've spent close to a decade at Milliman, but I started my career as a data scientist, initially at a big tech firm, and moved to insurance, or insurance consulting to be specific, very soon after I started in data science.

And when I joined, I was working mostly with actuaries. I'm not sure how many folks here know the actuarial workflows, what actuaries do, et cetera, so I'll give a fair amount of background about the actuarial profession first, because that's been very important to my own career. Milliman is traditionally an actuarial consulting firm, but we're now trying to expand beyond actuarial to serve other areas of the insurance business.

My own role started in data science, moved to data engineering, and now I'm leading a team of data scientists, data engineers, DevOps, and technology professionals building applications for actuaries. So I'm happy to talk about data science applications within the broader insurance space, but more specifically in the actuarial space.

Oh, yeah, I mean, my work over the last couple of years, I'm sorry to say, has been very hectic. I've been moving geographies quite a lot, and that has left very little time for me to engage in my hobbies. But when I have free time, I like to be outdoors. I like to hike, I like to get out. Very recently I actually met some folks at the Data and AI Summit in San Francisco. While I was there, I had one day with no meetings, so I took an 11-mile hike, and it felt really good.

But yeah, I feel like outside of work, you know, that's what I do. I have two young girls, so I spend a lot of time with them at home, sort of, you know, hanging around. And both are pre-K, so spending time with them, sort of preparing them for kindergarten.

Insurance data science overview

I don't know how many of our attendees are from the insurance space or have exposure to it, but I'll briefly go over the insurance space in general: what the various functions at an insurance company are, and how data management and data science can help with decision making in each of those functions.

And then I'll jump briefly to the recent trends that we've been seeing in this space and cover the areas where innovation is happening and where we see opportunities. So going back to the basics: in general, insurance is the pooling of risks, so there needs to be measurement of risk to be able to insure it. The actuarial function within the insurance company is typically responsible for collecting data and then defining and measuring risks.

And this happens through collecting a lot of data, including external data, and then defining actuarial assumptions, which are basically the risks that we're seeing with a particular product. Say it's a life insurance product: there are various types of risks. There's mortality risk, which is one of the primary risks with a life insurance product, but also lapse risk and premium persistency risk, the risk that people who have signed up to pay premiums will stop paying them or will lapse out of the policy. So those are the various types of risks that each insurance product has.

Studying and measuring each of these risks is a very data-intensive process, as you can imagine. Companies are lucky if they have prior data they can base their assumptions on, but in a lot of situations a company is interested in launching a new type of product, offering new types of guarantees that the industry has not seen before, where there are no prior examples. In those situations it's tough to have data to back your assumptions.

So there are typically a lot of data science use cases in that area. Part of the work that Milliman does with a lot of our clients is to help them design these behavioral assumptions. We use machine learning models and traditional statistical regression models, though I would say our usage leans more toward the traditional statistical side than machine learning, because in general actuaries want to understand the drivers of risk and put a formula to it rather than work through a black box.

So I'd say we typically help companies set these assumptions through the use of data science. That's primarily what I got started doing at Milliman, about 10 years ago now. That's the actuarial use case. Now there are multiple other things that happen. Once you have these assumptions of what mortality risk will look like, there are actuarial modeling systems that actuaries use to project those risks into the future: 30 years down the line, what are the results that I need? What is the present value of the premiums that I'll get?

So actuarial modeling is another very deep area in actuarial work, and my company typically helps with those types of models as well. Now, these are not machine learning models. These are cash flow projection models: you take some assumptions and the present state of the policies and project them through various economic scenarios, 10 years, 20 years into the future.

Traditionally that's also a very data-intensive process, but I'd say there's less data science there in the traditional sense of what data science implies. It's more simulations, more projecting things forward. Insurance companies that have their own in-house actuarial modeling systems are very big users of compute, of cloud compute, because each of these projections needs very large amounts of it.
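As a rough illustration of what a cash flow projection looks like in code, here is a minimal, hypothetical sketch, not any actual actuarial modeling system: it projects a toy block of policies through random economic scenarios and discounts the premiums back to a present value. Every number and assumption in it (lapse rate, mortality rate, rate model) is invented for illustration.

```python
import numpy as np

# Toy cash flow projection across random economic scenarios.
# All assumptions here are invented for illustration only.
rng = np.random.default_rng(42)

n_scenarios = 1000        # number of economic scenarios
n_years = 30              # projection horizon in years
premium = 1_000.0         # annual premium per in-force policy
lapse_rate = 0.05         # assumed annual lapse rate (made up)
mortality_rate = 0.01     # assumed annual mortality rate (made up)

# Simulated short rate per scenario and year, used for discounting.
rates = 0.03 + 0.01 * rng.standard_normal((n_scenarios, n_years))

in_force = np.ones(n_scenarios)      # fraction of the block still in force
discount = np.ones(n_scenarios)      # cumulative discount factor
pv_premiums = np.zeros(n_scenarios)  # present value of premiums per scenario

for year in range(n_years):
    discount /= 1.0 + rates[:, year]              # discount this year's cash flow
    pv_premiums += in_force * premium * discount  # collect and discount premiums
    in_force *= (1.0 - lapse_rate) * (1.0 - mortality_rate)  # lapses and deaths

print(f"Mean PV of premiums across scenarios: {pv_premiums.mean():,.0f}")
```

In practice the per-product liability logic is far more involved and the scenario counts far larger, which is where the heavy compute demand comes from.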

If you're aware of systems such as Integrate or Atlas or GGY AXIS, those are traditional systems that are used for cash flow projections. But for more modern setups, we started doing some of this work five or six years ago on the RStudio / Posit Connect platform, and I know that a number of our clients are thinking about using more modern data platforms such as Databricks and Snowflake for doing actuarial projections. So that's an interesting area; you can see the innovation happening there.

Now, if you move away a bit from actuarial, there's underwriting. Risks are also measured through underwriting: there are underwriters within the insurance company, and underwriting is a great example of how data science is operationalized and used within an insurance company. I work within the life and annuity space, and in that space underwriting is, as you can imagine, very data intensive. There's lots of usage of third-party data sets and unstructured data sets.

Companies might be relying on Rx data, prescriptions, medical history, financial history, those types of things that are typically obtained through third-party companies, and then using fairly advanced machine learning models to determine risks. I'd say the use of machine learning approaches, rather than traditional statistical models, is prevalent in the underwriting space. And if you move from the life and annuity space to the property and casualty space, that's where you see a lot of innovation happening in underwriting with the use of images, videos, geospatial data, and so on.

Shiny applications in production

So Shiny is something that has been quite transformative for our team at Milliman. In fact, some of the applications that we've developed on Shiny are actually in production, used by some of our clients to study their risks. Typically what we have done is put up the results of the models that we develop on policyholder behavior: things like how people lapse out of their insurance policies, or the timing of when people will start withdrawing from their annuity contracts.

These are models that we develop on a large amount of data that we collect from insurance companies, and then we allow these companies to visualize the model. So you can see, say, how withdrawal rates or lapse rates within your own company compare against the industry. That gives you a sense of the type of risks that you're insuring, and it informs some actuarial decisions to be made later.

For those types of use cases, where we're putting out the results of a business process, like actuarial modeling or policyholder behavior models, and letting actuaries play with that data through a dashboarding setup, Shiny has been quite helpful. We have tried other types of dashboarding platforms, but the flexibility that Shiny allows us is amazing.

One example: when companies want to look at their mortality data, they want to define their own custom buckets. Some companies want to define buckets of, say, 40 to 50 years old, while others want to study their assumptions in five-year buckets, like 40 to 45. Shiny allows us to set these things up programmatically in a way that's quite natural.

And in Shiny you can also generate UI components programmatically. We've found that functionality amazing compared to some of the other alternatives, such as Tableau and Power BI; I think there's more flexibility here.
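As a hedged sketch of that custom-bucket idea, here is a tiny app written with Shiny for Python for brevity (the production apps described above are R Shiny, but the same pattern applies): breakpoints typed by the user drive the age buckets used in the summary. The data and column names are invented.

```python
from shiny import App, render, ui
import pandas as pd
import numpy as np

# Hypothetical data: issue ages and a death indicator, made up for illustration.
rng = np.random.default_rng(1)
policies = pd.DataFrame({
    "age": rng.integers(20, 90, size=5_000),
    "death": rng.random(5_000) < 0.02,
})

app_ui = ui.page_fluid(
    ui.input_text("breaks", "Age bucket breakpoints (comma separated)",
                  value="40, 45, 50, 55, 60"),
    ui.output_table("mortality_by_bucket"),
)

def server(input, output, session):
    @render.table
    def mortality_by_bucket():
        # Parse whatever breakpoints the user typed into custom buckets.
        breaks = sorted(int(x) for x in input.breaks().split(",") if x.strip())
        buckets = pd.cut(policies["age"], bins=[0, *breaks, 120])
        out = policies.groupby(buckets, observed=True)["death"].mean().reset_index()
        return out.rename(columns={"age": "bucket", "death": "observed_mortality"})

app = App(app_ui, server)
```

The same effect in R Shiny would typically be achieved with `renderUI()` and dynamically generated inputs; the point is that the bucketing logic stays in code rather than being hard-wired into the dashboard.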

We have tried other types of dashboarding platforms, but the flexibility that Shiny allows us is amazing.

So yeah, to answer your question, Rachel, we've used Shiny a lot in displaying the results of our studies, but also in actuarial modeling. When an actuarial model runs, there are thousands of economic scenarios for what the macroeconomic situation might be, and for each of those scenarios there's a projection 30 years down the line. We've used Shiny to let actuaries visualize the results of those projections as well.

Statistics vs. data science in actuarial work

My question is actually kind of high level, it's a little bit broad, but your industry sits at a pretty good intersection, and given what you do, it's about the intersection of statistics and data science. As you know, for the last 20 years or so, anybody who's a data scientist probably came from another field. So are you finding that newly minted data scientists, the ones with data science degrees, have sufficiently strong backgrounds in statistics to keep up with actuaries? Or are you finding that the two fields are becoming distinct enough that it makes the search process for candidates a little bit different?

Yeah, that's a great question. Our team has both, actually. We have actuaries, folks whose professional training is in statistics, fitting distributions and statistical models. And our team also has data scientists who come from the other side of the lens: some of them studied the hard sciences and transitioned into data science, and some came through the university programs that, over the last 10 years, have built structured training around data science.

I would say those university programs, and I'm sure Posit has done a lot in terms of data science education, have really helped with the quality of candidates that we've been seeing. But going back to your question about the actual skill set, statistics, and how newer data scientists compare: I'd say actuaries traditionally spend their time with parametric models, things like traditional regression techniques.

One of the advancements that we saw in the policyholder behavior modeling space in the last year was when we started using penalized regressions. And you can imagine, penalized regression is as old as time; I don't know how long people have been using lasso regression and the like. But it is only recently that actuaries started using those types of techniques to study and quantify policyholder behavior risks. The actuarial profession, because of all the regulations and standards of practice surrounding it, is very slow moving.

I'd say the data scientists are more excited about, and bring more of, the machine learning side of things to the table, which has been quite useful for some of our work. For example, I talked about policyholder behavior models in the context of penalized regressions. A data scientist on our team took that and put it into a tree-based model, and that led to quite a few insights around which additional variables are important for predicting that outcome.
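To make that comparison concrete, here is a hedged, hypothetical Python sketch (the team's actual models and variables are not described in detail): a lasso-penalized logistic regression for interpretable lapse drivers alongside a gradient-boosted tree model whose feature importances can surface additional variables. The data and effect sizes are simulated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical policyholder features, invented for illustration.
rng = np.random.default_rng(0)
n = 20_000
X = pd.DataFrame({
    "attained_age": rng.integers(25, 85, n),
    "policy_duration": rng.integers(1, 30, n),
    "surrender_charge_pct": rng.uniform(0, 0.08, n),
    "moneyness": rng.normal(1.0, 0.2, n),  # account value / guarantee (made up)
})
# Toy lapse behavior: lapses rise as surrender charges roll off.
logit = -2.0 + 15 * (0.04 - X["surrender_charge_pct"]) + 0.5 * (X["moneyness"] - 1.0)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Penalized (lasso) logistic regression: interpretable, formula-like coefficients.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
).fit(X_train, y_train)

# Tree-based model: often surfaces additional important variables.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("lasso test accuracy:", round(lasso.score(X_test, y_test), 3))
print("gbm feature importances:",
      dict(zip(X.columns, gbm.feature_importances_.round(3))))
```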

So I'd say data scientists are faster moving and keep up with the latest trends in statistical and machine learning models, and actuaries are playing the catch-up game now.

Third-party data and vetting

An initiative that our team was working on in the past year: all the data that we have on how policyholders use their policies is longitudinal data, so for one person you have an entire history. We wanted to condense that data set down to one row for each policyholder and then try to develop a propensity-to-buy model. For example, one of the questions we want to answer is which types of customers are prone to buy GLWBs, policies with guaranteed lifetime withdrawal benefits.
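A minimal sketch of that condensing step, assuming hypothetical column names, might look like the following pandas aggregation from one row per policyholder per year down to one row per policyholder:

```python
import pandas as pd

# Hypothetical longitudinal data: one row per policyholder per year.
history = pd.DataFrame({
    "policy_id":   [1, 1, 1, 2, 2],
    "year":        [2020, 2021, 2022, 2021, 2022],
    "withdrawal":  [0.0, 500.0, 750.0, 0.0, 0.0],
    "bought_glwb": [0, 0, 1, 0, 0],
})

# Condense to one row per policyholder with summary features and the target.
per_policyholder = history.groupby("policy_id").agg(
    n_years=("year", "nunique"),
    total_withdrawals=("withdrawal", "sum"),
    ever_bought_glwb=("bought_glwb", "max"),
).reset_index()

print(per_policyholder)
```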

We want to understand those customers better, and in terms of answering those questions, the data that exists within the insurance company is limited. So we have to rely on third-party data to understand their behaviors better. Something that we have found useful is partnering with traditional data aggregators. There are data aggregator companies that work with data that is open out in the world, which they scrape, or they have partnerships with various data sellers.

And something we have found useful is that when we work with one data aggregator, we get to work with the various types of data sets they have. Some of these data sets might have restrictions on how we can use them. There's a lot of regulation in this space, and it's been growing as we speak. For example, we cannot really make an individual-level decision on who to send a marketing communication to; I think that's restricted by the CCPA, the California Consumer Privacy Act. So we do it at a group level.

But, going back to the question, the vetting process is largely through match rates. Some of these companies offer fuzzy matching: we can send them some anonymized information and they send us back match rates. As you can imagine, a lot of the time the names don't match and the addresses don't match, so we don't know if we're getting the right data or not. So there are various types of fuzzy matching algorithms that these data aggregators have, which we use to figure out how much matching we can get, in other words, whether we're getting useful data.
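For intuition, here is a toy, hypothetical version of a fuzzy match-rate check using only the Python standard library; real aggregators use much more sophisticated, proprietary matching logic:

```python
from difflib import SequenceMatcher

# Toy fuzzy match-rate check between our records and a vendor's records.
# Names, addresses, and the threshold are all invented for illustration.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

our_records = [
    ("John A Smith", "12 Oak St, Springfield IL"),
    ("Maria Gonzales", "99 Elm Ave, Chicago IL"),
]
vendor_records = [
    ("Jon Smith", "12 Oak Street, Springfield IL"),
    ("M. Gonzalez", "99 Elm Avenue, Chicago, IL"),
]

threshold = 0.8
matches = 0
for (name_a, addr_a), (name_b, addr_b) in zip(our_records, vendor_records):
    # Weight name and address similarity equally; the threshold is arbitrary.
    score = 0.5 * similarity(name_a, name_b) + 0.5 * similarity(addr_a, addr_b)
    matches += score >= threshold

print(f"match rate: {matches / len(our_records):.0%}")
```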

Milliman has a practice called IntelliScript, which is, I think, one of the largest underwriting data providers in the U.S. We have connections with various health data providers and a very elaborate pipeline to obtain that data, clean it, and make a product out of it. So yeah, being a data aggregator is a hard business. We have worked with a lot of data aggregators in the past; it is not easy to form those data relationships and get those data sets. But those who have those relationships have an edge in terms of the value that they can provide to insurance companies.

Deployment and release cycles

Because of the heavy regulation and data protection requirements, our security posture has to adhere to our contracts with the companies who give us data for studying policyholder behavior. Some of these companies are very large, brick-and-mortar insurance companies with very elaborate security contracts. So what we have done, and what we found useful, is that we started getting a SOC 2 Type 2 audit three or four years back, and as part of that audit we need to have a formal release process.

Our release process keeps evolving and changing according to our needs, but I can describe it. There's a development environment, traditionally something like Posit Connect or RStudio; it's not a dedicated server, it's on a shared Windows desktop machine. That's where development happens, and developers push their updates to GitHub. Traditionally there are branches for develop, beta, and then release candidate and prod.

And that's traditionally how things go. Once there's a push to the develop branch, there are rules that create pull requests so that another developer reviews the code, not by looking at it on the server itself, but by looking at the code that was added, and approves it. The develop branch can keep accumulating updates without being pushed to a release. Once we're ready to do a release, it goes to the beta stage.

On the beta server there are developer reviews, but there's also a formal review from the testing team, which runs the automated test suite. If it's a new feature, say we added a separate section of the web app in Shiny, there are no automated tests for that yet, so we end up doing manual tests there to see whether any other feature breaks and how the load is.

So that's the beta stage. Once we're fine with beta, we push to the release candidate. The release candidate is mainly to check business functionality, and load testing also happens at the release candidate stage. That then leads to a production deployment. At each of these stages, for our SOC 2 audit, we need to capture approvals and produce them for audits, so our auditors make sure that every code push and every release is thoroughly vetted and tested.

So that's typically how the release cycle goes. Now, in some situations we may end up doing a hot fix: that's a branch off of production, and we push directly to production and then bring that change back into develop. There can also be rapid releases; there are situations where we need to put something up quite soon, say for a conference or a call or a demo, and that has a separate release cycle. But in general, most updates go through this release cycle.

Load testing and Shiny in production

In terms of load: every release is coordinated by our SDET team through Azure DevOps. Code is pushed to GitHub and then automated deployment happens through Azure DevOps. Before we obtain final approval on the production deployment, our testing team does regression tests, integration tests, and a load test as well. For load tests, I do remember a package that was quite useful in load testing Shiny; I think it was shinyloadtest.

So shinyloadtest is something that we've used in the past. But at this point, our testers are probably using things like Selenium to run multiple hits on the site and see the responsiveness of the graphs and charts. For example, through one of these load tests we ended up optimizing the Snowflake queries that are used to generate those graphs. So load tests were quite useful there.

But we're also lucky because our applications are not used 24/7, so the load is not that much. We tend to see usage during an actuarial assumption setting cycle, which happens at certain times of the year. But during those times, yes, we can have five people using the site at the same time. So it helps to use things like shinyloadtest and frameworks like Selenium to automate load tests.
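A minimal, hypothetical version of that kind of Selenium-driven check might look like the sketch below. The app URL is made up, and a real test would wait on specific dashboard elements rather than just timing the initial page load (shinyloadtest also replays full Shiny sessions, which this does not).

```python
from concurrent.futures import ThreadPoolExecutor
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

APP_URL = "https://example.com/behavior-dashboard"  # hypothetical app URL

def one_session(user_id: int) -> float:
    """Open the app in a headless browser and time the initial page load."""
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        start = time.perf_counter()
        driver.get(APP_URL)
        return time.perf_counter() - start
    finally:
        driver.quit()

if __name__ == "__main__":
    # Simulate five concurrent users, roughly the peak described above.
    with ThreadPoolExecutor(max_workers=5) as pool:
        timings = list(pool.map(one_session, range(5)))
    print("page load times (s):", [round(t, 2) for t in timings])
```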

MLOps and model development

In terms of model development: back when we started developing these Shiny applications, we started following the agile methodology. I'd say we don't stick to it all the time, but as a general North Star, that's what we try to do: push updates faster, have incremental updates, and push hot fixes when needed.

But sometimes that doesn't align with the actuarial profession at all, because everything needs to be tested and vetted before we put it out. So sometimes the agile methodology turns into a waterfall process for our software development.

In terms of models, I wouldn't say what we follow is agile. There's a process where we try to improve our behavior models, and within that process there are various new variables that we want to try, to see whether they're good predictors. That's a cyclical process, and it doesn't quite align with the agile way of developing and deploying.

So we've tried to educate ourselves about modern machine learning operations, MLOps methodologies, and stick to them. For example, we've used something like MLflow; I think there's an internal proof of concept going on where part of our team is trying to use MLflow to track a lot of these experiments, track how adding these variables would affect, say, lapses or withdrawals, and then have a nice dashboard to compare the results of all of that experimentation.

Only after there's a decision on what the actual parameters should look like do we go to the next stage, which is the fitting and testing stage. At that point we also bring in a holdout data set to test the models. There is a testing data set that is used right on the website, so it's not part of the training; we let our customers see the results of the model on that testing data set when it is deployed.
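A bare-bones sketch of that experiment-tracking idea with MLflow, using simulated data and an invented experiment name, might look like this; each run logs a candidate parameter and its holdout metric so runs can be compared side by side in the MLflow UI:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated data standing in for policyholder behavior features.
X, y = make_classification(n_samples=10_000, n_features=8, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0
)

mlflow.set_experiment("lapse-model-experiments")  # hypothetical experiment name

for C in (0.01, 0.1, 1.0):
    with mlflow.start_run():
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
        # Log the candidate parameter and holdout metric for comparison later.
        mlflow.log_param("C", C)
        mlflow.log_metric("holdout_auc", auc)
```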

Innovation in the actuarial space

I'd reframe that statement and say there's a lot of potential scope for innovation in the actuarial space. I would still say it is a comparatively slower-moving space because of age-old practices, and also because of regulations; there are lots of standard operating procedures that actuaries need to follow. There are standards of practice for virtually every activity in the actuarial space, like reserving, valuations, and pricing, with very strict guidelines on what actuaries should and should not do. That restricts a lot of trying out of new methods.

But what we've been trying to do, in the space we're trying to get into, is use more modern methods and modern machine learning approaches for behavior modeling. We're opening up to using some of the modern data platforms, like Databricks and Snowflake, for actuarial modeling. I feel like the driver of innovation here is that this is a data-oriented space; there's a lot of data that actuaries work with, and as you can imagine, advancement in the data management and data science space is very rapid right now. So, as I said, actuaries are playing the catch-up game.

We're trying to modernize our methods and use more modern data platforms. I think the innovation starts from the data platform itself before we move to data science. A lot of these processes live in Excel files, which can become quite large and very hard to keep track of, to version control, et cetera. So in the life and annuity space, the innovation starts with moving away from those older Excel-file-based methods to more modern ones, leveraging the cloud, Jupyter notebooks, Python, and R.

I think the innovation starts from the data platform itself before we move to data science.

In the modeling space, if you move from data management to modeling, I feel like there's increasing innovation in using advanced methods, advanced deep learning methods, to study policyholder behavior using third-party data. And as you bring in more third-party data, there are various deep learning approaches; if you're using images or forms to inform your underwriting models, a lot of innovation is happening there.

So yeah, the other driver of innovation that I've been seeing is generative AI. At least in the traditional brick and mortar companies, a lot of companies want to leverage generative AI and they need to fix their data situation first to be able to use gen AI methods. So a lot of scope, a lot of innovation is happening there in terms of modernizing the data management platform.

Moving away from Excel

Excel is a great tool for quick-and-dirty ad hoc work, right? It's quick for proving concepts, and the flexibility it brings is kind of unparalleled. You don't have to do any setup, it's all self-contained, and it can handle a considerably large amount of data. It gives you the tools to do these things. But when you're dealing with very large amounts of data, or when you need governed processes, that's when we found there's a need to move away from Excel.

For example, a lot of companies use Excel for actuarial assumption setting. These are very large Excel files with hundreds of graphs and pivot tables in them. They're hard to keep track of, and sometimes it takes 15 minutes just to open one of these files. So there's a case to be made to move away from Excel, at least in situations where there's a large amount of data and there needs to be a governed process to deploy the results of whatever is being done in that Excel file.

I think that's been the driver of the movement away from Excel. But you'd be surprised at the very large number of companies still using Excel for some of these very large processes like assumption setting. As the data size increases, and as they're brought under scrutiny by regulators, that's the push to move away from Excel and use some of the more modern methods.

Career growth advice

Well, I think it was probably true 10 years ago, and it's probably true right now: there are lots of opportunities to make an impact within an organization if you're good with data. Something that I found useful for my own career mobility is to be flexible in terms of how to provide that business impact.