Novice to data scientist: a pediatric anesthesiologist uses RStudio to help kids access surgical care
When a surgical procedure gets canceled, a child gains no health benefit, families' time off work and pre-op anxiety are in vain, and our not-for-profit children's hospital loses ~$1 per second. To understand cancellation, I needed to analyze thousands of patient records. Despite zero formal training, I learned to tidy and then visualize data, and even to do geocoding and machine learning. Once we identified children at high risk, we could target additional support to their families. Furthermore, we showed that surgery cancellation contributes to health inequality. The RStudio/tidyverse ecosystem allows novices to do sophisticated analytics, and is helping us improve access to health care for the most disadvantaged children in our communities.

Talk by Nick Pratap

Full title: Novice to data scientist: how a pediatric anesthesiologist used RStudio to help disadvantaged kids access surgical care
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name's Nick Pratap. I'm a pediatric anesthesiologist, taking care of kids having procedures done to their hearts. So you might ask, what am I doing standing in front of you all here today at posit::conf? And it's a good question. I'd like to tell you today about my self-guided journey into data science. I'd like to give a shout-out to some influences and sources of help that I've had along the way. I'd also like to offer a little bit of encouragement to anybody who's on a similar path to me, or perhaps to anybody who knows or comes across somebody on that kind of a path.
The problem of surgery cancellation
So the backstory is that in 2011, I came from London to Cincinnati to learn about quality improvement in healthcare at Cincinnati Children's Hospital, which is renowned for that kind of work. They asked me to work on surgery cancellation. I quickly found that about 5% of procedures were cancelled on the day of surgery, and this equated to about 5% of operating time as well, which is important because a lot of the billing is based on time. Through this, we found that the problem of surgery cancellation probably cost about $100,000 a week in potential lost revenue, which is a lot of money. But taking a more human view, it was really difficult and frustrating for families and for staff as well.
I'd just like to take a moment to put surgery cancellation in terms of the Institute of Medicine's dimensions of quality, the framework people often use when they talk about healthcare quality. I didn't know until I came across this paper that about 30% of the global burden of disease needs surgery or other procedures under anaesthesia, either for diagnosis or for treatment. So clearly, if surgery is cancelled, the effectiveness of that procedure is not going to be realised. I touched already on family-centredness, but I'd like to mention a paper from more than 20 years ago from another major children's hospital. They found that many families drive tens or even hundreds of miles to get to their local children's hospital. They often have to pay for accommodation, and they have to get childcare for their other children.
Mostly, they're taking time off work and more often than not, that's unpaid. So when the surgery is cancelled, it's probably not surprising that they're quite emotional and even angry on many occasions. Timeliness, clearly, if the surgery doesn't go ahead and has to be rescheduled, then it's not being provided in a timely fashion. Efficiency, I already touched on that, and the other one that they always talk about is equitability, and at the point when we started this, we really didn't know what the impact was.
Why children's surgery gets cancelled
So why does children's surgery get cancelled? The number one cause is patient illness. So with the kids, it's usually coughs, colds, diarrhoea, vomiting, things that give you a fever, which means it's not safe to go ahead with the procedure. The second cause is no-show. They don't turn up. The third most frequent we call NPO violation. So in order to have safe anaesthesia, you need to have an empty stomach before we begin, and so we give instructions about when they can safely eat and drink and when they have to stop, and if those are not followed, then we can't go ahead with the procedure.
So we did a few across-the-board interventions to try and reduce our surgery cancellation rate. We came up with some new, more colourful instructions for families, with pictures showing what was okay to eat and drink at particular times, and when to show up. We sent out text message reminders. And finally, we set up a process so that if the pre-op nurses, during a phone call a couple of days beforehand, had any inkling that the kid was sick, they could discuss it with our nurse practitioners and try to avoid a day-of-surgery cancellation by rescheduling in advance.
So we tested all these interventions and then rolled them out across the hospital. We were pretty happy that we succeeded in saving over an hour a day, which was about a 17% reduction, and that was good. But I'm a perfectionist, and so I wanted to work out why people were still cancelling. I was also really curious about no-shows, the second most common cause. These were families who didn't arrive and didn't answer the phone when we tried to get hold of them, so we really didn't know what was going on.
Finding data and early influences
A lot of my colleagues had opinions. One of the myths I was told a lot was that dental patients are the worst for not showing up. That was kind of interesting, and actually kind of relevant, because if there was some problem in the dental surgery department, that they really weren't preparing the families properly or giving them the right instructions, we ought to be able to learn from another department that was informing their families better, and therefore be able to help with this situation. I didn't know what to do, but I did feel that we needed data. W. Edwards Deming is the guru of quality improvement. He said, "In God we trust; all others must bring data." So he's the first of my influences.
But what sort of data? Please don't try to read my red slides; they're just to show you the morass of overwhelming information we were trying to deal with. We brainstormed the things in the electronic medical record that could potentially be associated with cancellation, and we came up with a lot, as you can see. We even looked outside the medical record: we wondered about the weather, so we got weather data from the local airport, which keeps good records. I didn't know what to do with all this; it was too much. You may notice my pictures so far have all been in Excel. Microsoft Excel was not going to solve this problem for me.
I had a little bit of experience with computers, back from when I was nine. My father bought me a computer with one kilobyte of memory, and I learned to program in BASIC. From that I have an appreciation of concise and efficient code, but it really was not going to help me too much at this point. Fortunately, I found somebody who was able to help me, or at least his writing. The idea of tidy data was really helpful to me, and I started trying to wrangle some of the data we were getting. This was a long time ago, so I was using plyr, and later on the tidyverse. I didn't use R for Data Science because it hadn't been written then, but I would strongly recommend it to anybody at this point now; it's an amazing book.
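To give a flavour of what that kind of tidy-data wrangling looks like, here is a minimal sketch with invented, hypothetical case-level data (one row per scheduled procedure; all column names and values are made up for illustration, not the hospital's actual schema):

```r
library(dplyr)

# Hypothetical one-row-per-case data
cases <- tibble::tribble(
  ~case_id, ~service,  ~status,
  1,        "dental",  "completed",
  2,        "dental",  "cancelled",
  3,        "cardiac", "completed",
  4,        "cardiac", "completed"
)

# A typical tidy summary: one row per service, with a cancellation rate
cancel_rates <- cases |>
  group_by(service) |>
  summarise(n = n(),
            cancelled = sum(status == "cancelled"),
            rate = cancelled / n)
```

With data in this shape, questions like "which services cancel most?" become one short pipeline rather than an Excel exercise.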
Exploratory data analysis
So, I started to do what I now know would be called exploratory data analysis. This is ggplot; I've graduated a little bit further forward. I wanted to look at cancellations over time. Was there any seasonality to the cancellation rate? It seemed that there was, because we've got some variation there. Why could that be? Well, I mentioned patient illness as the most important cause, so I got some data from the hospital's laboratory databases about the tests that were positive for some of the infections we know kids get, viral-type stuff, and there was also some variation going on, and at a quick look it seemed kind of similar.
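A seasonality plot of this kind can be sketched in ggplot2 roughly as follows. The numbers here are invented (a sine wave around a 5% baseline, purely to make the plot render); the real analysis used the hospital's own monthly data:

```r
library(ggplot2)

# Illustrative monthly cancellation rates (invented data)
monthly <- tibble::tibble(
  month = seq(as.Date("2013-01-01"), by = "month", length.out = 24),
  rate  = 0.05 + 0.01 * sin(2 * pi * (1:24) / 12)
)

ggplot(monthly, aes(month, rate)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL,
       y = "Day-of-surgery cancellation rate",
       title = "Cancellations over time (illustrative data)")
```

Overlaying a second series, such as positive viral test counts, on the same time axis is how one would eyeball the "kind of similar" variation described above.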
So, I started to realize I was going to need to model cancellations. Just to go back briefly to quality improvement: we know that when you're able to predict something, you might be able to prevent it. In this case, we thought we could target support to at-risk families if we knew who was at risk, and we did a little bit of psychological research to work out what sort of interventions we could try. We also felt that prediction might help us with mitigation: if we had an idea in advance that particular OR slots might open up because families wouldn't show up, then we could slip in some short-notice emergency cases. I also hoped that I might be able to understand cancellation by developing a model for it.
Machine learning and a key influence
So, at this point, the most important influence, or influencer, I should say, came to me. It was a very hot August Sunday. My daughter was just starting kindergarten, and they had an event where the families could get together and meet each other, and I met Emily's mum, the mother of one of the other kids in my daughter's class. We got chatting, and it turned out that she had done a PhD in physics, but was now taking the knowledge of machine learning she had gained from looking at galaxies and the like into healthcare. I told her about the problem I was working on, and she said, you know, you need to do machine learning. I thought, well, I don't know. She said, you're a smart guy. You can do this.
So, I thought, okay, and I did start to do a bit of reading, and I realized that patient-level cancellation prediction is a machine learning problem. She was right. It was supervised learning, with a binary classification of cancelled or completed. I realized there were some specific challenges. One of them, like I told you, is that with a four-or-five-percent cancellation rate there's some class imbalance. Anyway, at this point I was fortunate to come across the Applied Predictive Modeling book by Max Kuhn, and through that, the caret package. I wish tidymodels had been available in those days, but that really got me started, and I was able to actually do some machine learning.
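As a rough sketch of how caret handles this kind of imbalanced binary classification: the frame `train_df` and its `outcome` column are hypothetical (the talk doesn't show the actual features or name the model; `glmnet` here is just one reasonable choice), but the `trainControl()` options shown, including down-sampling the majority class, are real caret machinery:

```r
library(caret)

# Assumed input: train_df with a factor column `outcome`
# (levels "cancelled"/"completed") plus predictor columns.
ctrl <- trainControl(
  method = "cv", number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,  # report ROC AUC, sensitivity, specificity
  sampling = "down"                   # down-sample the majority class
)                                     # to address the ~5% class imbalance

fit <- train(
  outcome ~ ., data = train_df,
  method = "glmnet",   # illustrative model choice, not from the talk
  metric = "ROC",
  trControl = ctrl
)
```

Resampled-with-rebalancing cross-validation like this is a standard way to get honest performance estimates when only one case in twenty belongs to the positive class.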
Key findings
So, we wanted to find some actionable insights from this patient-level prediction. The first thing I discovered was that it was actually good enough for risk stratification. We got pretty good models, especially for the second and third causes, no-show and NPO violation. Patient illness was rather more difficult to predict, and perhaps that's unsurprising, because kids do get sick. We found that the strongest predictor was whether the patient had cancelled surgery before. Importantly, we found that kids who came from socioeconomically disadvantaged backgrounds were at the highest risk of cancellation, and so, going back to the Institute of Medicine's dimensions of quality, it turns out that surgery cancellation is an important contributor to disparities in access to surgical care.
In the Cincinnati area, interestingly, this is confounded by race, a point I want to come back to in a moment. What was very important to discover is that cancellation wasn't really the fault of any one particular surgical specialty as such; it was really down to the demographics of their patient populations. This enabled me to go back to the cancellation myth that dental patients are the worst for not showing up. Well, they are, in the sense of their backgrounds, but it's not something that the dental service is responsible for.
Geospatial analysis and busting myths
And this takes me on to another potential myth: it's always patients from northern Kentucky. Now, if you don't know the area of Cincinnati (I didn't when I moved there from London), this is a Google map. You can see the hospital in the middle. Although the city of Cincinnati itself is squarely within Ohio, a good proportion of the greater Cincinnati region, just below the river at the bottom there, is in northern Kentucky, so that's what we're talking about. Now we come to another important influence for me. John Snow, a fellow Brit, was an amazing man who pioneered two things that are now quite close to my heart: first, anesthesia safety, and second, epidemiology. You might have heard the story of how he mapped cholera cases in London, identified that one particular well was at the centre of where the cases were, they removed the handle from the well, and the cholera cases went away.
So, I wanted to do some plotting too. I started plotting cancellation rates by zip code, and you can see the highest rates in red here. So, yes, there is an area of northern Kentucky that's a hot spot, but there are actually hot spots in other parts as well. So that was another myth we were able to bust.
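For readers wondering what such a choropleth looks like in code: the original analysis used the older sp package (as mentioned in the Q&A), but with today's sf it might be sketched like this. Both inputs are assumptions here: `zips`, an sf object of ZIP-code polygons with a `zip` column, and `rates`, a data frame of cancellation rates per ZIP:

```r
library(sf)
library(dplyr)
library(ggplot2)

# Assumed inputs: `zips` (sf polygons, column `zip`) and
# `rates` (data frame with columns `zip` and `rate`)
zip_map <- zips |>
  left_join(rates, by = "zip")

ggplot(zip_map) +
  geom_sf(aes(fill = rate), colour = "grey80", linewidth = 0.1) +
  scale_fill_gradient(low = "white", high = "red",
                      labels = scales::percent) +
  labs(fill = "Cancellation\nrate")
```

The white-to-red gradient reproduces the "highest rates in red" convention from the slide.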
Moving on a little further, I started taking a slightly different approach. I've got the same data, effectively, in the left-hand panel here, but this time I've moved to looking at something more granular than zip codes, because I realised zip codes were actually not a good idea. You can see there is a heavy cluster of cancellation in the middle of the panel. We also wanted to look at some things we'd used before from the medical record: race in the middle panel, and who was paying for the surgery, a marker of socioeconomic disadvantage, on the right-hand side. What was interesting is that there was more similarity between race and cancellations than there was with poverty as such.
Now, I wanted to know more, and I couldn't get any more data out of the medical record. So, like our first speaker today, I used tidycensus to pull a lot of data from the American Community Survey and was able to find out a lot more about the neighbourhoods in which families live. From this, we were able to set up some machine learning models of cancellation rate in a geospatial sense. The areas in red hatching here are where the middle model gives a pretty good prediction, and basically that covers most of the densely populated area of Cincinnati. From this, we were able to find out a whole lot more about the neighbourhoods where patients with high cancellation rates were living.
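A minimal tidycensus sketch of this kind of ACS pull might look as follows. The variable codes are real ACS table codes (B19013_001 is median household income; B17001_002 is the count below the poverty level), but the particular selection, geography, and year are illustrative assumptions, not necessarily what the talk's analysis used. Note that tidycensus requires a free Census API key (`census_api_key()`):

```r
library(tidycensus)

# Assumes a Census API key has been set with census_api_key()
acs <- get_acs(
  geography = "tract",
  state = "OH", county = "Hamilton",   # the county containing Cincinnati
  variables = c(median_income = "B19013_001",  # median household income
                below_poverty = "B17001_002"), # count below poverty level
  year = 2019,
  geometry = TRUE   # also return sf polygons, ready for mapping or
)                   # joining to geocoded patient locations
```

Because `geometry = TRUE` returns sf polygons, the neighbourhood-level variables can be joined straight onto geocoded patient data for the geospatial models described above.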
So, first of all, we did find that the American Community Survey data helped us, adding information that wasn't available in the patient chart. We found that affected families tend to live in dense urban neighbourhoods with low literacy levels. They're often very close to the hospital campus, so offering a ride could be quite cost-effective, because it would only be for a mile or two. We found that African-American families were disproportionately affected, which seemed to fit with what we found before, and we realised that we could potentially improve the cultural competency of our communications with Black families. We also found that our communications with non-native speakers of English were actually not a great problem in the Cincinnati area, so our interpreting service was doing a great job.
Conclusion and reflections
So, just to conclude: on a whim, in I think 2013, I downloaded RStudio because there was no financial barrier to entry. A lot of things have happened since. I was lucky to get a federal R grant, which academics know is very sought after. I became a co-advisor to a medical informatics PhD student. I developed and disseminated an understanding of children's surgery cancellation, which I've given you some highlights of just now, and got a couple of publications out of it. I wanted to take this further and actually reduce cancellation rates, and I submitted a number of applications, but unfortunately they were not successful, and then COVID happened.
For family reasons, I had to move to the East Coast, but by this point I'd developed quite a lot of proficiency in R and in data science techniques, and I was lucky to get a faculty position at the University of Pennsylvania and the Children's Hospital of Philadelphia, where I am continuing research using data science in some different areas now. Finally, some shout-outs and pointers. Along my journey, Google has been my constant guide, and early on Stack Overflow was often where I was directed. R-bloggers and various other blog sites have been incredibly helpful, and I've got here two people, now Posit employees, whose blogs I've followed quite a lot in the past. Increasingly over time, I found more and more online books through Bookdown, and GitHub was also incredibly helpful to me. I hope that will be helpful to somebody else in the future.
We have time for a few questions. What packages did you use for that hotspotting graphic that you showed? That was old, so I think I used sp. Now everything I do geospatially is with sf. Was it difficult getting the data you needed to answer these medical questions? Were there HIPAA concerns? There were definitely HIPAA concerns. For all the geocoding that I showed you, I actually had to set things up myself: I learned a little bit of Ruby just to set up a geocoding package, so I didn't have to send data out to Google.
And then someone said: it looks like a tremendous amount of learning and work. How long did it take for you to get the results that you presented today? That's really about seven years' worth, seven or eight years' worth. I think that's all the time we have for questions. Thank you so much.