Hunter Owens | Tidy Transit: Real Life Data Modeling for Public Transportation | RStudio (2022)
California Integrated Travel Project’s mission is to make transit across California simpler and more affordable. As part of this, we created an open source data warehouse to allow easy analysis of the travel data people often interact with every day. In this talk we’ll discuss two big challenges we faced: (1) creating tidy representations of daily schedules and payments data across 200 transit agencies, and (2) enabling people with a range of backgrounds (R, SQL, and Python) and experience to quickly analyze the data. Tidy data allowed us to turn equal focus on agencies running a single bus, and those serving entire metropolitan areas. Session: Cat herding: solving big problems by bringing people together
Transcript
This transcript was generated automatically and may contain errors.
I am the Benefits, Delivery, and Decision Transformation Manager for the California Department of Transportation. Previously, I was the Data Science Manager for the City of Los Angeles, as well as at the Center for Data Science and Public Policy at the University of Chicago. I'm mostly about data, tacos, Los Angeles, sunset photos, some architecture, all fun times.
So here's the existential panic that wakes me up in the morning. 50% of greenhouse gas emissions in California are currently caused by transportation. Your state or country might be slightly different, but it's going to be in the same ballpark. If you've been paying attention to the GHG world at all over the last decade, and climate change and whatnot, you will know there's been a big transition: that was not true in 2010. In 2010, the biggest emitter in California was power generation. That sector is very quickly decarbonizing. You can thank wind and solar being really cheap.
So you're going, great, Hunter. Buy everybody an electric car. We got this solved, right? And the answer is you are wrong. I apologize, but if you look at the lifetime total carbon emissions of transportation, electric cars are only about 50% better. And maybe if heavy industry becomes greener, maybe they'll be about 75% better than, you know, conventional gasoline-driven cars. Totally fine. This happens. We have a plan for this, which is we have a net zero by 2050 plan in California. The idea is California needs to have a massive reduction in something called VMT, which you hear thrown around as jargon and which stands for vehicle miles traveled. The transportation world, much like the technology world, is filled with acronyms. I'm going to try to only use the tech acronyms today.
So that gets us there. So California, we wrote all these plans, and we established that we have to double our transit mode share in California by 2030, and then we need to 5X it by 2050. And you're like, great. How do I take the bus?
The transit landscape in California
So looking at this, I was actually going to ask, show of hands, who, like, regularly takes public transit? And this is, like, more than a lot of talks I've given. Who takes public transit and pays cash? Who has, like, some sort of weird smart card that they have to load? Okay. Mostly smart card users. Who gets to just use normal people money, like you would go to a coffee shop? All right. Who takes an agency that is, like, super tiny? Like anybody take Dial-A-Ride?
So long story short, California, you might think, is only three or four transit agencies. You got, like, LA Metro, you got BART. There are over 800 transit agencies, with over 200 agencies offering fixed-route bus service. So when you think about California, Cal-ITP was founded to answer this challenge, which is: I want to go from Santa Monica to Sacramento or wherever, and I want to be able to seamlessly plan, pay for, and execute my trip. We want to unify transit in California with a common fare payment system so you're not, like, juggling all those cards. Anybody who got a SmarTrip card for DC just for the week can deal with that. Make sure that you have realtime data on, like, where is the bus, and then make sure it's seamless to get verification of discounts and whatnot.
Making transit seamless
So the things that we really do to make transit seamless, and I promise I'm going to get to our stuff soon, unify the customer experience, make sure that we are prioritizing sustainable modes of transportation, and ground it in equity first. So how do you think about a transit trip? There are three parts of the trip, right? So the first thing is, can I plan and get to the bus or the train on time? The next thing is, once I get there, can I pay for it successfully? And then third, in the United States, we have a ton of different types of transit discounts and benefits. You might be a senior. You might be a student. Your employer might pay for your trips. There's all sorts of things.
If you think about California, it is big. There are tons of, like, legacy systems that you have to go interface with. So we have a philosophy of, like, transit technology needs to be like Legos. You can snap your fare payment system into your realtime passenger information system, into your benefit system, into all your different CAD/AVL subsystems and whatnot. We call a lot of this work on the data front mobility service data, which is all about providing... So now we're going to wonk out on trip planning: providing accurate and complete information for trip planning.
So you think, like, what is the vision for that? It's that you have a complete and accurate picture of what is going on at any of those 800 agencies in California. And then when I pull up my phone, whether I'm in Humboldt County in the far north of California, or Imperial County down near the U.S.-Mexico border, I can find where is any bus, where is any train, what are my transfers, how much am I going to pay? And fares should easily show up on Google Maps.
So the way we do this is we establish clear expectations. We assess every agency in the state using automated technologies to say, how well are you doing this? And then we provide technical assistance to go meet those goals. So what is the clear expectation? For California, for those of you who are transit nerds, you may have heard of something called the General Transit Feed Specification, formerly the Google Transit Feed Specification. This is the thing that means Google Maps, Apple Maps, Citymapper, Transit, and your R applications can see bus locations and bus schedules, both real-time and scheduled. And then on the back-end side, we also have a set of interoperability principles for the vendors, so that these different pieces of information work together.
Assessing progress, this is what's really fun. We take a look at all this data, and we produce in-depth assessments for each agency being like, oh, you didn't have trip information for 33% of your buses in the last year in real-time, something like that. We talk to them every year doing a collaborative list of stuff. We produce monthly public reports on public transit data quality and public transit quality, how many jobs are agencies serving, how many folks are doing. And we do this for every one of the agencies in the state that provides fixed-route bus service now.
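As a rough illustration of the kind of completeness check described above (what share of scheduled trips actually showed up in realtime data), here is a minimal Python sketch. It is not Cal-ITP's actual pipeline: the function name and the sample trip IDs are hypothetical, and it assumes you already have the `trip_id` values from a GTFS schedule feed and the `trip_id` values observed in GTFS Realtime trip updates.

```python
def realtime_coverage(scheduled_trip_ids, realtime_trip_ids):
    """Return the share (0.0 to 1.0) of scheduled trips seen in realtime data."""
    scheduled = set(scheduled_trip_ids)
    if not scheduled:
        # No scheduled service: report zero coverage rather than divide by zero.
        return 0.0
    covered = scheduled & set(realtime_trip_ids)
    return len(covered) / len(scheduled)


# Hypothetical sample: three scheduled trips, two of which appeared in realtime.
# "t9" is a realtime trip that is not in the schedule, so it is ignored.
scheduled = ["t1", "t2", "t3"]
seen_in_realtime = ["t1", "t3", "t9"]

coverage = realtime_coverage(scheduled, seen_in_realtime)
print(f"{coverage:.0%} of scheduled trips had realtime data")
```

An agency-level report would run a check like this per agency and per day, then flag the agencies whose coverage drops below an agreed threshold.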
And then we monitor this all nightly to make sure we catch it when things break. For example, right now the school year is ending. A lot of transit agencies are changing service. A lot of the smaller agencies are not telling Google that that service is changing, which means if you show up to the bus and they're using an old version of their schedule, you're going to miss your bus. So we really try to make sure that that stays up to date.
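One simple nightly check in this spirit is to look for published schedules that have run past their last service date. The sketch below is only an illustration under stated assumptions, not the monitoring Cal-ITP actually runs: it reads the `end_date` column of a GTFS `calendar.txt` file (a standard GTFS schedule file with dates in `YYYYMMDD` form), the function names and sample data are hypothetical, and it ignores service defined only in `calendar_dates.txt`.

```python
import csv
import io
from datetime import date, datetime


def latest_service_end(calendar_txt):
    """Return the latest end_date in a GTFS calendar.txt string, or None if empty."""
    rows = csv.DictReader(io.StringIO(calendar_txt))
    ends = [datetime.strptime(row["end_date"], "%Y%m%d").date() for row in rows]
    return max(ends) if ends else None


def is_schedule_expired(calendar_txt, today):
    """True if no service pattern in calendar.txt is still valid on `today`."""
    end = latest_service_end(calendar_txt)
    return end is None or end < today


# Hypothetical calendar.txt for a small agency whose school-year service has ended.
SAMPLE_CALENDAR = """service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
school,1,1,1,1,1,0,0,20210816,20220603
"""

print(is_schedule_expired(SAMPLE_CALENDAR, date(2022, 7, 26)))
```

A feed that trips this check is a candidate for a help desk follow-up, since riders planning trips against it may be seeing service that no longer runs.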
And then, finally, we do a lot of technical assistance work across our whole program. We have a transit data help desk, you know, which has been really focused on getting demand-responsive transit into GTFS. That's stuff like dial-a-ride, paratransit, whatnot. We've built open-source hardware and software to produce GTFS real-time to help small agencies in rural areas to go do it who can't afford more complex systems. We've produced playbooks and documented transit stacks for different agencies, and we do interoperable development work.
Paying for transit
All right. So magic universe, we've done all this work. You've now shown up. The bus is here. What's about to happen? We need a new way to pay for transit. So, you know, from what we say, paying for transit should be as easy as paying for coffee. Customers can instantly pay, tap their bank card, tap their Apple Pay, Google Pay, their mobile wallet and whatnot. In order to do this, you need three pieces of technology. One is called a fare validator, second thing is called a fare calculation software, third thing is called a payment processor.
One of the things I love about Cal-ITP is we hate inventing stuff, so we really try to leverage global standards. So this is how the banking system works. I did not know anything about how banks worked before I started this job, and now I know perhaps a little bit too much, so... But rather than try to invent our own custom flavor of how do cards work, how do we take money, do we need to set up TVMs or retail networks, we just said, how can we make sure that banks and, you know, standard credit and debit cards can work?
So those three pieces of technology that I mentioned are over here, and they talk to the provider's bank (the state of California has banking relationships), and those go talk to Mastercard and Visa, which talk to your customer's bank. All of this happens basically magically. This is called the cEMV standard, and any transit agency in the country can buy this. We've done standardized procurement. Anybody who's a government person knows procurement is the worst thing about government, absolutely. Maybe hiring is worse, but pretty bad.
So we have what we call the mobility marketplace for purchasing these types of solutions, and it's prenegotiated, it's like a bench contract, really easy. And then on the data side, what my team can do is this: because that data is standardized (the types of outputs you get from cEMV banks are documented; yes, you have to transmit all the data over SFTP, but it is standardized and it comes in every day), we can build merchant services for these transit agencies that help them understand their revenue per day, what type of fares they're getting, and whether people are paying with cards or digital wallets more often. That's really focused on the transit agency's needs, joining that data with their GTFS data upstream to say, like, this route is doing more rides, and whatnot.
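To make the merchant-services idea concrete, here is a minimal Python sketch of the three summaries mentioned: revenue per day, card versus digital wallet share, and rides per route. Everything about the data shape is an assumption for illustration: the field names (`date`, `route_id`, `amount`, `form_factor`) and the sample records are hypothetical, and real processor exports will use their own schema.

```python
from collections import Counter, defaultdict
from decimal import Decimal

# Hypothetical settled-transaction records (not a real processor export format).
TRANSACTIONS = [
    {"date": "2022-07-01", "route_id": "44", "amount": "1.75", "form_factor": "card"},
    {"date": "2022-07-01", "route_id": "44", "amount": "1.75", "form_factor": "wallet"},
    {"date": "2022-07-01", "route_id": "7", "amount": "0.85", "form_factor": "wallet"},
    {"date": "2022-07-02", "route_id": "7", "amount": "1.75", "form_factor": "card"},
]


def revenue_per_day(transactions):
    """Sum fare revenue by service date, using Decimal to avoid float rounding."""
    totals = defaultdict(Decimal)
    for txn in transactions:
        totals[txn["date"]] += Decimal(txn["amount"])
    return dict(totals)


def wallet_share(transactions):
    """Share of taps made with a digital wallet rather than a physical card."""
    if not transactions:
        return 0.0
    wallets = sum(1 for txn in transactions if txn["form_factor"] == "wallet")
    return wallets / len(transactions)


def rides_per_route(transactions):
    """Count taps per route_id; route_id is the key you would join to GTFS routes."""
    return Counter(txn["route_id"] for txn in transactions)


print(revenue_per_day(TRANSACTIONS))
print(wallet_share(TRANSACTIONS))
print(rides_per_route(TRANSACTIONS))
```

Because `route_id` is also the key in a GTFS `routes.txt` file, the per-route counts can be joined to schedule data to compare ridership against the service actually offered.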
Core data sets and analytics
So we have 800 transit agencies we're stewarding. That is too many to manage ETL processes for each of those agencies. So what we focus on is what we call our core data sets, which is we use these three data standards to really drive our analytics processes. So GTFS schedule, we do that every night. We do all sorts of different output analysis from that data. GTFS realtime, again, this is just, like, where is the bus and when is it going to show up at the stop? We get that data every 20 seconds. This actually gets to be big data, because you start talking about the bus location of every bus in California every 20 seconds. You can use that to do, like, level of delay analysis, speed analysis. How can you optimize where to place potential infrastructure investments? And then finally, our third standard, EMV payments: two days after the transaction is when it comes through. And that powers, you know, both our payments dashboard and a lot of, like, traditional reporting type work, if you're familiar with the National Transit Database and stuff.
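A quick back-of-the-envelope calculation shows why 20-second polling adds up. The 20-second interval comes from the talk; the vehicle count below is a purely hypothetical round number chosen for illustration, not a figure for California's fleet.

```python
SECONDS_PER_DAY = 24 * 60 * 60
POLL_INTERVAL_SECONDS = 20   # polling interval mentioned in the talk
ASSUMED_VEHICLES = 10_000    # hypothetical fleet size, for illustration only

# Number of snapshots collected per vehicle over a full day.
polls_per_day = SECONDS_PER_DAY // POLL_INTERVAL_SECONDS

# Upper bound on position rows per day; real volume is lower because
# most buses are not in service around the clock.
rows_per_day = polls_per_day * ASSUMED_VEHICLES

print(polls_per_day)
print(rows_per_day)
```

Under these assumptions that is 4,320 snapshots per vehicle and up to 43.2 million position rows per day, before counting trip updates.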
So how do we do this? Again, standard, standard, standard: simplify, share your resources. So for batch processing and data warehousing and shared compute, we have a, you know, Airflow, Kubernetes, and BigQuery sort of base stack. All of our code and documentation is open source, so I encourage you to go peruse it. When it comes to analysis, we have a JupyterHub. We've been pretty heavy users of siuba on top of BigQuery, which is a dplyr port for Python. Highly recommend checking it out, especially if you're there. And then finally, for reporting and presentation, we haven't adopted Quarto just yet, but we do things like put together all these sort of different reports, some of which are more lightweight, some more complex.
This is a whole website that is generated out of a Jupyter notebook using siuba to give you GTFS quality reports for all these agencies. So, again, the web is a little slow, but you can see, let's pick on Big Blue Bus, that's Santa Monica, California, and you get all this sort of information. What's really cool about this is you do not need an analyst or a web developer to spin it up. This is entirely generated out of a Python notebook that is linked in there.
This is just some of the elements you get in schedule. You can do these statewide analyses of what is a high-quality transit corridor. Again, this is looking at all of Los Angeles. This is something the state did not do before, and the reason we were able to do this, and if you take one takeaway from this talk, it's that data standards, especially if you work in a distributed environment, are your best friend, even if they're not perfect data or exactly the right data.
When you start digging in to GTFS-RT, there's trip updates and vehicle positions, which are exactly what they say they are. Again, we kind of do this over and over again, so you can see here live, or formerly live and now screenshotted, bus speeds for Los Angeles, Modesto, and Oakland. Suffice to say there's a lot of traffic, but this has really been transformative in how Caltrans does its investments, because we can really now focus in on this challenge of bus speeds.
Some more fun: here's a chart of speed variability for an individual bus route. You can see that between certain stop pair combinations, buses have averaged as low as 10 miles an hour and as high as 35 miles an hour, which definitely means the traffic engineers have their work cut out for them. You can use it for evaluation of, like, okay, here's exactly where you need to drill down, and you can go from this big, big global Google Maps, Apple Maps view down to "route 44 westbound in the afternoon peak is 5.5 miles an hour" on this one thing, and we do it for every agency in the state.
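The basic arithmetic behind a speed estimate like this can be sketched from consecutive vehicle positions. This is a simplified illustration, not the method behind the published speed maps: it assumes each position is a hypothetical `(unix_timestamp, latitude, longitude)` tuple, and it measures straight-line (great-circle) distance, which understates the distance traveled along a curving route.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def segment_speeds_mph(positions):
    """Speeds in mph between consecutive (unix_ts, lat, lon) points sorted by time."""
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(positions, positions[1:]):
        hours = (t1 - t0) / 3600
        if hours <= 0:
            # Skip duplicate or out-of-order timestamps.
            continue
        speeds.append(haversine_miles(lat0, lon0, lat1, lon1) / hours)
    return speeds


# Hypothetical pings 20 seconds apart: stopped for one interval, then moving north.
pings = [(0, 34.000, -118.0), (20, 34.000, -118.0), (40, 34.001, -118.0)]
print(segment_speeds_mph(pings))
```

Averaging these per-segment speeds by route, direction, and time of day gives the kind of drill-down described above, such as one route's westbound afternoon peak.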
So all of this is public and online. I think I have the analysis site pulled up, so you can go browse these speed maps. We publish this at our analysis site. All of these are linked in the chat. Here's Sacramento. Again, you can kind of drill in. All just static maps, which I love, because we have no hosting stuff.
And that is Cal-ITP. That is, like, using data to really drive transformative transportation investment and operations. I'll take any questions you might have. I realize I rushed through a lot of that.