
Rodrigo Silva Ferreira - When Rivers Speak: Analyzing Massive Water Quality Datasets - PyData Boston 2025
Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data. Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly. In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll demonstrate that, while Parquet and DuckDB enable scalable exploration of historical records, Remote SSH is paramount for enabling truly large-scale analysis. Along the way, I hope to answer analytical questions that can surface patterns linked to industrial growth, regulatory shifts, and climate change. By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation. Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk will offer both technical methods and sociological lenses for understanding the stories rivers continue to tell.
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone, welcome to an exciting talk. Today we'll be discussing When Rivers Speak: Analyzing Massive Water Quality Datasets Using Remote SSH in Positron. Rodrigo Silva Ferreira is a QA Engineer at Posit, so stay tuned and follow this exciting talk.
Thank you very much. And yeah, so my talk today is When Rivers Speak, Analyzing Massive Water Quality Datasets Using Remote SSH in Positron. My name is Rodrigo. A little bit about myself, just to start with. I'm a QA Engineer at Posit, and I'm originally from Salvador, Brazil. That's my city, by the way. But I'm based in Pittsburgh.
My background is actually in chemistry, so it's very exciting actually to be at PyData Boston with the Python community, because then I can get to learn a lot from the different talks and so on from everyone. My main interests lie at the intersection of statistics, data, and society, especially with how to use data and statistics to understand and improve the world. That's the overarching theme of many of my talks, and that's related to what I'll be presenting today.
Throughout the talk, if anyone has any questions, please feel free to interrupt. I really don't mind.
Overview of the presentation
So just an overview of the presentation: I'll start by talking about rivers as storytellers of human history. Then I'll dive into the U.S. Geological Survey sensor network, and then we'll talk about the technical aspects of how to take advantage of all the massive data that's available through APIs. Then we'll look at Remote SSH in Positron as a potential solution for capturing and making sense of all this data. And in conclusion, I'll wrap up with how data science can be a tool for a better world.
Rivers as storytellers
So beginning with rivers as storytellers, that's something I became very interested in as a chemist who is also interested in the statistics side of things. This is a map of all the rivers in the U.S. that someone built using open source data from the National Hydrography Dataset.
And in terms of rivers, what really makes me very fascinated about them is their ability to record several dimensions of human history from migration patterns to industrialization, ecological resilience, inequality, neglect, and environmental deregulation or regulation and so on. So just as an example, that's a historical map of the Delaware River showcasing the route that George Washington's troops went through.
So that idea of rivers as storytellers of societies and of human history has always interested me. And that's part of the reason why I decided to talk about this today.
And more particularly, along with that theme of rivers as storytellers, I was always fascinated by the amount of data that's produced from these rivers, especially here in the United States. I live in Pittsburgh, and Pittsburgh has essentially the beginning of the Ohio River: two rivers, the Allegheny and the Monongahela, come together to form the Ohio River. And the Ohio River has actually been used for a lot of mathematical modeling of pollution patterns in rivers, unfortunately because of how much pollution goes into it.
So that's what I was interested in as well: the idea of capturing data from the U.S. Geological Survey in order to try to understand patterns in those rivers. And in terms of the size of that network, the United States has one of the largest environmental monitoring systems in the world, with over 2.3 million sites spread all over the country. And many of those sites are capturing data every 15 minutes.
The station at the beginning of the Ohio River, for example, captures a whole set of water quality parameters every 15 minutes. And there is an API streaming that data as we speak, leading to hundreds of millions of measurements across several years of data.
Which leads to two questions. Why should we care about river data? As well as how can we listen to these massive amounts of data that rivers have been sharing?
Water quality data reveals a lot of things that are of societal concern, including shifts in regulation, political priorities, economic development, environmental justice, and climate transformations. In that sense, the data is not just numbers; it is social history, encoding many aspects of our societies.
And in terms of how can we listen to the data that rivers are sharing, that is a challenge. Because both in terms of computation, as well as storage, and so on, it's difficult to be able to capture the magnitude of that data and to be able to process it and analyze it and so on. So that's what made me really intrigued by that problem.
Technical overview
Which leads us to the technical overview of what we'll do today. The first step was to fetch a national site index. That site index contains the identifier assigned to each of the sites spread all over the United States. And the reason we have to be very strategic about fetching is that those identifiers come bundled with a massive amount of data.
A lot of those APIs are quite old, so it's difficult to process all of that and maintain the correlation between each identifier and each site. There's no readily accessible list of all the numbers and all the sites; you really have to make calls to the USGS Water Services API.
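As a rough sketch of that first step, the snippet below builds a per-state query against the legacy USGS site service and parses its tab-delimited "RDB" output. The endpoint URL and parameter names are written from memory and should be checked against the USGS documentation, especially given the upcoming deprecations.

```python
from urllib.parse import urlencode

# Legacy USGS Water Services "Site" endpoint (assumed here; verify
# against the USGS docs, since these endpoints are slated for deprecation).
SITE_SERVICE = "https://waterservices.usgs.gov/nwis/site/"

def build_site_url(state_code: str) -> str:
    """Build a query URL listing monitoring sites for one US state."""
    params = {
        "format": "rdb",       # tab-delimited "RDB" text format
        "stateCd": state_code, # two-letter state code, e.g. "PA"
    }
    return SITE_SERVICE + "?" + urlencode(params)

def parse_rdb(text: str) -> list[dict]:
    """Parse USGS RDB output: '#' comment lines, a header row, a
    column-format row (e.g. '8s 50s'), then tab-separated data rows."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    header = lines[0].split("\t")
    return [dict(zip(header, row.split("\t"))) for row in lines[2:]]
```

Fetching `build_site_url("PA")` and feeding the response body to `parse_rdb` would yield one dict per site, which is the site-to-identifier correlation the talk describes.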
And then we'll go through the async ingestion of three years of data. We are only doing three years because three years is already massive; we could go to 30 years of data and see what happens. One caveat, though, is that data for some stations only became available recently. That's the tricky part: there could be a high level of missingness at some stations.
That's the third topic, scaling to 30 years, followed by structuring the big data with Parquet and DuckDB, and finally interpreting environmental time series as sociological and historical evidence.
So the question was where the data lives. There's the USGS Water Services API that you can make calls to, and at many stations that data is being streamed every 15 minutes. If anyone wants to explore this data, just a heads up that I think in January 2026 a lot of the API endpoints are being deprecated, which makes the whole idea of exploring those datasets extremely relevant and pressing in terms of time.
Water quality parameters
So in terms of temperature, very high values can stress aquatic life, and very low values can slow the metabolism of the organisms living there. Specific conductance is typically correlated with clean or polluted water: low values typically indicate that the water is clean, while high values could indicate pollution from industry and so on. Dissolved oxygen is one of the main ways to quantify pollution, because very low dissolved oxygen usually indicates dead zones and fish kills, whereas high dissolved oxygen usually indicates a healthy ecosystem. And finally, pH can indicate possible contamination or mineral imbalance in the water, which can cause stress to the species living there.
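These interpretations can be sketched as a small labeling helper. The cutoffs below are illustrative rules of thumb I've chosen for demonstration, not regulatory thresholds; real assessment depends on site, season, and species.

```python
def flag_reading(parameter: str, value: float) -> str:
    """Attach a rough, illustrative interpretation to one sensor reading.
    Thresholds are demonstration-only assumptions, not official limits."""
    if parameter == "dissolved_oxygen":      # mg/L
        if value < 2.0:
            return "hypoxic: risk of dead zones and fish kills"
        return "healthy oxygen level" if value >= 5.0 else "low oxygen"
    if parameter == "ph":                    # standard pH units
        if 6.5 <= value <= 8.5:
            return "typical range"
        return "possible contamination or mineral imbalance"
    if parameter == "specific_conductance":  # microsiemens/cm
        # Illustrative cutoff only; "clean" baselines vary widely by region.
        return "possible pollution" if value > 1500 else "typical range"
    raise ValueError(f"unknown parameter: {parameter}")
```

A row-by-row pass with a helper like this is one simple way to turn raw measurements into the qualitative stories the talk describes.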
So then, in terms of the ingestion of those four parameters, there are thousands of sites per state, and some cities might even have multiple sites where data is being collected, leading to millions of rows per ingestion batch. By ingestion batch, I mean that I planned the ingestion state by state. Using async requests and thread pools to speed up the process is also helpful, and this produces the state-level Parquet partitions. In the end, just three years of data across four parameters comes to 1.8 gigabytes of Parquet. In CSV, it would be much larger.
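A minimal sketch of that batching approach is below, using only the standard library. The `fetch_site` function is a stub standing in for the real API call, and the Parquet write (e.g. via pyarrow) is indicated only in a comment, since the exact schema is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_site(site_no: str) -> list[dict]:
    """Stub for the real API call; in the actual pipeline this would
    request one site's time series from the USGS API and return rows."""
    return [{"site_no": site_no, "param": "pH", "value": 7.2}]

def ingest_state(state: str, site_numbers: list[str]) -> list[dict]:
    """One ingestion batch: fetch all sites for one state concurrently
    using a thread pool, then combine the rows."""
    rows: list[dict] = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for site_rows in pool.map(fetch_site, site_numbers):
            rows.extend(site_rows)
    # In the real pipeline, the combined rows would be written out as a
    # state-level Parquet partition, e.g. with pyarrow:
    #   pq.write_table(pa.Table.from_pylist(rows), f"{state}.parquet")
    return rows
```

Keeping the batch boundary at the state level, as the talk does, gives natural Parquet partitions and a natural unit for retries when a request fails.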
Scaling and remote SSH
The reason I say that is because although many of us have machines that can process a lot of data and collect at scale, if you have limitations in equipment or computing capacity, even tasks we're used to in data collection and processing can become a challenge. Even for three years of data, collecting it with just my machine took about 70 minutes of continuous running. And during that collection, there are always the ethical considerations of making sure you are not stressing servers, especially government infrastructure.
And we have tens of millions of rows of sensor data if we were to go through 30 years, for example. And if we scale up more than four parameters, that would also lead to several more columns. And besides memory limits, some people can also be limited by storage. And that's how remote SSH can help overcome some of these gaps.
So let's jump to how Remote SSH can be a solution to some of these challenges. I'll start by explaining how it works at a higher level. In terms of compute and storage, there are different providers offering virtual machines that we can remote into.
In my talks, I always like to give ideas so that people can try things themselves for free. I usually don't like to recommend this provider or that provider; there are a bunch out there giving free trials and credits. So if you want to try Remote SSH with Positron and you need a VM or a provider, you can just look up the ones that have free credits available. Some of them have very generous free trials.
And what remote SSH in Positron provides is a local UI with the ability to do remote computing. And that's very helpful because you're able to take advantage of Positron's capabilities and the UI while outgrowing your computer's resources and so on.
And what I find really good as well is to pair the expansion of computing capabilities with other good practices when it comes to large data sets, such as using Parquet files, for example, which are fully supported within the Data Explorer and so on, which I'll show later as well. And lastly, that can help with a scalable network throughput for async ingestion pipelines, which is what we have today.
Positron remote SSH features
So with that, I put some screenshots of different areas of the application. The first one I would like to highlight here is that you can pre-configure specific hosts so that you can connect to the desired host. So once you click Connect to Host, there will be a list of your pre-configured hosts. And then you can just go ahead and connect to it. This is very helpful if you have multiple remote environments, because sometimes people, for example, might set up one VM, one box for light tasks and another one for very heavy tasks.
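The pre-configured hosts that the "Connect to Host" list picks up typically come from a standard SSH config file. A sketch of the light-box/heavy-box setup described above might look like this (host names, IPs, and key paths are examples only):

```
# ~/.ssh/config: pre-configured hosts for Remote SSH
Host light-box
    HostName 203.0.113.10
    User ubuntu
    IdentityFile ~/.ssh/light-box.pem

Host heavy-box
    HostName 203.0.113.20
    User ubuntu
    IdentityFile ~/.ssh/heavy-box.pem
```

With entries like these in place, both boxes appear by name when connecting, so switching between the light and heavy environments is a single click.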
Once you're connected, there is a very clear indication in the lower left corner that you are connected. I think that's very helpful for preventing anyone, including myself, from getting lost and not knowing which box you're on. I did play around with multiple boxes, and at some point I had to rely on that indicator to know, oh, I'm still using that one.
And then one other part that I find really helpful is the automated port forwarding, which means applications on the host get forwarded to the appropriate local ports. For someone like me who isn't used to assigning ports manually, that's just generally really helpful. You can still run Shiny apps as usual within Positron, with the automated port forwarding, even when you are inside a box.
Exploring the data
So those are the Parquet files for three years, one per state. And one thing I would like to emphasize here: if I go to Pennsylvania, for example, that's what the data looks like.
We can see here that we have over 20 million rows and eight columns. And one thing I really like about exploring data with Positron, and specifically with the Data Explorer tool, is the ability to interact with your data and get to know it better. Because especially now with AI, it's very easy to fall into the trap of asking AI to do things and then having it mess things up. Whereas once you are actually interacting with the data, you get to know it better, and you can avoid some pitfalls.
So 11% of my rows were for that parameter. And one thing I had noted before but just wanted to show, if you're curious about the data: everyone knows the range that pH usually falls in, up to 14 and so on. If you sort ascending, for example, you can see values of minus 99999. Does anyone know why that pH is so negative? Yeah, just missing data. They just assigned that value.
So yeah, situations like this are exactly the type of thing that, if you're not in close contact with your data, AI can run with and really hallucinate about. And if you were to take the average of all of this, for example, it would be quite bad, because there is a lot...
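The fix for that averaging pitfall is simple once you know the sentinel convention. A minimal sketch, assuming sentinel codes like the one shown in the Data Explorer (verify the exact codes against your dataset's documentation):

```python
from statistics import mean

# Placeholder codes sometimes used for missing sensor data
# (assumed values for illustration; check your dataset's docs).
SENTINELS = {-99999.0, -999999.0}

def clean_mean(values: list[float]) -> float:
    """Average a series after dropping sentinel-coded missing values."""
    kept = [v for v in values if v not in SENTINELS]
    return mean(kept)
```

With the sentinels dropped, the average reflects the real measurements instead of being dragged to a huge negative number by the missing-data codes.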
And what I find cool as well: let's say I want to share this view with someone so they can put it in a notebook or something. I can just convert it to code, to SQL, Pandas, or Polars, and send it to them. That type of exploration is something I'm really interested in within data science, both through notebooks and data exploration in general, so that when people use AI, they also know what they're doing with their data.
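For a filtered, sorted pH view like the one shown above, the generated Pandas code might look roughly like this. The column names (`param`, `value`) are assumptions; the actual convert-to-code output will match the real schema.

```python
import pandas as pd

def ph_view(df: pd.DataFrame) -> pd.DataFrame:
    """Roughly what 'convert to code' could produce for a view filtered
    to pH and sorted ascending (column names are assumed here)."""
    return (
        df[df["param"] == "pH"]
        .sort_values("value", ascending=True)
        .reset_index(drop=True)
    )
```

Because the output is plain Pandas (or SQL, or Polars), the recipient can paste it straight into a notebook and reproduce exactly the view you were looking at.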
It's the Data Explorer within Positron. You don't need to install anything; it's already native. It works with Pandas and Polars DataFrames, pretty much anything.
So he was asking about the changes we had to make to enable Remote SSH in Positron. I joined Posit in April, so to be honest, I don't know all the details of the work that went into it. I'll mention the person who did the implementation, and I'm happy to put you in touch with him. But one thing I enjoy about Remote SSH in Positron is the ability to use it within your usual data exploration workflow. You don't need to swap between different applications or install extensions or any of that. It's just already there, native, batteries included.
Results and benchmarks
So, in terms of the results of some exploration I did for the three-year data collection: the entries with an asterisk are runtimes that were estimated; I didn't actually run those. I used a machine whose performance is about the same as my laptop's as a baseline, as well as some virtual machines with higher computing capabilities. I essentially ran the same task and recorded how long it took on each remote host. And I could get to the point of reducing the runtime from 71 minutes, which is how long it took on my laptop, to 33 minutes, so essentially half the time.
Here are the costs: hourly cost and total cost per full run. Of course, you usually run into issues when doing this; sometimes something stops working. So I wouldn't budget based strictly on how long you expect something to run. But it's a good indicator, essentially.
And the reason I also stopped here is because once you get to a certain point, your computing capabilities don't really matter that much anymore, because you might get rate-limited by the API, and you don't want to stress the server. So it becomes less about the infrastructure you're using and more about the external server infrastructure of the API provider.
So yeah, and then when I think of processing 30 years of data, for example, I don't know what the results would look like, but it's very likely that the difference between my laptop and more powerful machines would be very stark. The machines I was using were from AWS, but there are a lot of virtual machine providers that can be good for this type of exploration. And again, taking advantage of the free trials can be a really good opportunity for anyone interested in outgrowing their laptop with Remote SSH.
Conclusion
So just a few remarks to conclude the presentation. First, as shown here, Remote SSH in Positron can help with any project involving large-scale data collection, storage, and processing. It's very useful because this way, data scientists and data enthusiasts can take advantage of very powerful remote machines while remaining within a data-science-focused IDE. And in general, I'm really excited about building data science and open source tools that can help us understand and improve the world in some way.
So that's part of the reason why I'm really excited about Positron and IDE development in general.
And as in all my presentations, I'd like to thank the people who helped me throughout the process. I'm not a Remote SSH expert, but one of my coworkers is, and he gave a talk at posit::conf about outgrowing your laptop with Positron. If anyone is interested and wants more details about Remote SSH, you're welcome to check it out; it's on YouTube.
And I would also like to thank the United States Geological Survey employees who have been doing a lot of heavy lifting to preserve the much needed resources that exist for those API endpoints and so on. And again, if you are interested, feel free to fork, to clone the repo, to try it out, to collect some data from their API, because again, maybe next year some of the endpoints I used won't be available anymore, so maybe it's the last chance we have to do so, which makes it even more urgent and interesting.
And with that, I want to thank everyone for listening. Let's keep in touch here, my contact information, and again, I'm really interested in tools and data science with a focus on a better world, so if anyone is interested in those type of topics and projects, please contact me, and if you have any issues with Positron or remote SSH, also please feel free to reach out, and I'm happy to help.
Q&A
Have you made any interesting observations from that data you collected, maybe any visualizations?
Yes, I had done a blog post in the past with some visualizations. What I noticed with temperature, for example, is a lot of seasonality, which is very expected. Two of the variables were correlated, which I found interesting as well. I can send the blog post afterwards; I think that would be the most helpful, yeah.
I just wondered what stories you maybe got out of the data and exploring it. Is there something that got you into rivers, like something in your life that happened that made you interested in this?
So yeah, when I was in high school, I was studying a differential equation for modeling pollution, the Streeter-Phelps equation. By pure coincidence, it was modeled on the Ohio River. Years later, I moved to Pittsburgh, and I was like, oh, this is the Ohio River, the famous Ohio River. No one around me knew about that, but I was always like, oh, the Streeter-Phelps equation was modeled on this river. So that's kind of what made me interested in this.
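For reference, the classic Streeter-Phelps oxygen-sag equation, written from memory, describes the dissolved-oxygen deficit $D$ downstream of a pollution source:

```latex
D(t) = \frac{k_d L_0}{k_r - k_d}\left(e^{-k_d t} - e^{-k_r t}\right) + D_0\, e^{-k_r t}
```

Here $L_0$ is the initial biochemical oxygen demand, $k_d$ the deoxygenation rate, $k_r$ the reaeration rate, and $D_0$ the initial deficit. The interplay of the two exponentials produces the characteristic "sag" in dissolved oxygen that sensors like the USGS network can observe directly.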

