
Rodrigo Silva Ferreira - When Rivers Speak: Analyzing Massive Water Quality Datasets - PyData Boston 2025
Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data. Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly. In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll demonstrate that, while Parquet and DuckDB enable scalable exploration of historical records, Remote SSH is paramount for enabling truly large-scale analysis. Along the way, I hope to answer analytical questions that can surface patterns linked to industrial growth, regulatory shifts, and climate change. By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation. Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk will offer both technical methods and sociological lenses for understanding the stories rivers continue to tell.
Transcript
This transcript was generated automatically and may contain errors.
Hello everyone, welcome to an exciting talk. Today we'll be discussing When Rivers Speak: Analyzing Massive Water Quality Datasets Using Remote SSH in Positron. Rodrigo Silva Ferreira is a QA Engineer at Posit, so stay tuned and follow this exciting talk.
Thank you very much. And yeah, so my talk today is When Rivers Speak, Analyzing Massive Water Quality Datasets Using Remote SSH in Positron. My name is Rodrigo. A little bit about myself, just to start with. I'm a QA Engineer at Posit, and I'm originally from Salvador, Brazil. That's my city, by the way. But I'm based in Pittsburgh.
My background is actually in chemistry, so it's very exciting actually to be at PyData Boston with the Python community, because then I can get to learn a lot from the different talks and so on from everyone. My main interests lie at the intersection of statistics, data, and society, especially with how to use data and statistics to understand and improve the world. That's the overarching theme of many of my talks, and that's related to what I'll be presenting today.
Throughout the talk, if anyone has any questions, please feel free to interrupt. I really don't mind.
Overview of the presentation
So just an overview of the presentation: I'll start by talking about rivers as storytellers of human history. Then I'll dive into the U.S. Geological Survey sensor network, and then we'll talk about the technical aspects of how to take advantage of all the massive data that's available through APIs. Then we'll look at Remote SSH in Positron as a potential solution for capturing and making sense of all this data. And in conclusion, I'll wrap up with how data science can be a tool for a better world.
Rivers as storytellers
So beginning with rivers as storytellers, that's something I became very interested in as a chemist who is also interested in the statistics side of things. This is a map of all the rivers in the U.S. that someone built using open source data from the National Hydrography Dataset.
And in terms of rivers, what really makes me very fascinated about them is their ability to record several dimensions of human history from migration patterns to industrialization, ecological resilience, inequality, neglect, and environmental deregulation or regulation and so on. So just as an example, that's a historical map of the Delaware River showcasing the route that George Washington's troops went through.
So that idea of rivers as storytellers of societies and of human history has always interested me. And that's part of the reason why I decided to talk about this today.
And more particularly, along with that theme of rivers as storytellers, I was always fascinated by the amount of data that's produced from these rivers, especially here in the United States. I live in Pittsburgh, and Pittsburgh has essentially the beginning of the Ohio River: two rivers, the Allegheny and the Monongahela, come together to form the Ohio River. And the Ohio River has actually been used for a lot of mathematical modeling of pollution patterns in rivers, unfortunately because of how much pollution goes into it.
So that's what I was interested in as well: the idea of capturing data from the U.S. Geological Survey in order to try to understand patterns in those rivers. And in terms of the size of that network, the United States has one of the largest environmental monitoring systems in the world, with over 2.3 million sites spread all over the country. And many of those sites are capturing data every 15 minutes.
The station at the beginning of the Ohio River, for example, captures a whole set of water quality parameters every 15 minutes. And there is an API streaming that data as we speak, leading to hundreds of millions of measurements across several years of data.
Which leads to two questions. Why should we care about river data? As well as how can we listen to these massive amounts of data that rivers have been sharing?
Water quality data reveals a lot of things that are of societal concern, including shifts in regulation, political priorities, economic development, environmental justice, and climate transformations. In that sense, the data is not just numbers; it is social history, encoding many aspects of our societies.
And in terms of how can we listen to the data that rivers are sharing, that is a challenge. Because both in terms of computation, as well as storage, and so on, it's difficult to be able to capture the magnitude of that data and to be able to process it and analyze it and so on. So that's what made me really intrigued by that problem.
Technical overview
Which leads us to the technical overview of what we'll do today. The first step was to fetch a national site index. That site index contains the identifier assigned to each of the sites spread all over the United States. And the reason we have to be very strategic about fetching is that those identifiers come bundled with a massive amount of data.
A lot of those APIs are quite old, so it's difficult to process all of that and maintain the correlation between each identifier and each site. There's no readily accessible list of all the numbers and all the sites; you really have to make calls to the USGS Water Services API.
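As a rough sketch of that first step, the snippet below builds a per-state query against the legacy USGS site service and parses its tab-delimited "RDB" output. The endpoint URL and parameter names are written from memory and should be checked against the USGS documentation, especially given the upcoming deprecations.

```python
from urllib.parse import urlencode

# Legacy USGS Water Services "Site" endpoint (assumed here; verify
# against the USGS docs, since these endpoints are slated for deprecation).
SITE_SERVICE = "https://waterservices.usgs.gov/nwis/site/"

def build_site_url(state_code: str) -> str:
    """Build a query URL listing monitoring sites for one US state."""
    params = {
        "format": "rdb",       # tab-delimited "RDB" text format
        "stateCd": state_code, # two-letter state code, e.g. "PA"
    }
    return SITE_SERVICE + "?" + urlencode(params)

def parse_rdb(text: str) -> list[dict]:
    """Parse USGS RDB output: '#' comment lines, a header row, a
    column-format row (e.g. '8s 50s'), then tab-separated data rows."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    header = lines[0].split("\t")
    return [dict(zip(header, row.split("\t"))) for row in lines[2:]]
```

Fetching `build_site_url("PA")` and feeding the response body to `parse_rdb` would yield one dict per site, which is the site-to-identifier correlation the talk describes.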
And then we'll go through the async ingestion of three years of data. We are only doing three years because three years is already massive; we could go to 30 years of data and see what happens. One caveat, though, is that data for some stations only became available recently. That's the tricky part: there could be a high level of missingness at some stations.
That's the third topic, scaling to 30 years, followed by structuring the big data with Parquet and DuckDB, and finally interpreting environmental time series as sociological and historical evidence.
So the question was where the data lives. There's the USGS Water Services API that you can make calls to, and at many stations that data is being streamed every 15 minutes. If anyone wants to explore this data, just a heads up that I think in January 2026 a lot of the API endpoints are being deprecated, which makes the whole idea of exploring those datasets extremely relevant and pressing in terms of time.
Water quality parameters
So in terms of temperature, very high values can stress aquatic life, and very low values can slow the metabolism of the organisms living there. Specific conductance is typically correlated with clean or polluted water: low values typically indicate that the water is clean, while high values could indicate pollution from industry and so on. Dissolved oxygen is one of the main ways to quantify pollution, because very low dissolved oxygen usually indicates dead zones and fish kills, whereas high dissolved oxygen usually indicates a healthy ecosystem. And finally, pH can indicate possible contamination or mineral imbalance in the water, which can cause stress to the species living there.
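These interpretations can be sketched as a small labeling helper. The cutoffs below are illustrative rules of thumb I've chosen for demonstration, not regulatory thresholds; real assessment depends on site, season, and species.

```python
def flag_reading(parameter: str, value: float) -> str:
    """Attach a rough, illustrative interpretation to one sensor reading.
    Thresholds are demonstration-only assumptions, not official limits."""
    if parameter == "dissolved_oxygen":      # mg/L
        if value < 2.0:
            return "hypoxic: risk of dead zones and fish kills"
        return "healthy oxygen level" if value >= 5.0 else "low oxygen"
    if parameter == "ph":                    # standard pH units
        if 6.5 <= value <= 8.5:
            return "typical range"
        return "possible contamination or mineral imbalance"
    if parameter == "specific_conductance":  # microsiemens/cm
        # Illustrative cutoff only; "clean" baselines vary widely by region.
        return "possible pollution" if value > 1500 else "typical range"
    raise ValueError(f"unknown parameter: {parameter}")
```

A row-by-row pass with a helper like this is one simple way to turn raw measurements into the qualitative stories the talk describes.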
So then, in terms of the ingestion of those four parameters, there are thousands of sites per state, and some cities might even have multiple sites where data is being collected, leading to millions of rows per ingestion batch. By ingestion batch, I mean that I planned the ingestion state by state. Using async requests and thread pools to speed up the process is also helpful, and this produces the state-level Parquet partitions. In the end, just three years of data across four parameters comes to 1.8 gigabytes of Parquet. In CSV, it would be much larger.
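A minimal sketch of that batching approach is below, using only the standard library. The `fetch_site` function is a stub standing in for the real API call, and the Parquet write (e.g. via pyarrow) is indicated only in a comment, since the exact schema is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_site(site_no: str) -> list[dict]:
    """Stub for the real API call; in the actual pipeline this would
    request one site's time series from the USGS API and return rows."""
    return [{"site_no": site_no, "param": "pH", "value": 7.2}]

def ingest_state(state: str, site_numbers: list[str]) -> list[dict]:
    """One ingestion batch: fetch all sites for one state concurrently
    using a thread pool, then combine the rows."""
    rows: list[dict] = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for site_rows in pool.map(fetch_site, site_numbers):
            rows.extend(site_rows)
    # In the real pipeline, the combined rows would be written out as a
    # state-level Parquet partition, e.g. with pyarrow:
    #   pq.write_table(pa.Table.from_pylist(rows), f"{state}.parquet")
    return rows
```

Keeping the batch boundary at the state level, as the talk does, gives natural Parquet partitions and a natural unit for retries when a request fails.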
Scaling and remote SSH
The reason I say that is because although many of us have machines that can process a lot of data and collect at scale, if you have limitations in equipment or computing capacity, even tasks we're used to in data collection and processing can become a challenge. Even for three years of data, collecting it with just my machine took about 70 minutes of continuous running. And during that collection, there are always the ethical considerations of making sure you are not stressing servers, especially government infrastructure.
And we have tens of millions of rows of sensor data if we were to go through 30 years, for example. And if we scale up more than four parameters, that would also lead to several more columns. And besides memory limits, some people can also be limited by storage. And that's how remote SSH can help overcome some of these gaps.
So let's jump to how Remote SSH can be a solution to some of these challenges. I'll start by explaining how it works at a higher level. In terms of compute and storage, there are different providers offering virtual machines that we can remote into.
In my talks, I always like to give ideas so that people can try things themselves for free. I usually don't like to recommend this provider or that provider; there are a bunch out there giving free trials and credits. So if you want to try Remote SSH with Positron and you need a VM or a provider, you can just look up the ones that have free credits available. Some of them have very generous free trials.
And what remote SSH in Positron provides is a local UI with the ability to do remote computing. And that's very helpful because you're able to take advantage of Positron's capabilities and the UI while outgrowing your computer's resources and so on.
And what I find really good as well is to pair the expansion of computing capabilities with other good practices when it comes to large data sets, such as using Parquet files, for example, which are fully supported within the Data Explorer and so on, which I'll show later as well. And lastly, that can help with a scalable network throughput for async ingestion pipelines, which is what we have today.
Positron remote SSH features
So with that, I put some screenshots of different areas of the application. The first one I would like to highlight here is that you can pre-configure specific hosts so that you can connect to the desired host. So once you click Connect to Host, there will be a list of your pre-configured hosts. And then you can just go ahead and connect to it. This is very helpful if you have multiple remote environments, because sometimes people, for example, might set up one VM, one box for light tasks and another one for very heavy tasks.
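The pre-configured hosts that the "Connect to Host" list picks up typically come from a standard SSH config file. A sketch of the light-box/heavy-box setup described above might look like this (host names, IPs, and key paths are examples only):

```
# ~/.ssh/config: pre-configured hosts for Remote SSH
Host light-box
    HostName 203.0.113.10
    User ubuntu
    IdentityFile ~/.ssh/light-box.pem

Host heavy-box
    HostName 203.0.113.20
    User ubuntu
    IdentityFile ~/.ssh/heavy-box.pem
```

With entries like these in place, both boxes appear by name when connecting, so switching between the light and heavy environments is a single click.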
Once you're connected, there is a very clear indication in the lower left corner that you are connected. I think that's very helpful for preventing anyone, including myself, from getting lost and not knowing which box you're on. I did play around with multiple boxes, and at some point I had to rely on that indicator to know, oh, I'm still using that one.
And then one other part that I find really helpful is the automated port forwarding, which means applications on the host get forwarded to the appropriate local ports. For someone like me who isn't used to assigning ports manually, that's just generally really helpful. You can still run Shiny apps as usual within Positron, with the automated port forwarding, even when you are inside a box.
Exploring the data
So those are the Parquet files for three years, one per state. And one thing I would like to emphasize here: if I go to Pennsylvania, for example, that's what the data looks like.
We can see here that we have over 20 million rows and eight columns. And one thing I really like about exploring data with Positron, and specifically with the Data Explorer tool, is the ability to interact with your data and get to know it better. Because especially now with AI, it's very easy to fall into the trap of asking AI to do things and then having it mess things up. Whereas once you are actually interacting with the data, you get to know it better, and you can avoid some pitfalls.
So 11% of my rows were for that parameter. And one thing I had noted before but just wanted to show, if you're curious about the data: everyone knows the range that pH usually falls in, up to 14 and so on. If you sort ascending, for example, you can see values of minus 99999. Does anyone know why that pH is so negative? Yeah, just missing data. They just assigned that value.
So yeah, situations like this are exactly the type of thing that, if you're not in close contact with your data, AI can run with and really hallucinate about. And if you were to take the average of all of this, for example, it would be quite bad, because there is a lot...
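The fix for that averaging pitfall is simple once you know the sentinel convention. A minimal sketch, assuming sentinel codes like the one shown in the Data Explorer (verify the exact codes against your dataset's documentation):

```python
from statistics import mean

# Placeholder codes sometimes used for missing sensor data
# (assumed values for illustration; check your dataset's docs).
SENTINELS = {-99999.0, -999999.0}

def clean_mean(values: list[float]) -> float:
    """Average a series after dropping sentinel-coded missing values."""
    kept = [v for v in values if v not in SENTINELS]
    return mean(kept)
```

With the sentinels dropped, the average reflects the real measurements instead of being dragged to a huge negative number by the missing-data codes.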
And what I find cool as well: let's say I want to share this view with someone so they can put it in a notebook or something. I can just convert it to code, to SQL, Pandas, or Polars, and send it to them. That type of exploration is something I'm really interested in within data science, both through notebooks and data exploration in general, so that when people use AI, they also know what they're doing with their data.
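For a filtered, sorted pH view like the one shown above, the generated Pandas code might look roughly like this. The column names (`param`, `value`) are assumptions; the actual convert-to-code output will match the real schema.

```python
import pandas as pd

def ph_view(df: pd.DataFrame) -> pd.DataFrame:
    """Roughly what 'convert to code' could produce for a view filtered
    to pH and sorted ascending (column names are assumed here)."""
    return (
        df[df["param"] == "pH"]
        .sort_values("value", ascending=True)
        .reset_index(drop=True)
    )
```

Because the output is plain Pandas (or SQL, or Polars), the recipient can paste it straight into a notebook and reproduce exactly the view you were looking at.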
It's the Data Explorer within Positron. You don't need to install anything; it's already native. It works with Pandas and Polars DataFrames, pretty much anything.
So he was asking about the changes we had to make to enable Remote SSH in Positron. I joined Posit in April, so to be honest, I don't know all the details of the work that went into it. I'll mention the person who did the implementation, and I'm happy to put you in touch with him. But one thing I enjoy about Remote SSH in Positron is the ability to use it within your usual data exploration workflow. You don't need to swap between different applications or install extensions or any of that. It's just already there, native, batteries included.
Results and benchmarks
So, in terms of the results of some exploration I did for the three-year data collection: the entries with an asterisk are runtimes that were estimated; I didn't actually run those. I used a machine whose performance is about the same as my laptop's as a baseline, as well as some virtual machines with higher computing capabilities. I essentially ran the same task and recorded how long it took on each remote host. And I could get to the point of reducing the runtime from 71 minutes, which is how long it took on my laptop, to 33 minutes, so essentially half the time.
Here are the costs: hourly cost and total cost per full run. Of course, you usually run into issues when doing this; sometimes something stops working. So I wouldn't budget based strictly on how long you expect something to run. But it's a good indicator, essentially.
And the reason I also stopped here is because once you get to a certain point, your computing capabilities don't really matter that much anymore, because you might get rate-limited by the API, and you don't want to stress the server. So it becomes less about the infrastructure you're using and more about the external server infrastructure of the API provider.
So yeah, and then when I think of processing 30 years of data, for example, I don't know what the results would look like, but it's very likely that the difference between my laptop and more powerful machines would be very stark. The machines I was using were from AWS, but there are a lot of virtual machine providers that can be good for this type of exploration. And again, taking advantage of the free trials can be a really good opportunity for anyone interested in outgrowing their laptop with Remote SSH.
Conclusion
So just a few remarks to conclude the presentation. First, as shown here, Remote SSH in Positron can help with any project involving large-scale data collection, storage, and processing. It's very useful because this way, data scientists and data enthusiasts can take advantage of very powerful remote machines while remaining within a data-science-focused IDE. And in general, I'm really excited about building data science and open source tools that can help us understand and improve the world in some way.
So that's part of the reason why I'm really excited about Positron and IDE development in general.
And as in all my presentations, I'd like to thank the people who helped me throughout the process. I'm not a Remote SSH expert, but one of my coworkers is, and he gave a talk at posit::conf about outgrowing your laptop with Positron. If anyone is interested and wants more details about Remote SSH, you're welcome to check it out; it's on YouTube.
And I would also like to thank the United States Geological Survey employees who have been doing a lot of heavy lifting to preserve the much needed resources that exist for those API endpoints and so on. And again, if you are interested, feel free to fork, to clone the repo, to try it out, to collect some data from their API, because again, maybe next year some of the endpoints I used won't be available anymore, so maybe it's the last chance we have to do so, which makes it even more urgent and interesting.
And with that, I want to thank everyone for listening. Let's keep in touch here, my contact information, and again, I'm really interested in tools and data science with a focus on a better world, so if anyone is interested in those type of topics and projects, please contact me, and if you have any issues with Positron or remote SSH, also please feel free to reach out, and I'm happy to help.
Q&A
Have you made any interesting observations from that data you collected, maybe any visualizations?
Yes, I had done a blog post in the past with some visualizations. What I noticed with temperature, for example, is a lot of seasonality, which is very expected. Two of the variables were correlated, which I found interesting as well. I can send the blog post afterwards; I think that would be the most helpful, yeah.
I just wondered what stories you maybe got out of the data and exploring it. Is there something that got you into rivers, like something in your life that happened that made you interested in this?
So yeah, when I was in high school, I was studying a differential equation for modeling pollution, the Streeter-Phelps equation. By pure coincidence, it was modeled on the Ohio River. Years later, I moved to Pittsburgh, and I was like, oh, this is the Ohio River, the famous Ohio River. No one around me knew about that, but I was always like, oh, the Streeter-Phelps equation was modeled on this river. So that's kind of what made me interested in this.
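For reference, the classic Streeter-Phelps oxygen-sag equation, written from memory, describes the dissolved-oxygen deficit $D$ downstream of a pollution source:

```latex
D(t) = \frac{k_d L_0}{k_r - k_d}\left(e^{-k_d t} - e^{-k_r t}\right) + D_0\, e^{-k_r t}
```

Here $L_0$ is the initial biochemical oxygen demand, $k_d$ the deoxygenation rate, $k_r$ the reaeration rate, and $D_0$ the initial deficit. The interplay of the two exponentials produces the characteristic "sag" in dissolved oxygen that sensors like the USGS network can observe directly.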

