Resources

Championing modern science workflows to benefit dairy farmers (Mark Neal, DairyNZ)

Speaker(s): Mark Neal

Abstract: Dairy research faces data volume and variability challenges. DairyNZ's Modern Science Workflows project addressed this via infrastructure, capability, and business disciplines. Infrastructure: Snowflake cloud data warehouse and Posit Workbench for R. Capability: R data science courses (University of Waikato and internal). Business: meetings, documented code, best practices, and moving towards open science with GitHub at the organisational level.

Results: We have enabled large-dataset projects utilising machine learning which would not have been possible before (e.g., animal sensor data analysis). Training has been well received. Modern workflows enable reproducibility, and data science skills should become standard scientific training.

posit::conf(2025)

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Right, thanks everyone. I hope you enjoyed an ice block over the break. So yeah, I'm talking about what we did at my organisation to improve our science workflows to benefit the people we work for, which is dairy farmers. And hopefully out of that journey there's a few lessons that might be useful for you.

So, a little bit about our organisation, DairyNZ. Our purpose is to progress a better future for dairy farmers. A couple of quick facts about the New Zealand dairy sector: we produce more milk than Wisconsin, your dairy state (not a competition). Also more milk than California (again, not a competition). But there aren't many people in New Zealand, which means that 95% of our dairy products get exported all over the world.

So it's important to us that we have products that people love and enjoy and are happy with how that's produced. So a lot of our work at DairyNZ is about how can we help our farmers be more profitable, sustainable and competitive going forwards.

So this is where I think everyone wants to be. They want to be a happy cow, doing useful work, with tools that are fit for purpose. And I can't say we're always there, but I know about five years ago we looked a lot more like this. We were struggling to do what we needed to do, and the tools were just not fit for purpose.

So this talk is a little bit about how we channeled some of that frustration with a bit of boldness to then create change and deliver on impact.

Why change was needed

So why do we need to change? Maybe a third of you will recognise this thing. This is a floppy disk that holds about one meg worth of data. So historically, 10, 20 years ago when we were doing a farm trial, we'd get a few observations per day. So you could have a major trial where all the data would fit on about six of these floppy disks.

However, today we have wearable devices. You can imagine Apple Watches for cows. Devices like this are generating 60 observations a second and you're putting this across hundreds and sometimes thousands of cows and that's generating a metric crap tonne of data.

And so if you put all the data for a three-month trial today onto floppy disks, the tower of floppy disks would be taller than the Burj Khalifa, like the tallest building in the world.

Also at the same time, we had to move away from linear science, where scientists would design an experiment, run it, throw some data at the statistician, produce some plots and tables, publish it, job done. When we're looking at this more complex data, the work becomes more iterative: look at the data, understand what it's trying to tell us, come up with some hypotheses, test those. A very different way of working.

So these are not all the reasons, but these were a couple of the obvious reasons that created that burning platform for why we needed to do something different.

Infrastructure

So we did three things. The first was infrastructure. If you've opened the R for Data Science book, this diagram is probably quite familiar. Anyone who's got R can get started and use this workflow. But when you want to work at scale, you want to add a few things.

So for us, we've got a data platform; we chose Snowflake, but you might choose Databricks or something else. For sharing and versioning code, we use GitHub; again, there are other options there. When smoke is starting to pour out of your laptop, it's time to move that compute to the cloud, so we use Posit Workbench for that. And then when we've created artifacts we want to share, tools, dashboards, et cetera, we push those through to Posit Connect.

So what guided some of those choices? Well, it needed to be fit for purpose, and for us that meant performance. It needed to be scalable as we did more and more of this type of work, and reproducibility is just really good practice. Also, we were starting from a low base, so we wanted a relatively low cost to start; some open-source tools fit the bill there. And you also need ease of management by your IT team, because that forms part of your total cost of ownership.

You're going to need some confidence in the vendor, but you also want to limit technological lock-in wherever possible, so you have options down the track; the landscape might change. For us, support for both R and Python mattered: we're primarily R now, but I can see some things we might want to use Python for in the future. Depending on where you're at and what your situation is, you might make different choices.

And all this involves spending money, so you're going to need some allies from around the organisation (extra points if you get IT on board), and then you're going to have to get the senior buy-in. They're the people who are going to be writing the cheques for some of this stuff.

Skills and capability

So the second of the three things: skills and capability. When we started down this journey, we ran R for Data Science as an internal course. We ran it as a cohort, a bunch of people all learning at the same time, and the temptation in that setting is to jam it all into, say, three days: bring everyone in, smash them full of knowledge, job done.

But I took a lesson from the RStudio education material that used to cover spaced learning: if you learn a bit this week, you've got time to digest it and use it, then more the next week, et cetera. Spaced learning as a concept, I think, is quite useful.

There were a few pros and cons to running it externally, and we decided to go external for subsequent versions and ran it through the university. What I like about doing it externally is that you get commitment from both the person and their manager. Even though it doesn't cost a lot, the fact that it costs something means the person and manager have to have a discussion and agree that this is a priority, and that means putting aside time to do it properly, which is not necessarily what happened with the internal course. If people got busy and it wasn't a priority, it didn't happen.

I also like the summative assessment, assessment at the end of the course. Most of these people have been to university, so they understand there's going to be a test at the end, and they'll need some minimum level of knowledge to pass it and not be embarrassed in front of their peers. So I like that.

Working with the university, though, you do have to get on the same page, because they might have different ideas about how they want to teach it and about the place of AI tools in teaching it. From our perspective, we were quite pro AI tools. We want people to learn something, but when they're in the workplace they're going to be using AI tools, so we want them to learn how to work with AI tools as part of that process.

So one of the lessons here: we tried internally, and it didn't work. In this case it's about motion: try something, and if it doesn't work, try it a different way; just having that movement towards where you want to be is important. We did run an internal stats course for a smaller subset who needed that extra statistical knowledge, and then we ran GitHub and Snowflake lessons as well so that people could understand the whole piece.

Business discipline and continuous improvement

And then the third of the three things: a bit of business discipline around how we use these tools, and a continuous improvement mindset. People often forget to celebrate the wins, but when people do something cool, you want to acknowledge that, even just to say, that's cool. Or when you're helping someone with something and they've overcome a barrier, we want to really recognise that.

On the keep-learning piece, we run our own internal R community. We have external speakers come in, and some colleagues we work with externally come along as well. We want a place where people can, if not learn best practice, then at least learn good practice. So we have an internal getting-started book that's the first port of call when they have a question. You can of course go to other people afterwards, but if a lot of the key knowledge is there, that saves the friction and time of going around asking people, because it often ends up being the same people.

So that's what we did. And persistence, persistence is important. It's always easy when you're short of time, for yourself or your colleagues, to start taking shortcuts, but that's not the way to long-term performance. We want to think beyond the next deadline, about what's best for us over our lifetime of work.

What we delivered

So what have we delivered? Some use cases that might be interesting. We wanted to really collect quality data. Here we've got someone out in the field collecting pasture data, cutting pasture for our calibrations, which is great when it's sunny. It's not always sunny, though, and then the paper record gets dropped in a puddle and someone rides the motorbike over the top. You can imagine what that does for data quality.

So what we've moved to is digital-first data entry with a ruggedised device in the paddock. We built an app so that we could capture the data digital-first, and then, boom, it's off to Snowflake pretty quickly.

When you start having data in close to real time, you can start doing reports with it. When we run a trial or experiment, we might have one group of cows on what we call a mini-farm or farmlet and another group on a second mini-farm, fed different things, for example, and you want to understand how that's going. You can have markdown or Quarto reports where the scientist and the technical team on the ground can see what's happening to feed supply, what's happening to milk production, et cetera, get that real-time feedback, and make sure everything that needs to be working is working.

Another thing we do is get information to farmers. This is a project called Connected Farm. You can see this farmer here: it's early in the day and they're thinking, what's this day going to be like? Is it going to be particularly hot? Might the cows experience some heat stress later in the day? So we did some experimental work with a device that measured the internal temperature of the cow, combined it with our weather station, built a predictive model, and there's a prototype delivering that to a phone or via a dashboard, so the farmer can put interventions in place if there is going to be a risk of heat stress.
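The talk doesn't give details of the predictive model, so the following is purely an illustrative sketch of the shape of such a pipeline: a toy linear model relating weather-station readings to internal cow temperature, then flagging forecast conditions that cross a heat-stress threshold. All data, column meanings, coefficients, and the 39.4 °C threshold are invented for illustration, not DairyNZ's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: ambient temperature and humidity from a weather
# station, and internal cow temperature from a sensor (values invented).
ambient = rng.uniform(10, 35, size=200)
humidity = rng.uniform(40, 95, size=200)
internal = 38.5 + 0.02 * ambient + 0.005 * humidity + rng.normal(0, 0.05, 200)

# Fit internal temperature as a linear function of the weather inputs.
X = np.column_stack([np.ones_like(ambient), ambient, humidity])
coef, *_ = np.linalg.lstsq(X, internal, rcond=None)

# Given tomorrow's forecast (a hot humid hour, then a mild one), flag hours
# where predicted internal temperature crosses an illustrative threshold.
forecast = np.array([[1.0, 32.0, 90.0], [1.0, 18.0, 60.0]])
predicted = forecast @ coef
at_risk = predicted > 39.4
print(predicted.round(2), at_risk)
```

In practice you would fit against the logged sensor data and validate any threshold against observed heat-stress events; the point is just the shape of the pipeline: sensor and weather data in, a risk flag out to a phone or dashboard.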

Other tools for farmers. Historically, for a lot of our economic information, we'd collect data from thousands of farms, and at the end of the financial year we'd start building a report on it and create a PDF. A PDF report is okay, I guess, particularly when a lot of people want it in hard copy. But we moved to a markdown report, and we'll move to Quarto eventually, to deliver an interactive report, which is substantially better. And once we'd built the mechanics for an interactive report, it was only another step to something more forward-looking. We could build in a forecast, and so now we've got a Shiny dashboard on our website that farmers can refer to, to see how costs are tracking for different regions and ownership structures, and then they can start digging into the line items for those costs to see what's happening across the country.

And then a final example in terms of strategic management. So we have farmers that are thinking about, okay, what do we need to do for the long term? And quite topical is around greenhouse gas emissions and how do we reduce the footprint per unit of product? But we still want these farmers to be profitable.

So we did some work with Fonterra, which processes about 80% of the milk in New Zealand, a large cooperative owned by farmers. We wanted to understand the farms in what we call quadrant one: the farms that are high for profitability but have low emissions intensity of production, low greenhouse gas emissions per unit of product.

And when we looked at that data, we found that these high-profit, low-emissions-intensity farms were scattered all over the region. This region is one of our main dairy regions, in the South Island. The good news is that you don't have to be in a particular place to achieve low emissions and high profitability; it comes down to decisions you can make to either be in that place or move towards it. So every farmer has an opportunity to do better in that space. So yeah, that was really a tremendous piece of work.

Key lessons

So when you identify a problem, as we identified a problem with what we were doing, problems need a someone. Someone could be me, or someone could be you. One of the key tips I took away from this experience is that motion, moving towards where you want to be, was far more important than waiting until everything was perfect and all your ducks were in a row. As long as you're moving in the right direction, that's a great start.

If you want to get that scale across your organisation, the vision is great, but I think you need two other things: the allies, so you've got other people asking for the same thing, and then that senior buy-in. As I said, they're the people who are going to write the cheques for the infrastructure you need. It's great to have a plan; my experience is that you also need a lot of persistence if you want to get to that impact. The bad news is that you may well, like me, need more persistence than you'd hoped would be required. Either way, it's going to need some persistence.

In summary, I would say we're all someones. If we identify a problem, we can all plausibly, I think, have much more impact than we realise if we really pull together a plan and put some effort and persistence in behind it. Thank you.


Q&A

Thank you, Mark. We have time for a few questions. The first question is from Scott. What are one or two surprising or unexpected insights you had in managing a departmental change to use Posit products, asking from an organisation going through a similar process?

Actually, some of the more surprising things came just recently, when we looked to test some options. Could we more tightly integrate Snowflake and Posit Workbench, for example, to reduce the credential overhead of one cloud thing talking to another cloud thing? We've had some success with that, and I think that's part of where we'll go in the future. You've got to be quite clear on what your requirements are. For example, Connect in Snowflake at the moment doesn't allow for public-facing products, dashboards, et cetera, and that's a key piece of what we do. Be clear on your requirements and you'll find the right tool, or the right combination of tools.

Another question, from Jorge: what advantages are there in using Posit Workbench for cloud computing versus using Snowflake or Databricks?

If it's just aggregation, filtering, et cetera, you can push a lot of that compute back to Snowflake, but when you want to do simulation-based stuff, there are some things you can't push back to Snowflake, because it's not designed to do that. Where possible, absolutely push it back to Snowflake; I think that's the most efficient and cost-effective way to do the data compute. But for the simulation or more complex stuff, you're still going to need a chunky thing on the other end. It depends on what your needs are.
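That split, push aggregation and filtering down to the warehouse and keep simulation client-side, is the same pattern in any stack. DairyNZ do it from R against Snowflake; here is a minimal Python sketch with an in-memory SQLite database standing in for the warehouse (the table, columns, and numbers are invented for illustration):

```python
import random
import sqlite3

# In-memory SQLite stands in for Snowflake; the pattern is identical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE milk_yield (cow_id INTEGER, litres REAL)")
rows = [(cow, 18.0 + cow) for cow in range(1, 4) for _ in range(10)]
con.executemany("INSERT INTO milk_yield VALUES (?, ?)", rows)

# Push the aggregation to the database: only 3 summary rows come back,
# not all 30 raw observations.
daily_means = con.execute(
    "SELECT cow_id, AVG(litres) FROM milk_yield GROUP BY cow_id ORDER BY cow_id"
).fetchall()

# Simulation-style work stays client-side, where the warehouse can't help.
random.seed(1)
simulated = [(cow, mean + random.gauss(0, 0.5)) for cow, mean in daily_means]
print(daily_means)  # [(1, 19.0), (2, 20.0), (3, 21.0)]
```

The design choice is about data movement: the grouped query ships three rows over the wire instead of every sensor reading, while the stochastic step runs where a full programming language is available.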