Resources

Ben Joaquin | Data Science in Meatspace | RStudio (2020)

"The Data Science community is dominated by folks doing amazing work with data that starts in and never leaves cyberspace. This talk is about best practices and playbooks for doing data science that involves meatspace (the opposite of cyberspace) and why R is such a great language for working with data that originated in the physical world. While the concrete examples in this talk will mostly come from the manufacturing space, where I have the most experience, I believe the themes are relevant to many meatspace workflows. We'll talk through effective playbooks that can help you navigate common tasks throughout the life-cycle of a project. We’ll also weave in how R’s glorious package ecosystem, including Tidyverse, can be combined with other languages like python, and with enterprise products like RStudio Connect to great effect. Specifically, we'll discuss practices in these areas: best practices for data collection in meatspace; the importance of quantifying measurement system error; collecting the correct data for training computer vision models; and the rarely discussed cost of maintaining models in production."


Transcript

This transcript was generated automatically and may contain errors.

Thanks, everybody, for joining today. We're going to talk about data science in meatspace.

I have a couple of objectives with this talk. I want to share, for anyone that's interested in doing data science in meatspace, that it's incredibly rewarding if it's your niche. Niches are filters, but that's a good thing. If it's what you're into, it's going to be a heck of a lot of fun. And for those who are already doing work in this space, I want to share some of my experiences. Maybe it'll be helpful for you. And I think, honestly, a lot of this stuff is pretty universal. Similar workflows and paradigms apply, whether you're doing data science in meatspace or elsewhere.

So what is meatspace? It's the opposite of cyberspace. So it's the physical world. It's a really weird term. There's some confusion on the internet about where it came from. It's this word that's the opposite of a concept that predates it.

Background in manufacturing

So I worked at Tesla for a bunch of years. I worked on all of these vehicle programs. We did some other programs with Toyota and Daimler. We also made some batteries. And what I was working on there specifically was around manufacturing for the battery.

So what is the type of work that you're doing as a data scientist in a manufacturing place like Tesla? So some of the work involves setting test limits for quality control, right? Looking at empirical data, figuring out what the right places are to draw some test lines. Doing optimization. So there's process parameters, knobs, dials, levers. There aren't actually any levers. But you have to optimize this process and set the right parameters to make it work really efficiently, to reduce costs or improve yields. A lot of the things we worked on were vision systems. Sometimes it was for measurement. There's some cool cases where we would take a part, image it, find locations of specific components, and then you could feed that back to a robot that was doing the process. We also did a lot of vision systems to do defect detection.

Now, what I do is, like, what's more fun than robots assembling inanimate objects? Well, it's robots that assemble plants. And it's interesting. It's very similar in a lot of ways. I still kind of think of it as a manufacturing role. But instead of assembling inanimate objects, it's sort of life support in combination with all the same automation equipment and robots. So I work at this company called Plenty. We're based in South San Francisco. And what we do is we grow plants indoors.

And we do kind of the same types of work. So there's a lot of work defining test limits. But it's unique here because we're doing phenotyping as well, right? So you and your sibling are both human beings and that's good. But you look different, right? And you have different morphologies. And so we need to capture those things in this context. We still have the same concept of optimizing our process parameters. But it's really about modeling plant growth and figuring out in this controlled environment agriculture how can we give it the right inputs to maximize growth and flavor and things like that. We also have a lot of vision systems for the same types of things. And there's some unique stuff, too, like I said about modeling plant growth. It's kind of like a longitudinal set of data that we have. We have a cohort of plants that we're growing and we track them over time.

What makes meatspace different

So is it actually different? This is the escalator that's right outside this door. And it's broken. That's a thing in meatspace. Systems age and degrade, and their behaviors change over time. Tiles fall to the sidewalk. Data isn't the default. I'm pretty positive that this hotel doesn't have any telemetry coming off of their escalator to know that it's broken. Instrumentation is really costly. It's not because people are irrational; it's a rational decision. It's costly to put sensors out there. All measurement systems have error. You don't know what's in your environment. You can't list it out. You can't restart it.

Listening to Jenny's talk this morning, a lot of the debugging techniques that she described just aren't possible in the physical world. There are no dev and staging environments; you can't copy and paste. Consequently, there's no version control. There's a huge variety in data structures. You can't make a reprex. And there are bugs. Like, literal bugs.

I personally have been responsible for automation equipment that stopped working properly because it impaled a bug. So this is still a thing, and it's a part of the real world. And so one of the things that I think is really critical is that data scientists working in this space have a responsibility to go and see. There's this concept that comes from the Toyota Production System, genchi genbutsu, which I'm sure I'm not pronouncing correctly, but it's this concept of going to where the work is taking place, being a part of it, seeing it.

One of the things that I think is really critical is that data scientists working in this space have a responsibility to go and see.

The meatspace data science workflow

So what does this all mean? We're all familiar with this data science workflow that comes from R for Data Science. And now that you have some context about the bugs and the tiles that fall in the world, let's see what a workflow might look like. R for Data Science starts with: first you must import your data into R. I just really haven't ever come into a production space and had people say, yeah, here's the data. You have to go get it, usually in conjunction with someone else. So yeah, generally it doesn't exist. And not only do you have to get the data, but you want to make good data, right? Data that's actually going to be useful.

So this is kind of a call to action: if you're working in this space, I think you're going to find a great deal more success, and add a lot more value to whatever business you're working in, if you can partner with the people that are responsible for the equipment or the production space or whatever it is that's happening in meatspace, and work with them to make sure you're getting good data in. Because usually that's not their domain; that's not their responsibility. They're doing it because someone told them they had to. But if you partner with them, they want to do the right thing, they want to add value. So it requires partnership. And I'm not going to talk about the persist side of this. I'm just going to say: please don't make your own database.

Prototyping first

So say that someone comes to you and says, hey, I'm going to go deploy this camera or something like that, will you please work on an image processing pipeline for me? You should stop them and say, no, no, no, first we have to prototype.

So take however much you get paid per hour, times your estimate for how long this project is going to take, times two if you're good at estimating (or like 20 if you're not), plus the opportunity cost of this project being a total flop. That is definitely more than the cost of a GoPro. So you should prototype it. And when you do prototype, you get to build some intuition for how solvable this problem is going to be. Which is really important.
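The back-of-the-envelope math above can be sketched like this. All of the numbers are hypothetical placeholders (the hourly rate, hours, overrun factor, and GoPro price are not from the talk); the point is only that the risk of skipping the prototype dwarfs the prototype's cost.

```python
# Back-of-the-envelope check: is prototyping worth it?
# All numbers below are hypothetical placeholders.

def expected_project_cost(hourly_rate, estimated_hours,
                          overrun_factor, flop_opportunity_cost):
    """Rough cost at risk if you skip prototyping and the estimate is wrong."""
    return hourly_rate * estimated_hours * overrun_factor + flop_opportunity_cost

prototype_cost = 400          # roughly the price of a GoPro
risk = expected_project_cost(
    hourly_rate=75,           # what you get paid per hour
    estimated_hours=160,      # your estimate for the project
    overrun_factor=2,         # x2 if you're good at estimating, ~x20 if not
    flop_opportunity_cost=10_000,
)

print(risk > prototype_cost)  # → True: the prototype is the cheap option
```

Whatever realistic numbers you plug in, the inequality comes out the same way, which is the talk's argument for always prototyping first.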

I don't know where I first heard this, but I searched, and it turns out that someone referenced a Ph.D. physicist saying: if you never miss a flight, you're spending too much time in the airport. Which I think is great. And I totally miss flights sometimes. So what I'm getting at here is, again, having intuition about how solvable your project is going to be is really important. You should make a conscious decision that the thing that you're about to embark on is speculative, that it might not pan out, and that it's still the right decision for you and your team and your business and all of that.

Instrumentation and calibration

So you prototype this thing. You're going to go for it. You want to kick off. You're going to need to have some instrumentation, and you want to be thoughtful about that. Instrumenting probably isn't driven by you, generally. You're going to have somebody that you're going to partner with who's responsible for physical components. Definitely go with industrial instrumentation. It can be really expensive, but it's totally worth the cost. Raspberry Pis are cool, but they don't work that well when they're filled with water.

Integration costs often dominate projects. So if you're going to have to deploy new types of sensors or instrumentation in the future, try to think about that. Pick a platform that is going to be scalable appropriately with the work that you can anticipate. And always remember that measurement systems have to capture the signal that you're going to use. I've totally seen this happen many times. Someone gets a sensor, they plug it in, yay, now let's measure whatever it is and we're going to control on it. But the resolution of that sensor just doesn't include the signal that you actually need. And hopefully we'll figure that out from prototyping.
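One hedged way to catch the "sensor can't see the signal" failure early is to compare the sensor's resolution against the variation you actually need to detect. The sample readings, the 0.5-unit resolution, and the 5x margin below are all made-up illustrations, not values from the talk:

```python
import statistics

def sensor_can_resolve(samples, sensor_resolution, margin=5.0):
    """Heuristic: the signal's spread should be several times the sensor's
    smallest distinguishable step, or you're mostly measuring quantization."""
    return statistics.stdev(samples) >= margin * sensor_resolution

# Hypothetical prototype data: the process variation we care about
process_samples = [20.1, 20.4, 19.8, 20.9, 19.5, 20.6]

# A sensor with 0.5-unit resolution can't see this ~0.5-unit variation
print(sensor_can_resolve(process_samples, sensor_resolution=0.5))   # → False
```

Running a check like this on prototype data, before buying and integrating the production sensor, is exactly the kind of thing the prototyping step is for.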

So just because you found gold in dirt doesn't mean you can turn dirt into gold. My friend and colleague Will Bishop says this. You're not an alchemist. You're not a magician. So, yeah. So keep that in mind. And make sure that you communicate that to other people. Because lots of people think that you are a magician or an alchemist.

Just because you found gold in dirt doesn't mean you can turn dirt into gold.

So here's a really cool example of us doing this really successfully. Imagine that you want to take a measurement from the bottom of the plant to the first leaf on the shoot. Typically, again, the default is that there's no instrumentation out there, so someone is physically holding that plant and physically taking a measurement device to it. So something we've done that's been really successful is, okay, cool, somebody wants to move that into digital space. So we're going to take a picture of the plant, and then we get the person to take that measurement from the picture. And what that does is force alignment: the resolution necessary for that person to do the task has to be in the picture in order for them to execute on it successfully. And so that can be a good paradigm.

So, calibration. Calibration is also really critical, so let's talk about that quickly. There's tons and tons of great content in the world about this. But you want to quantify your measurement system error. Every measurement system has error, whether you're aware of it or not. Like, if you wanted to characterize the temperatures in this room, whatever device you're going to use is going to have error. You need to understand what that is and what the caveats are.
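A minimal way to quantify that error, assuming you can repeatedly measure a stable reference of known value, is to split the disagreement into a systematic part (bias) and a random part (repeatability). The reference value and readings below are made up for illustration:

```python
import statistics

# Hypothetical: repeated readings of a reference whose true value is 21.0
reference_value = 21.0
readings = [21.2, 20.9, 21.1, 20.8, 21.3, 21.0, 20.9, 21.1]

bias = statistics.mean(readings) - reference_value   # systematic error
precision = statistics.stdev(readings)               # random error (repeatability)

print(f"bias: {bias:+.2f}, repeatability (1 sd): {precision:.2f}")
```

Those two numbers, and the conditions under which they were measured, are the caveats that should travel with every dataset the instrument produces.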

I would consider using a derived metric. So, again, let's take the temperature in this room. It has some set point, right? There's a thermostat here somewhere that's trying to achieve a specific temperature. If you wanted to have an alert thrown when the HVAC system performs out of spec, don't set that limit on the actual temperature; set it as a margin on the set point. And then consider having a standard data model for your data. If you work with a lot of temperature data, you might want to have your own standard structure for that data, so that across different instrumentation approaches and different suppliers, you can merge all of that stuff and plug it into a consistent approach, which you can then work on agnostic of how you got that data to begin with.
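Both ideas can be sketched together: alert on the margin from the set point rather than on an absolute temperature, and normalize records from different suppliers into one standard shape. The vendor field names, the 1.5-degree margin, and the record formats below are hypothetical:

```python
def out_of_spec(reading_c, set_point_c, margin_c=1.5):
    """Derived metric: alert on deviation from the set point,
    not on an absolute temperature threshold."""
    return abs(reading_c - set_point_c) > margin_c

def normalize_vendor_a(record):
    """Vendor A reports Celsius under 'temp_c' (hypothetical format)."""
    return {"sensor_id": record["id"], "temp_c": record["temp_c"]}

def normalize_vendor_b(record):
    """Vendor B reports Fahrenheit under 'value_f' (hypothetical format)."""
    return {"sensor_id": record["sensor"],
            "temp_c": (record["value_f"] - 32) * 5 / 9}

# Two suppliers, one standard data model, one downstream check
a = normalize_vendor_a({"id": "hvac-1", "temp_c": 22.9})
b = normalize_vendor_b({"sensor": "hvac-2", "value_f": 69.8})

for rec in (a, b):
    print(rec["sensor_id"], out_of_spec(rec["temp_c"], set_point_c=21.0))
```

Everything downstream of the normalizers is agnostic of which supplier produced the reading, which is the point of the standard data model.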

And if you do all these things, I think there's an opportunity to create a virtuous cycle, right, where you're then collecting good data in the meatspace. You're able to put it into a database that you didn't create. And then you import that stuff. You go through your data science workflow. You build understanding. And then that has the ability to feed back into additional instrumentation and changes that might be made out in the real world.

And, yeah, I want to thank my team. I've worked with a bunch of really tremendous people. Will, Derek, Dana, who just gave a talk on scales, so you missed it, but you should definitely watch it. Chris, Sasha, Nico, Nate, and my wife, Corey. Thanks.

Q&A

Thanks, Ben Joaquin. We have a few questions queued up for you. Can you give an example of some of your work going through the whole meatspace plus data science workflow at your company?

Yeah, sure. So a good example would be a piece of automation equipment at my current company that's actually super cool. What it does is it has a drum. So imagine a cylinder that has a bunch of holes on it. Those holes are attached to a chamber that we draw a vacuum on. And then we rotate the drum over a bed of seeds, and the vacuum sucks seeds up onto it. And as it rotates, it then drops the vacuum, and the seeds fall onto a tray. And so someone wanted us to come up with a vision system that would allow us to detect the success of sucking those seeds up onto the drum. And we said, yeah, sure, that sounds like a cool idea, but please go take a GoPro and stick it out there and tell me if you can actually do that yourself. And what we got into there was that they hadn't thought about the fact that they actually cared a lot about the quantity of seeds that it sucked up. What the human was actually able to do with that video was just see whether seeds were present or absent. But whether there were two seeds or three seeds was actually incredibly difficult to tell. And so we learned a lot there, and we were able to restructure the project based on that understanding.

That's really cool. Thank you. Oftentimes our coders aren't brought into the conversation early enough, i.e., only after instrumentation is decided, et cetera. Any advice on how to approach that situation?

Yeah. I think my advice there would be to try to build those relationships. I've definitely lived that time and time again. I think it's often a secondary thought that you're going to have some machine-learning, AI-driven manufacturing facility, and so, like this person's pointing out, it's not a part of the conversation in the design phases. But I think the best practice there is to keep working with those people. They're trying to do the best that they can, and they probably don't have a good understanding of what's possible. That's a big thing that we do: we try to bring a solution space that the rest of the organization just isn't aware of, isn't thinking about, and bring that into the meatspace, into the world that they're trying to design and build. And yeah, I think you can iterate through that and make it clear that if we want to have the types of whiz-bang things that people want, it needs to be designed in and thought about at the beginning phase.

How do you switch to a career in meatspace if your background is all in tech slash virtual spaces?

Well, you can talk to me, because we're hiring, especially if you're interested in computer vision. I think the reality is, as we were talking about today, there was a birds-of-a-feather session for manufacturing folks, and there's just not a great community out there. So maybe we need to build that up. There are a number of companies that do things like manufacturing. That's where I have the most experience.

Fantastic. Thank you, Ben Joaquin.