
It's Abstractions All the Way Down... - posit::conf(2023)
Presented by JD Long

Abstractions rule everything around us. JD Long talks about abstractions from the board room to the silicon. Over 20 years ago Joel Spolsky famously wrote, "All non-trivial abstractions, to some degree, are leaky." Unsurprisingly, this has not changed. However, we have introduced more and more layers of abstraction into our workflows: virtual machines, AWS services, WASM, Docker, R, Python, data frames, and on and on. But then on top of the computational abstractions we have people abstractions: managers, colleagues, executives, stakeholders, etc. JD's presentation will be a wild romp through the mental models of abstractions and a discussion of how we, as technical analytical types, can gain skill in traversing abstractions and dealing with leaks.

Materials: https://github.com/CerebralMastication/Presentations/tree/master/2023_posit-conf

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: It's abstractions all the way down...
Session Code: KEY-1161
Transcript
This transcript was generated automatically and may contain errors.
You know the drill now. I'm about to introduce the next keynote speaker, JD Long, with a poem.
Introducing JD Long, a legend in the field. Open source wizard, his knowledge unsealed. Since 2002, he pondered with might. Open source tools, can they take flight? Look at risk management, he knows it well. 13 years deep, with stories to tell. With R and Python, he crafts his art. JupyterLab, RStudio, all play their part. In Richmond, Virginia, he makes his home. With wife and daughter, they brightly roam. A recovering lawyer, a philosopher teen, in this vibrant family, dreams are seen.
You know, it's funny. I didn't realize Hadley was doing the poems before everyone. And this is like, I'm an artist who's been put out of business by ChatGPT, because, I don't know, 13 years ago or something, I spoke at the R/Finance conference here in Chicago, because that's run out of here in Chicago, and I did a lightning talk.
And my gag was, I did it in Seussian rhyme, and it was hilarious, right? Because nobody comes to a freaking finance conference and gives a whole lightning talk in Seussian rhyme, and now anyone can do it, and they hardly have to work at all. Ah, so frustrating.
Well, I just want to say hi to you all. I love being here. It's exciting to be back in Chicago. I helped start the Chicago R User Group many years ago, and it started as, well, we had some guys (they were all guys at the time), R/Finance guys and myself, and we were going to Jake's Tap and talking about R, right?
Fantastically good times, and we decided we should just invite other people, right? We're having good conversation, and what we'll do is we'll do what we would like to participate in, but we'll include other people. And it's so lovely to see that the Chicago R User Group is still, you know, alive and well and kind of following that ethos of, you know, share the things you're doing and include other people, and that's really kind of an R community ethos and also, you know, an RStudio slash Posit ethos, and it makes for a fantastic community.
So speaking of that, like four years ago, I showed up at the RStudio conference at the time and gave a presentation, and I wore, you know, a shirt with this pattern on it because I tried to explain to people what I do for a living, and this was the best way to explain it because I'm an agricultural economist who works for a global reinsurance company, so it's easier just to say spreadsheets and bullshit.
So, disclaimer: I do work for a financial services firm, RenaissanceRe, and I am not my employer, nor am I representing them. I'm not discussing reinsurance here, and all the ideas here are mine unless otherwise stated. This isn't business, this is pleasure.
I was trying to think how long I've known Hadley and JJ, and I knew them separately before they were working together, and I stumbled on this. Back in 2010, for Hadley Wickham's birthday, so almost 13 years ago last week, I sent Hadley a copy of Generalized Additive Models: An Introduction with R, which had been on his Amazon wish list, because he was a poor, starving faculty member at Rice University, and he was maintaining open source software, and not only that, he was answering my stupid questions on Stack Overflow.
And I wrote in the gift note, thanks for helping me kick ass with plyr (that's the precursor to dplyr) in R; I appreciate the tools and the help you've given me on Stack Overflow. And this is the kind of community we have, right? And if anybody is wondering how you become a keynote speaker at Posit, the price is $68.78 and 13 years.
Abstractions and leaky abstractions
I had no idea when I started forming this presentation that Jeremy was going to be one of the keynote speakers. But I really enjoyed listening to JJ and Jeremy do this thing they called a two-way AMA, which, I didn't know that's what it's called. I've been calling it having a conversation, and I'm just not really cool, right?
And one of the things that jumped out is they made the comment that if you think about how you help both new users ramp into things and make experienced users productive, you provide these abstractions, and there's a dial for how leaky you want the abstraction to be. Now, a bunch of us who've been around software, maybe worked in software engineering, or at least talked to software engineers, know this idea of abstractions and leaky abstractions.
In this community, a bunch of us come through clinical sciences or we come through other fields that aren't computation first, and I was thinking this term is really powerful, both abstractions and the idea of an abstraction leaking. These are really important concepts that I think we should more widely ingest.
So when Hadley contacted me and said, hey, you want a keynote? I'm like, yeah, what's themes or whatever? And he's basically like, I don't know, you use Python and R, maybe like some, I don't know, do whatever you're thinking about, right? Well, it just happened, I was thinking about this, and this has been one of the things I've been thinking about with my team and the people I work with, is how do we talk explicitly about abstractions, leaky abstractions, and how we deal with those leaks?
So let's talk a little bit about abstractions and leaky abstractions. The first thing I did was go back; I thought I knew where this came from, and I confirmed that while the term leaky abstraction was around in the zeitgeist, it really didn't get traction in the tech community until Joel Spolsky wrote this blog post over 20 years ago. He calls it the Law of Leaky Abstractions, and the law is: all non-trivial abstractions, to some degree, are leaky.
And what he means by that is abstractions fail; sometimes a little, sometimes a lot, there's leakage, things go wrong, and it happens all over the place when you have abstractions, right? So it means you have this thing you're relating to, it's abstracted away so you have an interface, an API, a call to a function, and you interact with it in a way and it doesn't do what you expect. That's a leak.
So the question becomes, what do you do? If you want to be a master of any abstraction, not just a user, but a master of the abstraction, you have to understand what's under the abstraction. That's the only way you can truly debug or truly understand an abstraction: understand at least one layer beyond.
So when we think about abstraction, I want to expand what I mean, because many of you are probably thinking something like this, and I just Googled computing abstraction, right? At the high level, you have a high-level language or an application, and then it goes through an assembly language program, and the assembler turns it into machine code, and then there's a bunch of hardware abstractions, and then an actual calculation gets done inside the hardware of the machine. That's how we often think about abstractions.
I want to expand that for the purpose of this conversation, because there's also organizational abstractions. So if you think of an abstraction as we've got some set of directives we're passing down, and we kind of don't care specifically how things get done in the next layer, we just want the next layer to do something, well that's not unlike organizational structures.
I come from corporate America, right? But this general concept applies to your nonprofit or your software company or even your civic organization. At some level, we have a board of directors or some committee, and they pass down directives and priorities to the executive management. The executive management makes a bunch of choices and then passes things down to department heads, who pass them to team leads, who pass them to team contributors, and if we continue to think about this, they then pass those on to computers in some way, right? They pass them in because they use applications or high-level coding languages or something, and then all that other stuff from the slide before happens underneath this. It's abstractions all the way down, right? There's the title of my talk.
So why though, like why do we need these abstractions all the way down? Wouldn't it be easier if we just had, like I grew up on a farm, and the great thing about farming is you have to do everything because you don't have staff. The worst thing about farming is you have to do everything because you don't have staff, right?
So you become a master of every level of abstraction until you get in over your head and you have to have John Deere repair some piece of machinery or something, right? All the way up and down the levels of abstraction, you become at least proficient. It's been hard for me to get used to organizations that didn't expect me to run up and down the stack, right? Because I want to run up and down all the abstractions.
Well, the reason we can't always do that was really articulated back in the 50s by Herbert Simon. He's an economist, I'm an economist, so I've got to get economists in here. He wrote Administrative Behavior: A Study of Decision-Making Processes in Administrative Organizations, and he coined the phrase bounded rationality, which I TLDR as head trunk only hold so much junk, which Gary Larson captured in this cartoon where it says, Mr. Osborne, may I be excused? My brain is full.
We can only handle so many levels of abstraction and so many pieces of the stack before our brain overflows, and we can't make sense of all the pieces. And so we build these interfaces, and even if we are the person traversing the levels of abstraction, we would like to interface with different pieces of them and not think about what happens below them, even if we wrote what's below them, because it means when we're problem-solving here, we don't have to think about how the read-write is happening on the database. That just magically happens behind an abstract interface, and we don't have to think about it, and it allows us to work at the problem-solving level that's appropriate for what we're trying to accomplish.
What abstractions are and are not for
So let's talk a little bit about what abstractions are for and what they are not for. So one thing I want to point out that they're not for is they're not for gatekeeping, and I see this being done a lot, right? You're not a real data scientist unless you, you know, PyTorch or deep learning or whatever. Those are all different abstractions, different tools, that are used in certain places to solve certain problems. Those may not be your problems. Then you don't need to know that abstraction. You don't need to be a master of that abstraction.
You don't need to be one layer below, and I watch a lot of early learners run around learning abstractions, learning tools, because they feel like if they don't know this tool, they're not a real whatever-it-is-they-think-they-want-to-be. That's really toxic, because you'll wear yourself out because this guy has already said you can't fit it all in your head in a really useful way. So don't let the learning of abstractions be like, oh, once I accumulate a big enough toolbox of these abstractions, then I'm a real whatever. That's just gatekeeping, and cut that out.
You don't even have to know all the abstractions you use deeply. But you do need to know your limits. So know which abstractions you really understand. Recognize when you're up against an abstraction that you don't grok, so you don't understand the abstraction. It's a breakpoint. You're like, I don't really understand what's going on beyond here. At that point, you have a choice. You can either learn that abstraction, learn what's really going on beyond it, so that you can deeply understand it, or you can partner with someone who's an expert there.
Partnership and pairing with someone else and working with someone else is always an option. It may be harder in some organizations than others, especially if you're the only person on the data science team. You may feel some pressure to learn those abstractions, and that may be the right choice. But if you're in a larger organization, and someone else in the organization is a master of that abstraction, you may not need to become the expert on database indexing.
Now, what I see happen a lot is people blame an abstraction for problems when they bump up against it. And often the problem is, they don't understand what the abstraction is doing. Now, that may be a leaky abstraction, but still, it's like, okay, my dashboard doesn't refresh fast enough. My database has a problem.
I had literally this one within the last year. I worked with a Power BI developer, and we discovered that when using DirectQuery in Power BI, which is how you access any database that isn't the one built into Power BI, it issues all the queries in series and will not issue them in parallel. So the dashboard had, you know, 13 queries that each took three seconds. They could have been run in parallel, and the whole thing would have refreshed in three seconds. Instead, it took 13 times 3, about 40 seconds, because it was running them in serial.
And the analyst had thought, oh, this database is crap. And I'm like, actually, Microsoft is shunting you into buying more cloud storage so you can shove your data into their platform instead of letting you use your own perfectly good database because they force your connection to issue queries in serial. And I'm like, that's broken, right? That's not a leaky abstraction by accident. That's a leaky abstraction by sales, right? And that should make anybody that runs into that one angry.
But that's an example of don't blame the abstraction. Understand the abstraction. Understand what's going on. And then decide what you want to do with that information.
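The serial-versus-parallel arithmetic he's describing is easy to see in a few lines of Python. This is a simulation of the dispatch pattern, not Power BI's actual connector: each simulated "query" just sleeps, so serial dispatch costs the sum of the query times while concurrent dispatch costs roughly the slowest one.

```python
# Simulation (not Power BI): why issuing identical queries in series hurts.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(_):
    time.sleep(0.05)  # stand-in for a 3-second dashboard query
    return "rows"

queries = range(5)

start = time.perf_counter()
serial_results = [run_query(q) for q in queries]  # one after another
serial_elapsed = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor() as pool:                # all at once
    parallel_results = list(pool.map(run_query, queries))
parallel_elapsed = time.perf_counter() - start

print(f"serial:   {serial_elapsed:.2f}s")   # roughly 5 x 0.05s
print(f"parallel: {parallel_elapsed:.2f}s")  # roughly one query's time
```

With 13 three-second queries, the same pattern is the difference between a three-second refresh and a forty-second one.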
If I do anything where I dabble with, like, computer science-y concepts and I don't quote Edsger Dijkstra, I would probably be remiss. He has this great quote: Programming, when stripped of all its circumstantial irrelevancies, boils down to no more and no less than very effective thinking so as to avoid unmastered complexity, to very vigorous separation of your many different concerns. Right? And the TLDR is: constrain complexity and separate concerns. That's what we're trying to do with abstractions.
Floating-point math: a leaky abstraction
So, let's do a fun example about thinking about abstractions. Electricity is a fundamental abstraction for computing, obviously, right? We know this is all powered by electricity. We know, like, we've got gigabit switches in our office, so that's a billion bits of information a second.
So, just think about this for a minute. How fast do the electrons flow in our wires? Yeah, no, it's 8 centimeters an hour at 1 watt. What? Right, okay, who was surprised by that number? Like, it kind of defied your intuition. The problem is we conflate electromagnetic fields with electron movement, so the electromagnetic field moves really fast. The actual electrons move quite slow.
But why do many of us not know this, right? Because we use electronics en masse every day to solve all our problems. Well, the reason we don't know this is it doesn't matter. It doesn't change anything. There's almost nothing you'll ever do where the movement of the electrons matters. That abstraction never leaks.
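His centimeters-per-hour figure passes a sanity check. Electron drift velocity is v = I / (n · A · q); plugging in assumed values (a copper wire of 2 mm² cross-section carrying 1 A — my assumptions, not the slide's) lands in the same ballpark:

```python
# Back-of-envelope electron drift velocity: v = I / (n * A * q).
# Assumed values (not from the talk): copper household wire at 1 A.
I = 1.0          # current, amperes
n = 8.5e28       # free electrons per cubic meter in copper
A = 2.0e-6       # cross-sectional area in m^2 (a 2 mm^2 wire)
q = 1.602e-19    # electron charge, coulombs

v = I / (n * A * q)               # meters per second
v_cm_per_hour = v * 100 * 3600    # convert to cm/hour

print(f"{v_cm_per_hour:.1f} cm/hour")  # ~13 cm/hour: same order as the slide
```

Different wire gauges and currents shift the number, but never out of the snail's-pace regime; it's the electromagnetic field, not the electrons, that moves near light speed.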
So, buckle up. Let's talk about floating-point math. This is my favorite leaky abstraction. And by the way, I'm only going to have two slides with code, and they're both Python because we've got a big community. We're a very open tent.
You know, if you took grade school math, and everyone in here did, at some point the teacher, if they were very creative, did a pretty slide like this one to explain the associative property of addition: it doesn't matter how you group the things you're adding, you get the same answer. We learn this very young. And this is how addition and multiplication work.
But then we come into Python, and we go, okay, well, we've got these numbers. They're real numbers. They're on the number line, right? We're going to take 1.11, add 2.22, and then we're going to add 3.33, and we're going to group them a little differently, and the results are not the same. And the reason they are not the same is that floating-point addition and multiplication are not associative.
And the reason why is illustrated when I format the print on these to show 17 decimal places. Because way out there in it-doesn't-matter-practically-to-you land, most of the time, out there in the 14th, 15th, 16th, 17th decimal place, you can see these numbers are just a little bit different. And that's because floating-point numbers are an abstraction. They're not literal points on the number line like we were taught in grade school. They're something different, to make numbers work in computers.
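The non-associativity he's showing can be reproduced in a couple of lines. I'm using the classic 0.1/0.2/0.3 values rather than the 1.11/2.22/3.33 from his slide, since the exact printed digits depend on the values chosen:

```python
# Floating-point addition is not associative: grouping changes the answer.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(f"{left:.17f}")   # 0.60000000000000009
print(f"{right:.17f}")  # 0.59999999999999998
print(left == right)    # False
```

Both results are "0.6" for any practical purpose, and yet they are different doubles, which is exactly the kind of leak that shows up only when you compare or sort them.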
And I'll tell you a very quick story about how I spent two days with an engineer last week, and it turned out to be a floating-point math problem. But it didn't come at us as just value A not equaling value B. What we were doing was a bunch of math, and then we rank it. And after we rank it, we then say, okay, I want the row number, and we use a second sorting column to say, basically, if there are ties, use this other value. Use this deterministic number over here to break ties.
Because we want all of our systems to be deterministic, meaning there's no random number generation here; every time I run it, I want it to barf out exactly the same answer. No swapping values around, right? This is huge. It ingests like 5 billion records and barfs out 11 million after a bunch of aggregations. And we were finding we weren't getting deterministic answers. Now, out of 11 million, maybe we would have 30 that would change every run.
And we were like, that's not cool. What's going on? And as we dug into it, what we found is that some of the conditions that should have been ties were like the situation I was just showing you, where it's conceptually the same number, but they were flipping. And the reason it was happening is we were on Spark. Spark's a distributed system. You don't control which executors get the data or how the groupings are done, because it's MapReduce-style. So every time we ran it, maybe we ran it with a different cluster size, maybe one machine got the data first, little things changed, and so when we aggregated these up, we'd get slightly different floating-point rounding because of the way the values were grouped.
And it became material because of that sort-order problem. We thought we were handling ties with numbers, but we weren't. So, I mean, that was easy to solve, right? We just round those to some smaller number of digits, you know, or we cast them into something with less precision, so we get rid of that noise out in floating-point land. But we, well, more than two of us were involved, spent like four days trying to figure out why the hell these numbers kept changing, right? Why is there a ghost in my machine?
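A hypothetical miniature of that bug (not their actual Spark job) fits in a dozen lines: the same values summed under two groupings, as a distributed engine might partition them, give floats that differ only in the last bits, and a rank over those floats flips. Rounding away the meaningless precision before ranking, then tie-breaking on a deterministic key, restores stable output:

```python
# Toy version of the Spark determinism bug: grouping-dependent sums.
values = [0.1, 0.2, 0.3]

run1 = (values[0] + values[1]) + values[2]  # one partition grouping
run2 = values[0] + (values[1] + values[2])  # a different grouping

print(run1 == run2)  # False: a "tie" that isn't one

# Rank records by score: "b" sorts ahead of "a" purely because of
# floating-point noise, so the order depends on how each run grouped.
rows = [("a", run1), ("b", run2)]
print(sorted(rows, key=lambda r: r[1]))

# The fix from the talk: round away the trailing noise before ranking,
# then break the now-genuine tie with a deterministic secondary key.
fixed = [(name, round(score, 12)) for name, score in rows]
print(sorted(fixed, key=lambda r: (r[1], r[0])))  # stable across runs
```

In the real job the rounding happened before the rank step in Spark, but the principle is the same: never let bits 14 through 17 decide your sort order.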
Organizational silos as leaky abstractions
I'm an agricultural economist. Let's talk about silos. So, Cara mentioned silos in organizations, right? So, using our mindset of an abstraction, let's think about silos for a minute. If organizations are abstractions, what's the abstraction equivalent of an organizational silo? I would say it's an over-restrictive abstraction, and it's an abstraction that, when it leaks, you can't figure out why.
So, let's think about what that org abstraction leakage looks like. Where did this number in the DB, in the database, come from? If your organization is highly siloed, you may not be able to answer that question if your team didn't calculate it. That's an overly rigid abstraction. Who do I talk to about solving this problem, whatever that may be? If you can't answer that, you've got a siloed organization, right? That's an overly rigid abstraction.
So, I think of a bunch of this as really, in an organization, it's a communication problem, and communication is like beer. Communication is the source of and solution to all our problems.
So, you know, if you run into where did this number in the DB come from, there are ways we can improve communication in our organization to answer it, right? We can have the code for the ETL or whatever in version control, so Git, GitHub, whatever version control you use, and available to the people in the organization who consume the values, right? And that's really important, because a lot of organizations do not share the ETL process for how they do things with the people who are consuming those things (read-only access is fine). That's insanity!
Similarly, the question, who do I talk to about solving this? Well, if everything in the organization has clear ownership, that one gets resolved really fast, because you talk to the owner of that process or that system. But we often let ownership slip and things get orphaned, and then they just leak, like, all over the floor, because there's no person whose job it is to explain why it's leaking or help stop the leak or help you understand what the system is doing.
Alright, with that said, that's a little bit about organizational abstraction leakage and how we might address it. I want to do a little history, and it wouldn't be a data science conference with an old man at the front if we didn't put up Drew Conway's Venn diagram from 2010.
Now, funny story, I knew Drew when he was a PhD student at NYU, and I saw him, like, draw this out on a napkin, and it, like, completely blew my brain when five years later, like, the actuarial magazines on my actuarial department desk at work, like, had this diagram on the cover. I'm like, this is so weird. So, sometimes I feel like I'm the Forrest Gump of data science.
So, what's interesting about this, and the reason I want to bring it up, is that the thing that was so revolutionary about it in 2010, inside of organizations, is these are three different abstractions. And one person not only wasn't expected to span all of them; in many organizations, one person was not allowed to. I can remember many times in the early 2000s being told I couldn't have coding tools on my machine at work because I was not a software developer.
And what has happened since then, in only, you know, 13 years, is that it's now totally acceptable almost everywhere, except a few backwards government organizations, for people who are doing data science-y work in the business, doing analytical work, to have first-class programming tools on their machines. They have rights and permissions to use those, and often to install packages, even if that's from a curated internal repo.
I will assert that the single biggest business value derived from the data science movement in the last 13 years is making it legitimate to code outside of IT software development roles. Now, that's a pretty big assertion, right? A lot of value has come out of data science. But inside of these big calcified organizations, executives were seeing Drew Conway's diagram in magazines and saying, well, how do we get that? And I'd look over their shoulder and say, stop being stupid about the rules of what we put on our desktops, or give us access to coding tools through JupyterLab, or through RStudio that's centrally hosted.
So my point there is that these data science roles break previous organizational abstractions, right? I'll make a second assertion, and it's tied into what we talked about earlier. Abstractions will leak. Spolsky told us that. Therefore, abstractions must be permeable to allow debugging. So that's my thing of, put your ETL code where everyone who might be using the results can read it. That allows us to at least peek through the abstraction and see what we're getting.
The 80-16-4 framework
My assertion three: no single abstraction is right for everyone. I think we're going to need more abstractions. Now, we have found something in our organization, and this is the part where I'm going to share some things that aren't just my thoughts. My colleagues John Moore and Peter Ilston came up with what we call the 80-16-4. It really started as the 80-20 rule, and then we realized we needed to divide the 20 into two buckets, so we applied the 80-20 rule to the 20, and we got 80-16-4, because math is fun.
So the way we think about everything we do is that 80% of the users in our organization are normal business users. Now, I'm going to present all this from a for-profit corporate perspective, but it should project into your world, and even though I've got numbers up here, don't be obsessed with the exact values. This is conceptual. The vast majority of your folks are normal business users. They want to use applications, dashboards, basic Excel. That's their tool stack. Those are their abstractions.
And then 16% or so are super users. Super users are going to want to do SQL against the underlying data that's collected by your tools. They may want to do a custom dashboard. They're going to do some advanced Excel, like pivot tables, and really processing the data a bit more. And then 4% are guru users. Your guru users are like, oh, I want to do that, but I want to do it 10,000 times, so can I hit the API and just do that directly, pull a result back, make a change, put it in, run your thing, pull it back? You know, that's a very different use case. Or maybe they want a library, a Python library, an R library that interacts with your corporate tooling.
Very different abstractions needed by these three different groups. And they're cumulative, right? So the way I think about it is your super users, they will use the dashboards and the basic Excel, but they also want these other things. Your guru users are, of course, still using SQL, right? And they're still using your applications. They're using everything.
This has been incredibly helpful for us, because what I had observed in our organization and in others is a tendency to, like, scratch the 80% itch and then kind of stop, and make it hard to get the underlying data. There had been times in our culture when that wasn't the case, and then as we started specializing teams, some of this 16 and 4 was getting dropped sometimes. So, you know, we've changed the definition of done on the things we build, saying we've got to make sure people from all three of these groups agree that they have what they need.
I really like the current movement for data products. A data product meaning, like, it's defined, it's got a data dictionary, it's not necessarily tied into the whole corporate database; it's a standalone data product, all the fields are defined, and in our shop we have a link to a wiki page and a link to the source code that builds that data product, available for everyone. I love that.
Like, even though when I first heard data product, I thought it was, like, more marketing BS, and I just wasn't excited about it, when I saw what I got from it, I'm like, oh yeah, data product, I want that. Because a whole bunch of my 16 and 4's needs are addressed by having a good data product, right? Because it's a good abstraction that we can peek behind, because of the documentation and the links to the source code.
So, like, the point there is that multiple abstractions is a high-empathy move, right? I'm here with my practice radical empathy shirt on. That's high empathy for who's actually using it. Building APIs blindly because Gartner says that's a good idea is not a high-empathy move, right? That's just parroting what someone you believe to be a thought leader says. Don't do that. Think. Right? Think about what people actually need, and build what they actually need. And more importantly, don't build stuff they don't need.
Big idea recap
And that's hugely empathetic. So, my big idea recap: abstractions start way up with leaders and go all the way down to hardware. It's abstractions all the way up and down. To debug an abstraction, you have to see what's below it. We're building mental models of complex systems, and that's why we need abstractions: because we can't hold it all in our heads at once.
