Resources

Data Science at the Command Line and Polars | Jeroen Janssens | Data Science Hangout

To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you! We were recently joined by Jeroen Janssens, Senior Developer Relations Engineer at Posit, to chat about his career journey from machine learning to developer relations, the advantages of using the command line for data science, his books "Data Science at the Command Line" and "Python Polars", and advice for aspiring DevRel professionals. In this Hangout, we explore the benefits of working on the command line versus not. Jeroen explained that while the initial command line interface might seem stark, it offers a very different and powerful way to interact with your computer. The Unix command line is ubiquitous across various systems, from Raspberry Pis to supercomputers. Its strength lies in the ability to connect tools together through standard output and input, allowing for quick and iterative solutions by combining specialized tools. This fosters an interactive nature with a short feedback loop and provides closer interaction with the file system, making ad hoc data exploration efficient. 
Resources mentioned in the video and zoom chat:
Jeroen's LinkedIn → https://www.linkedin.com/in/jeroenjanssens/
Data Science at the Command Line → https://jeroenjanssens.com/dsatcl/
Python Polars: The Definitive Guide → https://polarsguide.com/
Plotnine → https://plotnine.org/
Winner of the 2024 plotnine Plotting Contest → https://posit.co/blog/winner-of-the-2024-plotnine-plotting-contest/
Talk about plotnine → https://www.youtube.com/watch?v=xdD8r84sqYY
R for Data Science → https://r4ds.had.co.nz/
Jeroen's plotnine translation of R for Data Science → https://jeroenjanssens.com/plotnine/
froggeR package → https://azimuth-project.tech/froggeR/
Reticulate → https://rstudio.github.io/reticulate/
Install Windows Subsystem for Linux (WSL) → https://learn.microsoft.com/en-us/windows/wsl/install
UTM for macOS (Virtualization) → https://mac.getutm.app
fish shell → https://fishshell.com/
Quartodoc → https://github.com/machow/quartodoc
Focusmate (Accountability Partner Tool) → https://www.focusmate.com/
Surface Area of Luck → https://modelthinkers.com/mental-model/surface-area-of-luck
CRAN R Extensions Manual → https://cran.r-project.org/doc/manuals/r-release/R-exts.html

If you didn’t join live, one great thing you missed from the zoom chat was people sharing their varied experiences with the command line, with many admitting they primarily use it for basic navigation or only when necessary, and some sharing helpful tools and tips for those less familiar. Let us know below if you’d like to hear more about this topic!

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for hanging out with us!

Mar 20, 2025
57 min

Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout, everybody. If we have not met, I'm Libby. I'm a Data Science Community Manager here at Posit. And I'm also a really passionate data science educator. I have a background in business statistics and experience teaching both Python and R. And I'm based in San Antonio, Texas, where right now it is warming up, turning into spring, flowers are blooming.

If you're not familiar with Posit, Posit builds enterprise solutions and open source tools for folks who do data science in R and Python like me and probably maybe like you. We're also the company formerly called RStudio. So if you've used RStudio, then you've definitely used Posit. And if you've used something like Shiny or the Tidyverse, then you have used a Posit ecosystem package.

The Hangout is our open space to hear what's going on in the world of data across different industries, because we are all from different places. It's kind of the magic of the Hangout. And we get to chat about data science practices, data science leadership, and connect with other people who are sort of in the same spaces we are sharing the same experiences that we are. We get together every Thursday, same time, same place here on Zoom.

On Slido, you can put your name on your question. You can also leave it anonymous. There's a little space to put your name. And then also in Slido, you can upvote questions. So if you see one that someone else asked, and you're like, oh, that's a really great one, you can vote for it there. And it will pop to the top, bubble to the top of our little screen here in Slido.

Introducing Jeroen Janssens

I am so excited to introduce our co-host today, our Feature Leader, Jeroen Janssens. He is a Senior Developer Relations Engineer at Posit. If you would introduce yourself, tell us a little bit about what you do, your background, and something you like to do for fun.

Hey, thanks for having me. And yes, my official title is indeed a Senior Developer Relations Engineer. And that's quite a mouthful. I've been practicing these past few weeks getting it right. So I am based out of Rotterdam, the Netherlands, where I live with my wife and two kids. My background is in machine learning and artificial intelligence. I studied that a long time ago. It was nothing like the AI that we see today. But as I was finishing up my PhD, and I think we're now talking about the end of 2011, I got the opportunity to work at a startup in New York City as a data scientist.

This was a term back in the day, it was brand new, wasn't even known on our side of the pond in the Netherlands. But hey, it seemed to align well with what I did during my PhD. So I moved there and eventually worked there at three different startups over the course of two and a half years. And during that time, I also got the opportunity to write a book, my first book, Data Science at the Command Line. And when I came back, I was invited to give a workshop about this wonderful topic, how to process data using the Unix command line. And I noticed that I enjoyed this a lot.

So I started doing this more and more often, also about related topics related to R and Python and things like machine learning, data visualization, until I decided like, hey, let's start a company around this. So that's when I founded Data Science Workshops, which I've had for seven years. Now at a certain time, the pandemic hit. And, you know, teaching was no longer as fun as it used to be. I struggled a lot with getting new clients. So I decided to change things up a little bit and become an employee again. This is when I joined Xomnia, a small data consultancy based out of Amsterdam. This is where I got to know Polars, which we can talk more about later.

And yeah, and a month and a half ago, I joined Posit. So I believe we joined around the same time, right? I think we might have joined on the same day, actually, as part-time employees. Jeroen and I are Posit twins.

Fun, fun. Yes. I remember. Because, of course, the past year and a half, I've been spending almost all my free time on my new book, Python Polars. But I really enjoy woodworking and doing things around the house. Every now and then I play some Zelda. Those are things that I, yeah. So I just started Tears of the Kingdom. And wow.

Writing Data Science at the Command Line

I wanted to clarify Data Science at the Command Line. That one's free to read, right? Yes, it is. At a certain point, I asked O'Reilly whether I could put it online. And I was actually inspired by R for Data Science by Hadley Wickham and Garrett Grolemund. They had done this. And I felt like, wow, isn't that great? And at a certain point, so the first edition I was allowed to put online. And then the second edition was written out in the open from the start, actually.

What motivated you to write your first book? I'm curious because you said that you got to give a talk about it. But then what's the process of actually pitching a book and showing the sort of social proof that you have enough people looking at you to publish a book?

Yeah. So for Data Science at the Command Line, this is how it started. I was invited to give a talk at the New York Open Statistical Programming Meetup hosted by Jared Lander. He's been doing this meetup for the past, what, 15 years. And he said, like, hey, would you like to give a talk? Like, what am I going to talk about? And I noticed that I, in contrast to my colleagues, I was using the Unix command line a lot in order to do various tasks for getting data, cleaning data, and whatnot. And I thought, like, hey, I can give a talk about this. And as I was preparing for this talk, I was like, hey, why don't I go ahead and write a blog post about this, right? Two birds, one stone.

And this blog post, this is where it all started. This came out in September 2013. It got to the top position at Hacker News, stayed there for at least 24 hours, hundreds of comments. I'm like, okay, this means something, right? This is definitely, there's definitely more here. So that's when I was put in touch with someone at O'Reilly, expanded, well, thought a little bit more, okay, if we're going to expand that blog post into an entire book, what would that look like? Wrote a proposal, and yeah, that got accepted. And then 10 months later, the book came out.

Yeah, yeah, it was quite fast. I was, I just finished writing my PhD thesis. So I was still having this momentum, waking up at 5am, writing for two hours and then go to my day job, which of course my wife didn't like pretty much. She thought I was done after writing the thesis, but no, here I am, I have to write another book. So yeah, that was quite quick. And I unfortunately, the second edition took longer, took a year. And well, and now we just finished Python Polars. That took over a year and a half.

The motivation behind Python Polars

The motivation for that book was a little bit different. When I joined Xomnia a little over two years ago, I saw this guy working behind his laptop. He was intensely focused, while all the others... this was a Friday, and that's when everybody at Xomnia comes together. Everybody was already having fun, but he was laser focused on his screen. Turned out that this was Ritchie Vink, the person who created Polars. So he used to be my colleague, and also Thijs, he used to be my colleague at Xomnia. So this is where I first learned about Polars.

And Thijs and I, we were actually working at the same client, in the same team, and there we had the task to speed up their pipeline. So it was a very big code base, there was R involved, and lots of Pandas code, and we had to speed that up. And then I felt like, hey, why don't we use Polars for this? Because I already noticed that Polars was pretty fast. You could say blazingly fast, as they often say.

So as we were getting to know Polars better, Thijs and I were like, hey, wait a minute, this is so cool, this package is so not only fast, but also a joy to work with, thanks to its API, this deserves a book. And I was having a meeting with someone at O'Reilly anyway, so then I thought like, hey, let's just test the waters, see if there's any interest in a book about Polars.

So this was over a year and a half ago, and O'Reilly hadn't yet heard about Polars. Well, they had already received four proposals for a Polars book, which all got rejected. But they weren't, you know, they weren't really convinced that Polars was going to be a big thing and deserved their attention.

Now they realize that they were wrong. And, well, they're very happy that we're finally finished with the book. So, but yeah, Thijs and I, we wrote a 20-page proposal in order to convince O'Reilly, and at that point, they could only say yes. And that's when this endeavor began.

Benefits of working on the command line

What is the benefit of working on the command line versus not working in the command line?

Yeah, so the Unix command line, when you first open it, when you first see it, you're greeted with this, well, usually a black box, and this prompt and the cursor that's blinking at you. And it's not very inviting, right? It's a stark and harsh environment. But once you get to know it a little bit, right, once you get to know a command or two, you'll realize that this is a very different way of interacting with your computer, right?

You don't have to use the Unix command line. I keep saying Unix command line, but I've been on macOS for the past 15 years. Now, macOS is, of course, derived from Unix, or you could say it's still a Unix. But, and even on Windows, there are various ways where you can still get the same command line. So it's important to make that distinction. I'm not talking about the DOS command prompt or the Windows PowerShell.

The Unix command line, it's, well, first of all, it's everywhere. It doesn't matter whether you're logging into your Raspberry Pi, your router, or some supercomputer. Chances are is that it's running Linux, and that you then have a command line available to you. So if you know how to work with the command line, that opens up, yeah, a lot of different ways, or a lot of possibilities, I should say. From that command line, you can invoke other tools, right? And it doesn't matter in which language that tool is written in, whether it's Python or C or JavaScript or R or Go or Rust, the command line doesn't care.

But then here's what really makes the command line shine, is that it has this mechanism of connecting these tools together. So whenever a command line tool produces output, right, it writes to standard output, then you can pipe that to another tool, yeah, which then, assuming it would then read from standard input. And so you can imagine that if you have a collection of command line tools that each do one thing and do it really well, right, that's also part of the Unix philosophy, then you can stitch those together and very quickly and iteratively come to a solution.
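The piping mechanism described here can be sketched in Python. This is a minimal illustration, not something shown in the Hangout: the two "tools" are hypothetical Python one-liners (a filter that keeps lines containing "error" and a line counter), chosen only so the sketch does not depend on any particular Unix utility being installed. The point is that the standard output of the first process is connected directly to the standard input of the second.

```python
import subprocess
import sys

# Two tiny stand-in "tools" (hypothetical, for illustration only).
# KEEP_ERRORS passes through lines containing "error"; COUNT_LINES counts lines.
KEEP_ERRORS = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if 'error' in line:\n"
    "        sys.stdout.write(line)\n"
)
COUNT_LINES = "import sys\nprint(sum(1 for _ in sys.stdin))\n"


def run_pipeline(text: str) -> str:
    """Feed text to the first tool and pipe its output into the second."""
    first = subprocess.Popen(
        [sys.executable, "-c", KEEP_ERRORS],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # The second tool reads from the first tool's standard output.
    second = subprocess.Popen(
        [sys.executable, "-c", COUNT_LINES],
        stdin=first.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close the parent's copy so only the second tool holds the read end.
    first.stdout.close()
    first.stdin.write(text)
    first.stdin.close()  # signals end of input to the first tool
    output, _ = second.communicate()
    first.wait()
    return output


if __name__ == "__main__":
    print(run_pipeline("ok\nerror: disk\nok\nerror: net\n"), end="")
```

Neither tool knows about the other; each only reads standard input and writes standard output, which is what makes them easy to recombine.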

So it really stimulates this interactive nature. You have a very short feedback loop, right, which is kind of similar if you're working from the Python console or the R console. But then now the command line is very, very, and as you can notice, I can talk about this all day, it's so much closer to the file system. So interacting with files and directories becomes a lot quicker. And that's, so when you're in this ad hoc mode, you're just starting, you know, with this new data set that you want to get to know better. That's where I often use the command line.

Can I add two things to this? If you're really worried of ruining your system, right? If you're worried that you're going to break anything from your laptop, there is the possibility of doing this in a virtual environment or a Docker container. So that might help, right? Mentally speaking, right? And I've also noticed since the command line has been around for so long, over 50 years, your favorite large language model is probably pretty good at helping you either producing a command, or if you have a command that you're not exactly sure what it does, interpret it for you. So yeah, that might be helpful too.

Focus time and writing tips

When you were talking about how quickly you were writing your books, I was thinking like, there's probably so many things that a lot of us have in the back of our mind of things we want to do, or maybe books we'd want to write. And I was wondering, like, what kind of focus time rules do you need for yourself? Or do you have any focus time tips for us?

Right, yeah, that's a tricky one. Because what may work for me may not work for someone else. And I also know that some things that have worked for me only worked for a short amount of time. So I had this period, I believe it was half a year, where I woke up every morning at 5am. I had someone else, an accountability buddy, he was working on something completely unrelated. But we were both in a Zoom call, sharing our screens, and then just working. And having a fixed period really worked well for me, especially with having two kids. And especially when no one is yet awake.

So when it comes to time that has worked for me, I was lucky enough that I, you know, had the occasional weekend day to work on a chapter, especially when I was behind some deadline. Deadlines definitely are your friend. If you are taking on writing a book or a similar endeavor, one thing that I can recommend is that you set deadlines for yourself.

What is DevRel?

What is DevRel? I think DevRel means something different at every company. I think if you, similar to data science, right? If you ask 12 people, you get 13 definitions back. And so DevRel for me at Posit, it means a couple of things. It means that I get to advocate Posit's polyglot data science mission. I very often create documentation or videos or other resources. I contribute to open source, whether that's to packages being developed within Posit, like plotnine, or the broader PyData ecosystem. I get to interact with the community and, you know, every now and then give a talk or host a workshop. So that's how I view developer relations.

You're sort of in between users and developers. But now, since I also want to interact with the larger PyData ecosystem, I'm also in between Posit's developers and other developers, package maintainers. So, yeah. I think the blessing and the curse of this role is that there's just so much to do.

Getting noticed and building a career in DevRel

What are some of your biggest tips for applying and getting noticed in this space or just establishing yourself in this space?

Yeah. So getting noticed. That's already the advice. You need to allow others to notice you. And one way you can do that is to work out in the open, work out loud, as we said, Libby, right? Related to this, right? And you can do this in various ways. You can post regularly on your favorite social media platform, or you can keep a personal blog, right? You can broadcast it, or you can talk to individual people. Say like, hey, I've been working on this. What do you think?

Related to this is try and see if you can already work with the people that you want as colleagues. So to join Posit, and I'll share my secret, this took close to a year of getting this job. And the reason why it took so long is there was never a formal vacancy, and I never formally applied to this. What I did instead here is at a certain point there was a contest for plotnine, right? A data visualization package being developed at Posit, and I've been a fan of plotnine for a long time, and I knew like, okay, I've got to do my best here.

I became a runner-up, which was great, and somehow I got the opportunity to talk about this at PyData Amsterdam, and what I learned later is that it's these little things that get noticed, and people start hearing your name more and more often. Now another thing that I did is I gave a workshop about plotnine at PyData New York, and what I did is I said to Michael Chow, he is the whole reason why I became interested in Posit about a year ago, he had sent me a message about something. I asked Michael, like, hey, would you like to join me on this workshop? So now it's no longer just me representing Xomnia, it's also at this workshop there is Michael representing Posit, and it's these little things that definitely, well, didn't hurt.

And so when the time came that there was actually a spot at the developer relations team, at that point, well, as I've been told, the team, well, my manager, he was actually, yeah, he knew that it had to be me. I mean, it sounds weird saying this, and believe me, I am incredibly lucky and happy the way it all turned out. And just to give you some counterexamples, I have applied many times at other companies that didn't work out.

Which is perhaps even more of an argument of taking a different route, yeah, instead of looking for vacancies, applying over and over again, driving yourself nuts.

Polyglot data science and expanding your toolkit

Yeah. So, again, it's the idea that we have all these tools out there, right? Whether they are small command line tools, or big software packages, or programming languages. I think it's important to not constrain yourself to a single tool, because every tool has its strengths, right? And being able to harness those strengths, right? To borrow some functionality from another language into the main language where you're working in, that's a superpower, right? Imagine that you are working in R, you have an entire analysis in R, but then there's this specific algorithm which is only implemented in Python. What do you do? Do you need to rewrite your entire analysis to Python? No. There are various ways in which you can, you know, combine the two. reticulate comes into play, for example. That's a package that you can use to incorporate Python into your R script. And there are lots of other ways in which you combine these things. Of course, also using the Unix command line. But I think it makes you, on a whole, more flexible.
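One way to picture borrowing functionality across tools, besides a bridge like reticulate, is to call a separate program and exchange data over standard input and output. The sketch below is an assumption-laden illustration, not something from the conversation: `EXTERNAL_TOOL` is a hypothetical stand-in for a program written in any other language, implemented here as a Python one-liner purely so the example is portable, and JSON is just one possible exchange format.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a tool written in another language: it reads a
# JSON list of numbers on standard input and writes their mean as JSON.
EXTERNAL_TOOL = [
    sys.executable,
    "-c",
    "import json, sys; xs = json.load(sys.stdin); "
    "print(json.dumps({'mean': sum(xs) / len(xs)}))",
]


def mean_via_external_tool(values):
    """Send values to the external tool and parse its JSON reply."""
    result = subprocess.run(
        EXTERNAL_TOOL,
        input=json.dumps(values),  # becomes the tool's standard input
        capture_output=True,       # collect the tool's standard output
        text=True,
        check=True,                # raise if the tool exits with an error
    )
    return json.loads(result.stdout)["mean"]


if __name__ == "__main__":
    print(mean_via_external_tool([1, 2, 3, 6]))
```

Because the main script only depends on the tool's input and output format, the tool itself could be swapped for one written in R, Go, or Rust without changing the calling code.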

I think it's really important that you have an actual problem to work on. I totally agree. I'm so glad that's your answer. Because I have done the same thing, right? Like I would never have gone towards JavaScript to try to learn JavaScript. If I did not have a problem to solve, if I didn't have to figure out how to use jQuery when I was working in Shiny, and if I didn't have to figure out how to use Google Apps Script, which is server-side JavaScript, but I had specific problems to solve and I don't think I would have gone end to end to try to add that to my toolkit otherwise.

Pivotal moments and influential people

Looking back, are there any sort of pivotal moments or influential people who really significantly shaped your path in data?

I owe a lot to my thesis supervisor, Eric Postma, who was my supervisor for not only my PhD thesis, but also my master's thesis and before that my bachelor's thesis. We worked together for eight years. So, of course, I owe a lot to him. But then there are also people that I have known for a very short amount of time, or interacted with for only a short period, that just did this one thing. For example, I was a visiting researcher at Cambridge University in the UK. And there I became friends with Ferenc Huszár. And he is the one who introduced me to this startup in New York City. And if he hadn't done that, I wouldn't have gone to New York City. I would have never met Jared Lander, who invited me to speak at his meetup. I would have never even dreamt of the idea of writing a book.

Michael Chow is yet another one. If he hadn't sent me that message a year ago to discuss a blog post about plotnine, we probably wouldn't be colleagues at this moment. So, yeah, yeah, a lot of different people.

I feel like, too, I mean, I think this goes back to what you said earlier about putting your work, like, working in the public, that, like, on the one hand, it's kind of serendipity that I reached out. On the other hand, it's not because you put out so much public work. Like, I found you'd translated part of R for DS, R for Data Science, to plotnine. So, all the visualization to use plotnine and Python. And that's, like, an incredible bit of public work that just, like, I had to reach out. So, I think serendipity on both ends, like, you're putting out public work, was just really inspirational.

Thanks, Michael. And I think there's a term for this. It's your luck surface area that you have to increase.

What Jeroen learned writing the Polars book

What unexpected thing did you learn about Polars through writing your book? Did you already know everything there was to know about Polars before you started?

No, but that's, on a meta level, that was one of my biggest realizations of writing the first book. You see, I thought I knew a little bit about the Unix command line. I'll get to the Polars book in a moment, but I thought I knew about this. But as I wrote this, I was like, oh, wow, there is so much to learn here. And of course, feeling like an imposter all the while, you know, after a while, I started to realize, like, hey, you don't have to know everything in order to write a book. In fact, I would argue it's even better not to learn, not to know a whole lot about the thing you're writing about.

So I discussed this with Ritchie Vink, the creator of Polars, and we both agree that he would probably not be the best person to write a book because he's so deep into it, right? And of course, he wants to just focus on building Polars itself. So going to the Polars book, I knew, I didn't know that much about Polars, but we were able to use it on the job, right? We had to convert this big code base from Pandas to Polars and R to Python. And I just knew that as long as we were one step ahead, we could figure things out as we go and we could write this book.

When you are writing a book, you take a somewhat higher level view and then you start to notice missing things, right? Polars was, especially a year ago, very much under development. So you start to notice missing functions or inconsistencies or why does it work this way?

What I also noticed is that the developers behind Polars, so that's both the Polars team, but also the core contributors and all the volunteers around it, is that they're a friendly bunch and they're very responsive. So they're mostly hanging out on the Polars Discord server, which if you're interested in Polars, I can recommend you join. There's a link on, if you go to polarsguide.com on the homepage, there's a link to the Discord server. They're always very quick to help and to come up with answers to your questions.

Career advice: ask for things

I think we already covered some of that. Which is work out loud and work with the people or interact with the people you want to be colleagues with. And then my third advice would be, well, my meta advice would be to take all this advice with a grain of salt. But then coming back, my advice would be to ask for things. Just ask. And you'll be surprised how often people are willing to help you.

So the foreword of Data Science at the Command Line was written by Tim O'Reilly. That's right. The Tim O'Reilly. Yeah. Do you think he came up to me and suggested he write the foreword? I asked him. I asked him. And I was comfortable, kind of a little bit comfortable, not so much actually, doing so because I had interacted with him before about the command line. And he actually is one of four co-authors of Unix Power Tools. That's this thousand-page book, which is way too heavy. And so I knew he really had an interest still in the command line.

The point is, is that I asked him. And if you go look on, again, polarsguide.com, there's a page with praise. All those people, all right, don't tell this to anyone else, but all these people I reached out to and asked if they wanted to say something. Well, of course, first read the book. And then if they liked the book, if they wanted to say something nice, right. And those are the two examples that I can think of right now. But there are so many more that if I didn't ask, then something also wouldn't happen. So that is my third and final advice when it comes to your career.

I love it, Jeroen. Thank you so much. And I will echo that. You don't get what you don't ask for. I think a lot of times in my past lives, I have assumed that my work would speak for itself, which is absolutely not true. I see a lot of nodding heads. Your work is not going to speak for you. You've got to do it.

Next week we have Jay Timmerman, Head of Data Science and AI Platforms at Biogen. So if you are up for an insightful conversation about actual real-life practical use cases of AI in a big company, come hang out with us next week. I can't wait to see you. Same time, same place. Tell your friends. Thanks for hanging out. Thank you, Jeroen.

Thank you. Thank you, Libby. Thank you to everybody else. And I just wanted to say that if someone still has a question that we didn't get to answer in this show, if you Google for my name, you'll be able to find me. And feel free to reach out to me and I'll do my best to respond.