KEYNOTE: Dr. Jeroen Janssens - Embrace the Unix Command Line and Supercharge Your PyData Workflow

Transcript

This transcript was generated automatically and may contain errors.

Hello, everybody, and welcome to our very first keynote of PyData Global 2024. I am very excited to introduce our keynote speaker, Dr. Jeroen Janssens, and he is going to be sharing a talk with you, Embrace the Unix Command Line and Supercharge Your PyData Workflow. Thank you so much for being here today. Thank you so much for keynoting. The floor is all yours.

Thank you, Tamara, and thank you for joining me. Today, I'm going to talk about how you can embrace the Unix Command Line so that you can supercharge your PyData workflow. You see, I'm a huge fan of the Unix Command Line. I use it every single day for a variety of tasks, and I really believe that it makes me more productive and more efficient.

So, of course, I'm very enthusiastic about this, and whenever I tell a colleague or friend or some random stranger on the street that they should also start embracing the Unix Command Line, I'm often met with questions such as, what? Why? How? So, the time has come to answer all these questions and more, once and for all. So, today, I'm going to share with you every tip and trick that I know of that's going to help you embrace the Unix Command Line.

What is the command line?

So, let's start with the what. What is the Command Line? Here's what the Command Line looks like. On your system, it may look different. The Command Line is an interactive environment that allows you to run Command Line tools, right? If you know how to wield it and which spells to cast, then the Command Line can help you perform all sorts of magic. I mean tasks. But, hey, look at it. It's hideous. Sure, the penguin is cute, but don't be fooled. With a single command, you can wipe out your entire system. No wonder people are intimidated and reluctant to embrace it.

Still, I think you should embrace the Command Line. Says who? Well, says me. I started using Linux about 16 years ago when I was doing my PhD. Now, I wasn't quite ready to leave Windows, so I installed Ubuntu Linux next to it, a setup known as dual boot. Back then, there was no Windows Subsystem for Linux, aka WSL. After my PhD, I moved to New York City to work as a data scientist for a number of startups. I started using more and more tools for my daily job. And so, in 2013, I wrote a small blog post called Seven Command Line Tools for Data Science.

This hit the front page of Hacker News, and lots of people were commenting on it. So, I thought, hey, there might be something in here. Other people seem to be interested in this. And so, one thing led to another, and in 2014, I got the opportunity to write a book called Data Science at the Command Line. Then, in 2021, I wrote the second edition, which you can read for free on my website. This book also led me to start giving workshops, and eventually, I decided, all right, let's build a company out of this called Data Science Workshops. And then, thanks to the pandemic, this wasn't all that fun anymore, so I decided to join a company again called Xomnia, where I'm currently employed as a senior machine learning engineer. And that's actually the place where I came in touch with yet another interesting tool, Polars, which I'm not going to talk about today, but I just want to mention that I am currently writing a book together with Thijs Nieuwdorp about Python Polars, which should come out in February.

Why use the command line?

So, let's move on to the second question. Why? Why should you care about the Command Line? See, as researchers and developers, we have many, many tools at our disposal. Programming languages, IDEs, spreadsheets, pen and paper, and of course, the Command Line. In any case, I really believe that you should use the tool that gets the job done.

So, where does the Command Line fit into this? As this diagram tries to convey, I believe it hits a sweet spot between using the mouse and using a programming language, right? For example, renaming files or, as in this diagram, converting 100 JSON files to CSV. How would you do this? Try to think about this for a moment with the tools that you're currently using. Would you write a script for this? Perhaps a Python script? Is it worthwhile to do this for a one-off task? Or do you have some graphical application, some GUI that is able to convert JSON to CSV, or maybe a website? That might work for one or two files, but what if you have 100 of them? It gets tedious quite quickly. So, this is something that you can do very easily using the Command Line.

And another example, and this is a little bit more complex, right? Let's say you have a log file with over 10 million lines, and you need to know the top 10 errors in these logs. How would you do that, right? It's such a one-off task. And I'm, of course, going to try to convince you that the Command Line is really well suited for these kinds of tasks.
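
To make the log example concrete, here is a minimal Python sketch of the underlying count-and-rank step, not something shown in the talk. The log format (a level marker such as ERROR followed by the message), the function name top_errors, and its parameters are assumptions for illustration only. It reads lines one at a time, so it works the same on a file object or on standard input.

```python
from collections import Counter


def top_errors(lines, n=10, marker="ERROR"):
    """Return the n most common error messages as (message, count) pairs.

    Assumes (hypothetically) that an error line contains the marker
    followed by the message text, e.g. "2024-12-03 ERROR disk full".
    """
    counts = Counter()
    for line in lines:
        # Split on the first occurrence of the marker; `found` is empty
        # when the marker is absent, so non-error lines are skipped.
        _, found, message = line.partition(marker)
        if found:
            counts[message.strip()] += 1
    # most_common sorts by count, highest first.
    return counts.most_common(n)
```

Because only the distinct messages are kept in memory, the size of the log matters less than the number of different errors in it.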

Now, what other kinds of tasks can we do? Let's have a look here. So, when it comes to general tasks, the Command Line is really close to the file system, yeah? So, it makes it really easy to work with files and directories and to search through text. I'm sure that you've used the Command Line to pip install a package or two, or maybe even create a virtual environment, right? But the Command Line can do much, much more than that. If you're more on the operations side, it allows you to schedule jobs, configure servers, monitor your resources, and deploy software, right? Also, if you take a look at your CI/CD pipelines, right? If you peel away all those layers of YAML, then very often you find that it's actually a Command Line tool under the hood doing all the work.

Okay. But I think we all know what the real reason is, right? Why we want to use the Command Line: it makes you look like a hacker. And then, of course, I'm not talking about hacking into someone else's system. No, I am talking about solving a problem in a timely manner, right? This is the infamous data science Venn diagram by Drew Conway. And, you know, when it comes to hacking, I believe that the Command Line plays a very important role.

And if you don't believe me, then perhaps you believe this article in Nature. Nature argues that researchers should embrace the Command Line. They say that it can help wrangle with big files and that it allows you to parallelize your work and even automate it.

Now, I was convinced already 30 years ago when I saw Jurassic Park for the first time. And in this scene, Lex is able to close the doors because she knows Unix. Yeah? So, that the velociraptors don't get lunch that day. But there are plenty of other reasons.

So, one of the biggest reasons is perhaps that the Command Line has been around for, well, it says almost, but it should now be over 50 years, right? It's even older than I am. And because it's been around for so long, thanks to what is known as the Lindy effect, we can be pretty sure that it's going to be around for the rest of our careers. Yeah? And that's comforting to know. It means that the time that you invest now is going to be paying off for a very, very long time.

Yeah? Right. I already mentioned this. It is very close to the file system. Right? It's what we work with on a daily basis: files and data. And so, you can do most of these things using the mouse, right? But as I said earlier, the Command Line is much more suited for these one-off things. Of course, and I'll talk about that later as well, at a certain point there is definitely a benefit to using a programming language.

Now, here's another reason. I think that a lot of us have used the Command Line not only for installing Python packages, but also for Git. Sure, there are graphical user interfaces for this. But when shit hits the fan, right? When things are really messed up, then the Command Line is the only interface that allows you to clean up that mess, right? If you want to do it properly. Of course, you can throw it away and clone the repo again, right? With these graphical user interfaces, there's always some level of abstraction. And because of this, they have limitations and they cannot really scale as well.

I think that's quite the compliment from the original author of Apache Spark, and it's really interesting that they've decided to add this functionality to leverage a 50-year-old technology.

Key takeaways

Yeah. And with that, here are just a couple of things that I would like you to take away from this session. The first one is that the command line is here to stay. Yeah. It's not going anywhere, unlike the latest hip framework that you might be using. The command line is going to be around. So, it is definitely worthwhile to invest your time into this. Second, the CLAIM method, right? All these steps that you can take in order to really own, or claim, and become comfortable with this interactive environment. Yeah. And remember, you don't have to do every step, and you don't have to start using every command line tool that there is. You can start small, take baby steps, and build it from there.

And then lastly, my advice is to be creative, in two ways. First, try to see where you can fit the command line into your daily tasks, right? Is there something small that you can do on the command line? But also creative in the sense that you can create your own tools. Think about how the code that you've already written in Python can be turned into a command line tool. Are there certain static things that you can make variable? Or is this something that can be reused, either by yourself or maybe even by others?
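
As one illustration of turning existing Python code into a command line tool, here is a minimal sketch using the standard library's argparse and csv modules. It is not taken from the talk. The task (the mean of a CSV column), the function names, and the --column option are all hypothetical. The idea is that a value that used to be hard-coded in a script becomes an argument, and the data arrives on standard input so the tool can sit in a pipeline.

```python
import argparse
import csv
import sys


def column_mean(lines, column):
    """Return the mean of a numeric CSV column read from an iterable of lines."""
    values = [float(row[column]) for row in csv.DictReader(lines)]
    if not values:
        raise ValueError("no data rows found")
    return sum(values) / len(values)


def build_parser():
    parser = argparse.ArgumentParser(
        description="Print the mean of one column of CSV data read from standard input."
    )
    # Formerly a hard-coded value in the script; now a parameter (hypothetical name).
    parser.add_argument("--column", required=True, help="name of the numeric column")
    return parser


def main(argv=None):
    # argv=None makes argparse fall back to the real command line arguments.
    args = build_parser().parse_args(argv)
    print(column_mean(sys.stdin, args.column))

# To run this as a script, call main() under an `if __name__ == "__main__":` guard.
```

Saved under a hypothetical name such as column_mean.py, with the guard added, it could be invoked as `python column_mean.py --column price < data.csv`, where the file and column names are again placeholders.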

Yeah. So, with that, should you have any remaining questions, I just want to say that I'm organizing a round table in a few moments so that we can continue the conversation. And on my website, jeroenjanssens.com, you can find my email address and all my socials, should you want to get a hold of me after this conference. For now, thank you very much for your attention, and I wish you good luck in embracing the command line.

Fantastic. Thank you so much for that awesome keynote. The chat was hopping. So, attendees, you are more than welcome to join the lounge, which you can access through the main reception area. It should be at your top left. That will give you a lounge space, and he will create a table, and you all can sit around the round table virtually and continue this awesome discussion. Thank you so much for kicking off the PyData Global keynotes. It was such a pleasure, and thank you so much for your time. Attendees, thank you for your time. We are so glad and so thankful that you are here, and we will see you in sessions later on today. Thank you.

Thank you. Take care.
