Resources

Getting the Most Out of Git - posit::conf(2023)

Presented by Colin Gillespie

Did you believe that Git will solve all of your data science worries? Instead, you've been plunged HEAD~1 first into merging (or is that rebasing?) chaos. Issues are ignored, branches are everywhere, main never works, and no one really knows who owns the repository. Don't worry! There are ways to escape this pit of despair. Over the last few years, we've worked with many data science teams. During this time, we've spotted common patterns and also common pitfalls. While one size does not fit all, there are golden rules that should be followed. At the end of this talk, you'll understand the processes other data science teams implement to make Git work for them.

Presented at Posit Conference, between Sept 19-20 2023. Learn more at posit.co/conference.

Talk Track: Getting %$!@ done: productive workflows for data science.
Session Code: TALK-1091

Thumbnail from happygitwithr.com, still from Heaven King video

Transcript

This transcript was generated automatically and may contain errors.

Excellent. Right. Thank you very much for coming along. So, as introduced, my name's Colin Gillespie. I'm one of the co-founders of Jumping Rivers. A few years ago, I got roped into writing a book. The way that happens is publishers say you can write a book and you've got 18 months, and you go, yeah, I'll do that, because in 18 months' time I'll have lots of time. 18 months comes around flipping quickly.

Jumping Rivers, clearly the name tells you everything about us, but just in case it doesn't, we do lots of data science, all the data science stuff you think of. You know, R, Python, Shiny apps. We do lots of machine learning, managed Posit services. Come and have a chat round by the exhibitors. We've got a stall there.

The problem with Git workflows

Right. What are we talking about? So, we're talking about Git. So, you've used Git. I'm sure you've all used Git before. You've got that tortured look on your face. So, if you've used Git, you've probably come across this book, right? So, Jenny was in the last talk. She's popular. Writes excellent packages. She also writes excellent books. So, if you've not seen this book, have a look. It tells you how to be happy with Git and GitHub.

I suppose we could all go home now, couldn't we? Because, well, if Jenny says it's true. And not to go against Jenny, but I sometimes get things like this, you know. Detached head. Does anyone actually know what detached head means? Right. Even though the error message is very clear, you're in detached head. What? And so, we've got this. And then Git thinks, well, you're mocking me. So, now we're just going to reject you. You're just not getting to push. So, it's like, doesn't seem to be in a very happy place.

So, clearly, we're doing it wrong. Because that's always the issue, isn't it? You know, we're doing it wrong. You talk to someone who knows Git, and it's like, ah, you're doing it wrong. And so, it's probably a Git workflow. Okay? So, you know, we've got the wrong sort of workflow, which is why we're getting rejected and why we've got detached head, whatever that may mean.

So, you might think, well, let's think about a workflow. This came from an actual Stack Overflow question, so I'm not just making this up. So, it must be correct. So, you might have the main branch, right? So, these little orange dots are your commit messages. Okay? So, you go along, you add little commits. You can even imagine a little happy face in each of these little commits because we're good.

But we're going along, and then, well, every so often, we have a bug in our code. Not me, obviously, but other people have bugs in their code. But we may have a little bug in our code, and so we do some bug fixes. There's going to be a typo in the next slide, it's guaranteed. We'll have a little bug in the code. And that's fine. It's not too complicated. I think that's okay. We can deal with that.

But then we really need a dev branch because we're proper software developers. So, we want a nice dev branch because, you know, that feels good. And we've probably got some sort of release branch as well because we're really doing it properly. So, now we've got a release branch that tells us when things are released. And we've got a dev branch because that's showing how we're updating. And we've got these hot fixes, and that's okay. And then someone comes along and goes, but we want to add another feature. So, okay, we've got another feature. And we're working on another team, so we've now got another feature branch. And it just gets absolutely chaotic.

And I'm not making this up. I took this picture from a Stack Overflow question saying, this is our workflow. Does this sound sensible? And the answer was, yes, that looks really good.

How important is your code?

And so, I know what you're thinking, right? I can look around and I can see it in your face. I know what you're thinking. Because I thought the same, right? I really did. Can we trust Jenny Bryan when it comes down to it? Is Jenny in the room? Please, no. Can we trust Jenny? So, this talk is now going to be about this. Can we trust Jenny Bryan talk, right?

So, I'm going to ask you a very personal question. And I'm not expecting an answer. You just sort of keep it inside. How important is your code? Are you on the left? And if you're honest with yourself, no one really cares. Maybe not even yourself. And that's a really sad place to be. But, you know, no one really cares, right? And that's sad. Even worse is lots of people care, right? You know, something goes wrong. Your boss shouts at you. Somebody loses money. You know, something bad happens.

Not quite sure which side of the screen I want to be on. But, you know, we've got a line, right? You know, so it's no one cares. Pain and suffering at one end. So, this is a standard process. So, we've got a data scientist who's also an extra from Downton Abbey. And he likes doing code. So, he pushes straight to Git. Pushes straight to main. And then he's got some magical wand. And magic happens. And things just sort of spring up somewhere. You know, who knows how it happens. It's just, you know, no one really cares. It just sort of works-ish.

And this is usually bad. But not always, right? You know, my part of the talk is not for you to walk away and go, this is what you must do. Part of the talk is, it depends. So, I do this all the time. If I'm doing slides in Quarto or R Markdown, I'm giving a talk once. I just want it in Git. I want it, you know, it should be reproducible. But I'm only giving that talk once. And it's okay to do that, right? It's okay. Or you may be getting started in a project. You know, you're just sort of messing about. And you just want to get something up and running. And Git's quite a nice place if you drop your laptop and you're moving computers. That's okay, right? But typically it's bad. But there are a few exceptions.

The GitHub workflow: keep it simple

So, maybe someone does care, right? You're now feeling a little bit more love, right? You can give yourself a little hug. So, you've got someone who cares. And so, I'm sort of thinking about, you're writing stuff and people use it. But if you break their code, you don't really care. You don't say that out loud, by the way. But if you break their code, not really your problem if it comes down to it, right? You know, you may say, oh, I'm really sorry. I'll prioritize that. But it's not the end of the world.

So, what should you do here? Well, I'm going to say you should use branches, right? And you think, you've lied to us. You've just said this is terrible. Well, no, no, no, no, no. Don't do this, right? This is bad, right? Okay, I'll rephrase that. It's bad for most of the stuff that we do. And I'm going to say for 99%, because it's a good statistic made literally in the spur of the moment, 99% of the room doesn't need to do this, right? It's not us.

Keep it simple, right? You know, you've got the main branch. You create a feature. You do some commits. You merge it back into the main. Feature disappears. You create a feature. You merge it back into main. You create a feature. You merge it back into main. It's called the GitHub workflow, right? You know, it's nice and simple. It's not that hard. There's nothing particularly fancy. It just sort of works.

And so what I would suggest here is, you know, well, first of all, you should protect your main branch, right? So protecting your main branch just means you can't push to it, right? You know, you have to go through this branching. And people always sit here and nod, but when it comes down to actually locking down the main branch, they don't like that. But that's a side problem.
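As a rough illustration of the feature-branch flow with a locked-down main, here is a minimal Python sketch. It only builds the list of git commands rather than running them. The branch name, the remote name "origin", and both function names are assumptions made for the example, and real branch protection is configured in the hosting service's settings, not in a script.

```python
# Sketch of the feature-branch flow with a protected main branch.
# Names here (PROTECTED_BRANCHES, "origin") are illustrative assumptions.

PROTECTED_BRANCHES = {"main"}


def is_direct_push_allowed(branch: str) -> bool:
    """A protected branch rejects direct pushes; changes arrive via a merge."""
    return branch not in PROTECTED_BRANCHES


def feature_branch_commands(feature: str) -> list[list[str]]:
    """Return the git commands for starting and publishing one feature branch."""
    if not is_direct_push_allowed(feature):
        raise ValueError(f"'{feature}' is protected; pick a feature branch name")
    return [
        ["git", "switch", "main"],
        ["git", "pull"],
        ["git", "switch", "-c", feature],  # create the feature branch
        # edit files, then `git add` and `git commit` as usual
        ["git", "push", "-u", "origin", feature],  # publish the branch
    ]
```

Calling `feature_branch_commands("add-dropdown-menu")` gives the commands to run in order; the merge back into main then happens on the hosting site once the branch has been reviewed.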

And then have some basic continuous integration, right? So continuous integration is when you push something up to Git, so GitHub or GitLab, and a little computer in the sky does some work for you, right? So it sort of chugs away. And if you pass all its tests, you hear nothing. And if you don't pass, it sends you nasty messages saying you failed. And the sort of test could be, does your Shiny app run? I'm not talking about 100 unit tests. I'm just saying, does it run? If you run runApp, will it actually just run the app, right? Nothing particularly fancy. Can you build your package? Does your package just sort of build? Right, it's as simple as that. Does your Quarto doc compile? Right, so something relatively simple, relatively straightforward, that just takes that little bit of pain and suffering away.
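The "does it just run?" idea can be sketched as a tiny check runner. This is a minimal Python sketch, not the speaker's CI setup: `run_checks` is a made-up helper, and the commands you pass in are up to you. In an R project they might be something like `R CMD build .` or `quarto render`, but that choice is an assumption.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each command and record whether it exited with status 0."""
    results = {}
    for name, command in checks.items():
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


# Placeholder commands so the sketch runs anywhere Python does.
EXAMPLE_CHECKS = {
    "always-passes": [sys.executable, "-c", "print('ok')"],
    "always-fails": [sys.executable, "-c", "raise SystemExit(1)"],
}
```

A CI job would call `run_checks`, stay quiet when every value is `True`, and exit with a non-zero status otherwise so the failure shows up on the branch.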

So that's what I'd say to start. So here, our little Downton Abbey guy. Well, he now tries to push to main, and he gets a very sad face. So he's not very happy now. But he can make a little dev branch. So here I'm pretending it's a little Shiny app. So I've made a little drop-down menu. I push to a branch. And then the CI passes. That's good. It can get merged into main.

And when you're doing this in GitHub and GitLab, you can do things like if the CI passes, it automatically gets merged. Right, you can sort of do that sort of stuff. So you don't have to sort of go to GitHub and then start clicking lots of buttons, because if you're clicking lots of buttons, you're probably doing it wrong. But you can set it up so you can actually push from the command line. It creates that branch. It runs a CI. If the CI passes, it automatically merges. And everything just disappears. So you're sort of still at that nice, simple step. But you've got that safety net of not shooting yourself and then digging a big hole for yourself and then throwing it in head first. You're a bit more safe.

Code owners and scheduled CI

So let's suppose we're getting a bit more important. So we've got a larger team. So you might have two, three, four, five people. So there's more than one developer. And we're wanting to keep track of who owns what.

And we're also, so I used to work in academia. And I used to work with a professor writing a paper. And we used to use Git. And his idea of using Git, he would phone me and say, right, I'm about to do some commits. And then he would hang up the phone, and he would commit. He was more senior to me, so you just have to agree. He was like, yes, OK. So that was our Git workflow, was a phone call. Not recommending that workflow in case anyone's not paying attention. Don't do the phoning part.

So keeping track of who owns what. So there's a code owners file. This is a really simple idea. And I think lots of your Git repos should have it. So it's a very simple text file. GitHub and GitLab understand this natively. So that means that GitHub actually understands what this file does and means and all that sort of thing. You basically create a little file called .github/CODEOWNERS. Just Google it. And it looks something like this. So star just means the group notes admin, so that would be a group of users, can do what they want. They can merge, join, and all that sort of stuff. And this person, Amy, has also got those superpowers. Then we've got another directory. So website might be a directory inside my Git repo. And website admins and Tim are only allowed to change that website, but they're not allowed to change other stuff. There's a few more options, but keep it simple.
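The slide itself is not in the transcript, so the following is a minimal Python sketch with a made-up CODEOWNERS file along the lines described: a default rule for everything and a narrower rule for a website directory. The group and user names are hypothetical, and the matching is deliberately simplified to `*` and `/dir/` patterns. The "last matching rule wins" behaviour does follow the documented format.

```python
# Hypothetical CODEOWNERS content; the group and user names are made up.
EXAMPLE_CODEOWNERS = """\
# Default owners for everything in the repository
*           @example-org/admins @amy

# Owners for the website directory only
/website/   @example-org/website-admins @tim
"""


def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Return (pattern, owners) pairs, skipping blank lines and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """Simplified matching: '*' matches everything and '/dir/' matches a
    directory prefix. The last matching rule wins."""
    matched: list[str] = []
    for pattern, owners in rules:
        if pattern == "*" or ("/" + path).startswith(pattern):
            matched = owners
    return matched
```

With these rules, a change to `website/index.html` belongs to the website owners, while a change to `R/app.R` falls back to the default owners. On GitHub, the owners are the people asked to review changes to the matching paths.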

And this is great because now we can start locking down this workflow. And also, if you're working in a team, so we deal a lot with governments and those sorts of organizations where people move around teams. Because this is a file, it's a programmatic file, you can now write queries such as, what Git repos is Tim involved in? So Tim is moving from organization A to over here. So before Tim leaves, we can say, Tim, what repos are you involved in? And we've now got a programmatic way of pulling that information out. And then we can rearrange responsibilities. And that's nice.
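The "what repos is Tim involved in?" query might look something like the sketch below. It assumes the CODEOWNERS text of each repository has already been fetched into a dictionary; the repository names, the `@tim` and `@amy` handles, and the function name are all hypothetical. It looks only for a handle named directly in the file and does not expand group membership.

```python
# Hypothetical data: repository name -> contents of its CODEOWNERS file.
CODEOWNERS_BY_REPO = {
    "sales-dashboard": "*  @example-org/admins @amy\n/website/  @tim\n",
    "internal-pkg": "*  @amy\n",
    "reports": "# owners\n*  @tim @amy\n",
}


def repos_for_owner(owner: str, codeowners_by_repo: dict[str, str]) -> list[str]:
    """Return the repositories whose CODEOWNERS file names the given owner."""
    found = []
    for repo, text in codeowners_by_repo.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            _pattern, *owners = line.split()
            if owner in owners:
                found.append(repo)
                break  # one mention is enough for this repository
    return sorted(found)
```

Running `repos_for_owner("@tim", CODEOWNERS_BY_REPO)` lists the repositories to hand over before Tim moves on.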

And so essentially, we've got exactly the same workflow, except we've got code owners who are people doing the merging part. Not that hard, quite simple.

Next one, so we're moving along this line. Scheduled CI. So you've got your CI up and running. So scheduled CI, hopefully you can figure this part out. It's a CI that's scheduled. So what we do is we've got a whole bunch of internal R packages, probably like most of the people here. And they're built once a month. It's on the 10th, and I've got them scheduled to run from sometime between midnight and 8 o'clock. So when things break, you just get all these random messages through this period of eight hours. And so we've got internal R packages. Any errors are sent to a Slack channel. So everything passes. So some months, not a problem. Other months, things start to fail.
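The reporting half of a scheduled build can be sketched in a few lines of Python. The schedule itself would live in the CI system's own configuration, so this sketch only covers turning build results into a Slack message. The package names and function names are hypothetical, and the webhook URL is something you would supply yourself; it assumes a Slack incoming webhook, which accepts a JSON body with a `text` field.

```python
import json
import urllib.request
from typing import Optional


def build_slack_payload(results: dict[str, bool]) -> Optional[dict]:
    """Return a Slack message for failed packages, or None if all passed."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return None  # everything passed, so stay quiet
    return {"text": "Scheduled build failed for: " + ", ".join(failed)}


def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to a Slack incoming-webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()
```

A scheduled job would build every package, collect a name-to-result dictionary, and call `post_to_slack` only when `build_slack_payload` returns something.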

So on the 10th of every month, we get all these checks. And then crucially, we assign someone at Jumping Rivers to fix that. We give them actual time in their workload. We say three hours a month, you've got to fix anything that comes up. Some months, not a problem. They don't have anything, so don't know what to do. Go to the pub or something. Other months, it's a bit more painful. But there's that time where things are just kept routine.

Also, a useful thing is when something breaks, we then have to think about, do we care about this? Or should we just get rid of this stuff? It's very good for doing legacy software. If something breaks, and you actually have to fix a silly thing, you think a lot more carefully of, does anyone actually use it? Do I really care about this? Or could I just archive it? So that's also just a nice way. Especially, I think, in many organizations, it's quite easy to create this stuff. And you look a year back, and you've just got stuff everywhere. It keeps you honest.

Continuous deployment and pedantic CI

Right. Next thing. So we're moving on. Continuous deployment. That tends to go along with continuous integration. So here, what I'm wanting is I've created a new feature. And when I create that new feature and push that to Git, I want something deployed without me having to press buttons. I want magic to happen. Okay? So what I'm meaning here is we've got the setup as before. Code owner, can't push to main. When I push this new drop-down menu to my app, I automatically launch that Shiny app with that new feature.

So I'm pushing to the branch. I've created this branch. And the Shiny app is automatically created for me. And then that means that my wonderful code owner can then go in, can review the code, and can also start clicking on things in the Shiny app. You're taking that little bit of effort away from them. And then if everything passes, the code owner is happy, it passes the CI, then it would automatically deploy onto the production server. So no one is having to press deploy, go, move forward. And then when I've merged that, the dev app would just sort of magically disappear without anyone having to worry about it.
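The decision logic behind that flow can be written down as a small function. This is a sketch under assumptions, not the Jumping Rivers pipeline: the event names, the preview naming scheme, and the returned step strings are all invented for illustration, and actually deploying or removing an app depends on your server.

```python
def preview_app_name(app: str, branch: str) -> str:
    """Derive a per-branch app name, e.g. 'sales-app--add-dropdown'."""
    safe_branch = "".join(c if c.isalnum() else "-" for c in branch.lower())
    return f"{app}--{safe_branch}"


def plan_deployment(
    event: str,
    app: str,
    branch: str,
    ci_passed: bool = False,
    owner_approved: bool = False,
) -> list[str]:
    """Return the deployment steps for a repository event.

    event is 'push' (a commit pushed to a feature branch) or
    'merge' (the feature branch being merged into main).
    """
    if event == "push" and branch != "main":
        # Every push to a feature branch gets its own preview app.
        return [f"deploy preview {preview_app_name(app, branch)}"]
    if event == "merge" and ci_passed and owner_approved:
        # Production is updated and the preview is cleaned up.
        return [
            f"deploy production {app}",
            f"remove preview {preview_app_name(app, branch)}",
        ]
    return []
```

The point of the design is that nobody presses a deploy button: the preview appears on push, and production is updated and the preview removed only once CI and the code owner are both happy.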

This part is starting to take a bit more infrastructure. This part is starting to take a bit more maintenance. Doing a code owners file is dead easy. It's ten minutes of Googling, five minutes of writing. This stuff you're having to think about, where you're launching a Shiny app, for example, it's going to Connect or it's going to wherever you want. And then you have to try and destroy it when that branch is then merged. And so it does get a little bit more tricky. But it's a really nice workflow. And that's what we do internally. So it works really well.

The last part, and I'm just going to just sort of lump this into sort of pedantic CI. And this is just sort of wherever you want to go. Full honesty, I did this one first seven years ago and really annoyed the whole company. But that's a side problem. So a NEWS file, you can have a little CI job that checks, is the NEWS file formatted properly? Has it been updated? So one of our CI jobs is whenever you update the package, it checks that the package version has been updated. So you've made a change to the package. So the version must be updated. Otherwise installed packages just breaks. So it checks that. And it goes, if you've updated the version number, have you updated the NEWS file? Because it's really easy to forget this stuff. It's a little CI job that just checks this. So it checks, is the DESCRIPTION tidy as well? It does other things like, are file names lowercase?
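A version-and-NEWS check of this kind could be sketched as follows. This is a minimal Python stand-in, not the speaker's actual CI job: it takes the text of the DESCRIPTION file on main, the DESCRIPTION on the branch, and the branch's NEWS file, and it assumes the NEWS file mentions the new version number verbatim. The function names are made up.

```python
import re


def version_string(description_text: str) -> str:
    """Extract the Version field from an R package DESCRIPTION file."""
    match = re.search(r"^Version:[ \t]*(\S+)", description_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("DESCRIPTION has no Version field")
    return match.group(1)


def version_key(version: str) -> tuple[int, ...]:
    """Turn '0.1.0' or '1.0-2' into a tuple of integers for comparison."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def check_release_hygiene(
    main_description: str, branch_description: str, branch_news: str
) -> list[str]:
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    old = version_string(main_description)
    new = version_string(branch_description)
    if version_key(new) <= version_key(old):
        problems.append("package version was not increased")
    if new not in branch_news:
        problems.append(f"NEWS does not mention version {new}")
    return problems
```

The substring test on the NEWS file is crude (it would accept `10.2.0` when looking for `0.2.0`), which is fine for a sketch but worth tightening in a real job.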

We could do some stuff on commit messages, which I'll talk about later. But the world's your oyster, right? You could start thinking about, actually, we've got 50% unit test coverage. If you make any changes, it must be at least 50% or greater unit test coverage. You're not allowed to degrade that experience. So you can start adding on all these hoops to go through.
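The "not allowed to degrade" rule for coverage is a one-line comparison once the two numbers are known. A minimal sketch, assuming the baseline and current percentages have already been measured by whatever coverage tool the project uses; the function name and messages are made up.

```python
def coverage_gate(baseline_percent: float, current_percent: float) -> tuple[bool, str]:
    """Pass only when coverage is at least the recorded baseline."""
    if current_percent >= baseline_percent:
        return True, (
            f"coverage {current_percent:.1f}% meets baseline {baseline_percent:.1f}%"
        )
    return False, (
        f"coverage fell from {baseline_percent:.1f}% to {current_percent:.1f}%"
    )
```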

This stuff here starts to make it hard to onboard team members. All right? So all this stuff will say, well, you can't just get someone jumping in to start doing commits because you've got all this stuff to sort of make things proper. But it depends where you are on that line. All right? You know, if you're making software where, if it breaks, lots of people are going to be upset, then, well, that's what you have to do. You don't want a mistake.

So, I mean, it's where you are on that line. Something like, you know, commitlint, that's not an R package, by the way. So if you Google it, it's an additional sort of add-on. Essentially, whenever you do a commit, it checks your commit message. It checks that it follows particular standards. Right? So the first part, for example, we have a bunch of keywords. So it might be chore, fix, feature, CI, docs, you know, those sorts of words. If you do anything other than those words, you'll get a little shirty message. Then it does some stuff on the right-hand side. So does it start with a capital letter? Is it too long? Is it too short? So again, it does that.
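The kind of rules described can be imitated in a few lines of Python. This is a stand-in for illustration only, not commitlint itself or its configuration: the allowed keywords come from the talk, but the `<type>: <subject>` layout, the length limits, and the choice to require a capital letter are assumptions.

```python
import re

ALLOWED_TYPES = {"chore", "fix", "feature", "ci", "docs"}
MIN_SUBJECT_LENGTH = 10  # assumed limit, not taken from the talk
MAX_SUBJECT_LENGTH = 72  # assumed limit, not taken from the talk


def lint_commit_message(message: str) -> list[str]:
    """Check the first line of a commit message; return a list of problems."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = re.fullmatch(r"(\w+): (.+)", first_line)
    if match is None:
        return ["expected '<type>: <subject>'"]
    commit_type, subject = match.groups()
    problems = []
    if commit_type not in ALLOWED_TYPES:
        problems.append(f"unknown type '{commit_type}'")
    if not subject[0].isupper():
        problems.append("subject should start with a capital letter")
    if len(subject) < MIN_SUBJECT_LENGTH:
        problems.append("subject is too short")
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append("subject is too long")
    return problems
```

A message such as `feature: Add a dropdown menu to the app` passes, while a bare `bump` is rejected outright, which is exactly the friction described next.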

This can be really annoying, right? This works well when you've got this mythical place, which I always live in, where your first commit just works. Because for your first commit you write feature, you know, adding a new dropdown menu. But then you've got another 15 commits after that in order to fix a thing that didn't quite work. And so I tend to then write, fix, bump, fix, bump, and then you're into rebasing and all that sort of stuff. So it can be really annoying, but it can also be useful. I really do mean it can be really annoying.

Summary

Right. So just to summarize, so if you're at the far left, and it's only you, who cares? Right. Basically, you know, use something simple, get stuff done. Right. But then as you sort of make your way along, you know, people are actually using your software, and that could be future you in a year's time. Right. I hate past me. He's a complete, right? Then you want to start thinking about protecting main, adding in CI, thinking about code owners, if there's more than one person, thinking about scheduled CI. Scheduled CI, again, five-minute job to set up. Right. You can stick in a Slack hook, or stick in an email. Not that hard. You know, once you've got that, quite easy. Not so much fun when you get 15 Slack messages, telling you all your packages are broken, but side point. Continuous deployment, absolutely wonderful, but that's not a five-minute job. That also, you know, from my experience, takes maintenance. It's not a sort of, you just do it, and then you walk away, because you're deploying it to a server, and there's authentication, and credentials, and stuff. Right. There's stuff there. Worthwhile, but there's stuff. And then after that, you can just, you know, go where you want.

So, I think we can trust Jenny Bryan. So, that's good. I'm safe. Not so sure she's right about setwd, if I'm perfectly honest. You know, so she's got quite strong opinions on setting the working directory. So, I'll think about that for next year. But, thank you very much for listening, and I hope you found it enjoyable.

We do have time for a little question. So, I do find that it is easiest to maintain good git ways in life, if there are multiple people working on a project. Often the exact opposite, but oh well. I find it hard to, to be my own, like, boss when it comes to that. Is there any, like, good ways when you're alone to like keep, keep the momentum going there?

So, something I didn't touch on is you can start templating this stuff. So, rather than constantly adding little GitHub YAML files with your CI part, your GitLab YAMLs, you can put it in one place and essentially template it all the way through. So, then when you're setting up this stuff, you're essentially just sort of pointing, say, use this stuff over here. So, then your CI, you know, you can have a very simple CI file, which is just use stuff over there. And then essentially the computer is a person keeping you on the straight and narrow because the computer will say no, you're not allowed to push, no matter how much you swear at the computer.

Okay. Fantastic. From the Burma Bihar, do you like Git? I do. But that's because everybody else is using it wrong now. It's just, yeah. But yeah, I do, but it's taken so long to understand what's going on. You know, it's taking a lot of time. Fantastic. Thank you, Colin.