Data Science Hangout | Ivonne Carrillo Domínguez, Bixal | Transitioning to data engineering
We were joined by Ivonne Carrillo Domínguez, Data Engineering Manager at Bixal. Ivonne is passionate about storytelling and empowering data professionals to jump to the cloud.

Here's a snippet (55:00) including a few thoughts that both Ivonne and Brittany shared at the hangout regarding some people being overlooked for data engineering roles based on their experience:

When I am hiring people for a data engineer position, I focus more on the programming skills. If you are a good engineer, you will learn Spark or MapReduce in time. You will end up using certain tools, but in my opinion I wouldn't say no to someone because of a specific technology. If you find a good candidate, you can teach them or mentor them in specific frameworks. When I started my first job as a data engineer, I didn't know about big data or Spark. I had a lot of experience with Java, and that's why they hired me. I think knowing how to program is more important, and that's the philosophy I use. I think what they are looking for when they ask about certain tools is whether you have experience with concurrency or parallel jobs. Sometimes you need to think a bit differently with distributed computing frameworks, but if you start playing around and reading about it, it will also help give you an opportunity to land a job.

Sometimes when you're working through a third party like a hiring recruiter, they may not know all of the things that are needed for the role. They're given the job description, and they may not know "I really need someone who has this skill, but if they don't have this particular one, I'm willing to train on that." The recruiter might not know that information, and this may even change depending on how long the role is open. What's best for the candidate in that case is to get as close to the actual hiring manager as possible.
Give them your resume and let them know your experience, because sometimes they'll look at your resume and would absolutely love to give you an interview, where the recruiter might say, "oh, you don't meet X percent of the job description, therefore I'm not going to pass it on." This is one of the disconnects of having a third party do that interfacing.

Where to find more?
► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
► Data Science Hangout site: rstudio.com/data-science-hangout
► Add the Data Science Hangout to your calendar: rstd.io/datasciencehangout

Follow Us Here:
Website: https://www.rstudio.com
LinkedIn: https://www.linkedin.com/company/rstudio
Twitter: https://twitter.com/rstudio
Transcript
This transcript was generated automatically and may contain errors.
Hi everybody, welcome to the Data Science Hangout. If you're joining for the first time today, I'm Rachel, and it's great to meet you. If this is your first Hangout, this is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, and what's going on in the world of data science.
The sessions are recorded and shared to YouTube as well as the RStudio Data Science Hangout site, so you can always go back and re-watch or find helpful resources too. We also have a LinkedIn group that Tyler or Hannah will share in just a second here in the chat. If you ever want to continue a certain discussion or meet people from today as well, together we're all dedicated to creating this welcoming environment for everybody.
We love when everybody can participate in these and we can hear from everyone. So there are three ways that you can ask questions today. You can jump in by raising your hand on Zoom and I can call on you. You can put questions into the Zoom chat, and feel free to just put a little star next to your question if you want me to read it instead, if it's maybe loud where you are or something. But we also have a Slido link where you can ask questions anonymously as well.
Just to reiterate, we love to hear from everybody, no matter your level of experience or area of work. And the questions asked today are all up to you.
I'm so excited to be joined by my co-host for today, Ivonne Carrillo Domínguez, Data Engineering Manager at Bixal. Ivonne, I'd love to just start by having you introduce yourself and maybe share a little bit about your work and the company that you work at as well.
Sure. Thank you. Thank you for inviting me to hang out with you. I was feeling a little bit of imposter syndrome because I was like, I'm not a data scientist. I don't know what I can contribute, but I hope this is helpful and productive for you all.
And well, about my role, I am Data Engineering Manager, as Rachel said, and my role is similar to the manager roles that you will see in small companies, because we are a small company and the data team is still small and we are growing it little by little. What that means for me is that I am leading a team and also mentoring them, but on some contracts I still work as an individual contributor as well. So I am a data engineer on one of the contracts, while at the same time leading a team of data analysts, engineers, and scientists, so it's all together. That's, in a nutshell, or at a high level, what I do here at Bixal.
Well, about Bixal, we are a company that does consultancy work for the federal government, so most of our contracts are for the government. There are opportunities as well in the private sector, and I think that's where we are heading right now, but yeah, the kind of work that we do is digital services for learning, for marketing, communications, the technology part, which is the biggest part, where we do content systems and web applications, and then, of course, the data team. Yeah, that's what we do.
Excitement about the data space
Awesome. Thank you. And while we wait for questions to come in from the audience here, I'm curious, what's something that you're most excited about in the data space in the next year or so?
Well, one of the things that I am seeing, mostly in the federal government, where I have more experience, is that some years ago, most of the agencies were at the point where they had a lot of data, but they didn't know what to do with it, or how to ingest it in the cloud, how to analyze it. They just knew that there was some data. Now what I am seeing is that a lot of agencies already have some sort of reporting system, or a dashboard, or a data pipeline, and they are in another phase, and they want to do more things with that. They want to start, like, for real, making decisions using data, so data-driven decisions, and using it also to lead, or to make policy changes. That's something that I really like, and I'm really excited about. I'm also seeing more opportunities for machine learning and predictions. I remember, even two years ago, most of the projects were still at the point where they needed a data pipeline, so now it's like the other stage, so I'm really excited about that.
Getting to know Ivonne
That's great, thank you. I said I was going to start asking this question after last week as well, but I wanted to ask people so we can get to know them a little bit better, but what's something that you like to do in your free time outside of work?
So, what I do, I am, I'm not a super fun person, but well, I have a dog, and I like to hang out with her and walk her. But one of the things that I also like is to cook and to bake, and I think that's something that I was doing before the pandemic, but definitely since the pandemic, that grew exponentially. Another thing about me is that I am vegan, so one of the things that I like is to veganize dishes. For example, I am from Mexico, and Mexican cuisine can be very vast, but many of the dishes have meat, so trying to make them without meat is something that I really like. So that's something that I do, yep.
Balancing individual contributor and manager roles
I see, Tanya, you asked a great question in the chat. Do you want to jump in and ask that live? Yeah, sure, and I'll probably tag team with Libby's question too, because it's kind of all in the same vein. So, you mentioned that you are an individual contributor and a data engineering manager, and I'm just curious, how do you balance your time between these two different roles? I imagine it's kind of going from one mentality of doing to thinking more high level, so I'm curious how you navigate that. And then Libby asks a great question too of which do you like better?
That's interesting. It depends on the day, but yeah, I think the thing that I enjoy the most is to mentor people and to share my knowledge, and since I still need to put my hands on the code and work with a team as an individual contributor, I learn a lot there, so that's something that I try to take advantage of. I mean, in a contract, I will always meet people that know more than me and they do things differently as well, so that's something that I like about that, that I can take and then share that with the rest of the team.
I think at this point, I'm starting to enjoy the manager role more. Right now, the two of them can have stressful times, but the kinds of problems that I have when I am a mentor or a manager, I think I enjoy more; I'm more excited about that. Sometimes the work or the stress is about setting up an environment, and before, I really enjoyed that. Right now, I'm like, no, I don't know if I want to spend my time, or most of my time, on that. I still like it, but one of the things that I get to do as a manager is also design solutions, or solution architecting, and in that role, I need to investigate new technologies, a new tech stack, so I need to set up that environment, but it is a new thing and I am learning and figuring out if that's the best solution or not, so that part I like. When it's as an IC, it's different. Sometimes, if you're on a contract, it's going to be the same tech stack until the contract ends, and as I say, I still like that and I get benefit from it, because I will collaborate with people and I will learn a lot. But if I need to choose one, I will go with the managerial.
Does an engineering manager need a technical background?
I have a follow-up question, if that's okay. So, Elon Musk put out that thing that was like, you have to be an engineer in order to be an engineering manager. What do you think about that? Do you think that having the technical background is really important to your ability to lead these engineers and mentor people, or do you think that you would still be good at it if you didn't have a technical background? I think you will still be good at it.
It helps me, because I can understand some of the problems, and I can also help to be the bridge between the data analysts and the infrastructure engineers, for example, or other software engineers. I have that background and it helps, but I think the most important thing is to be empathetic with people and have listening skills. There are other things when you are doing people management that are maybe more important, and I know, especially in the IT field, there are engineers that want the engineering manager to know more than them, but that's not always going to be possible. I have that background, having spent almost all my career as a software engineer, but now that I am moving towards being a manager, organically, I won't be able to stay in contact with the code, so I won't have that deep-down knowledge. I think we shouldn't give a lot of weight to that, and if you don't know something, that's okay; you can lead that person to the person that will know, or that will mentor them in that specific skill, or get some formal educational training for them. So there are other ways that you can solve that or navigate that.
If your role as an engineering manager includes the solution architecting part, that's different, because maybe for that, yeah, you need that background, but otherwise, no. There are different types of managers; they can be very technical or not. In my case, it is more technical, but I think what makes a great manager is that they have those people management skills, for sure.
Benefits of being a data professional when talking to senior leadership
I can unmute. So I know that we had talked a little bit earlier, Ivonne, about benefits of being a data professional when talking to senior leadership, and I was curious if you could give us a little more information about what you mean by that. Yeah, so I've seen that senior leadership, so the C-level and VPs, like to talk with data professionals because they feel like we are more flexible, or we can give the information faster. I think what happens is that, when we want to explain something, we're very good at using visual information as opposed to just talking. So if I want to present a solution, I know what they want to hear: time, resources, money, and I can put that in a nice chart, because that's what we do all the time, and they can see right away all the information that they need. I think that's one of the benefits of being a data professional: if you need to share some information, or to inform or lead some people to some decision, you can do that with a dashboard or a chart. It's not that you don't need to talk. Obviously, you will need to explain things, but even then, whenever they ask questions, the information is there, so you can just point to that, like, okay, this is the money that you're going to save, or this is the time that we could save as well.
So yeah, that's something that I'm seeing, and it's not just me; I see that with other people on my team as well. Whenever they want to talk with other teams, they always prepare a dashboard very quickly, or just a workflow, for example. When we talk with infrastructure engineers as well, we have the workflow. Maybe we won't know exactly the services that we want, but we can say, I need some storage here, I need to run some R or Python script here, I need other storage here, and I need some sort of database here, and they can know or suggest, and we can work with them more easily. That kind of collaboration is always easier when you have those visuals, and I think that's something that we do very fast, since sharing information with visuals is what we do. That's something that we leverage when contributing or collaborating with other teams.
Planning the workday
I see there was a question on Slido a bit earlier that was, when you get to your desk in the morning and start your workday, what's the first thing you think about? Well, the meetings, the meetings that I have throughout the day. Sometimes I have very nice days when I just have one meeting and that's all, but yeah, that's what I do to plan my day. For the hours that I don't have any meetings, sometimes I block that time, like, this is my focus time and I need to work. So that's one of the things that I do to plan my day, meetings versus focus time. But also, if I have a pending thing with one of my team members, I know that they are working on something, maybe somebody asked them for an estimation, or they need to do something and I know that it's time sensitive, so I plan some time just to check in and see if they need some support. And if not, yeah, all day is going to be just focus time.
Navigating tool constraints in government work
Thank you. I see, Justin, you had asked a question, or it sounds like someone was about to jump in. Do you want to jump in with that question, Justin? Yeah, sure. Hi, everyone. Rachel, thanks for putting these together. These are great. This is my first time joining, so thanks again. I'm in the city of Baltimore, and my question is basically around, you know, a lot of folks who get into data science are really eager to use kind of new tools, machine learning, and I expect this happens in industry too, but particularly in government, we get kind of constrained about what we can actually use. It has to be either sanctioned by IT or at least needs to meet certain guidelines and parameters, and not everyone knows how to use those tools. So how do you manage those expectations for folks about, you know, yeah, that looks like a great tool, but these three things over here are the ones that our company uses, or that the government uses, to get our work done? Do you run into those kinds of frustrations, or yeah, how do you manage that?
Yeah, so there are different reasons why that will happen. The first one is just budget: maybe they already have a license, let's say a Tableau license, but we see that we could do it with Power BI because they use SharePoint, for example, but they don't have that license, and it's just budget. Getting approval for tools, and yeah, money, can take time, so that's something that we need to adapt to; that's the tool that they have. If we see that we can get the approval in time, we will take the risk. Otherwise, we just go with the tools that they have approved, and yeah, just work around that.
And another part is what you were mentioning, that they already have some tools that they know, and they are used to that, and it's hard to make them make the move. Maybe some people really want to change, but they know that some users will have some resistance to that change, and not everybody is on the technology side. They are not very tech savvy, so we need to understand that part as well, that they have a job that is not related to technology, so it's going to be hard to move a whole group of people to another tool. So it depends on the audience as well, on how big the audience is. One of the things that we do as well is precisely learning services, so we can propose to prepare some trainings and some documentation that they can use, so we can use another tool. But the reason is going to be, or it should always be, because it is cost-effective: we are going to do it faster, or it's going to be easier to maintain.
Sometimes, they are proposing things that will require a whole team on their side just to maintain that infrastructure or software, and we can propose something that is going to be easier to maintain, so at the end of the day, it's more cost-effective. Most people will like that; if it is going to save them time and money, they are always going to like that. But then there's the other part, that they need some training or education, and that's something that we can also mitigate by providing those services as well. And there are people that are just afraid to change, because they feel that it's going to take more time to change all this, and that, at the end of the day, is money, and maybe they don't have that. But if we really do the due diligence on that discovery and say, okay, at the beginning, yes, it's going to take more time, but then you are going to save all this, that's something that we could do. And sometimes it's just a no from their side.
Maybe they didn't get the approval, and in that case, we just need to tell them what is going to happen: okay, we are going to use your tools and your methodologies, but we are not going to be able to deliver all the things that you want, so with those restrictions, this is what we can deliver. And we explain to them why, like, we cannot optimize or modernize your system if we have those limits; we can just do these three things, for example. Most clients will understand that, because they know that there's not much they can do as well.
It's a really helpful phrase, too. With those restrictions, this is what we can deliver. It looks like in the chat there are a lot of people that are experiencing this as well and are saying they feel your pain, Justin, so I'd love to stay on this topic a little bit longer if other people have gone through this and have other things to share as well.
Not to call on anyone, but I know, David, you mentioned that you've had to do this, and Evan. Sorry, Rachel, I was trying to multitask here while you were asking that question, but yeah, definitely. For those who don't know, I work at NASA, and we've been trying to implement a lot of different tools, and it's been a long, arduous journey over the last three or four years just trying to get an infrastructure set up, so it's definitely not an easy task within the government at times, and I'm sure industry is the same way. But we've been able to do a lot of things. There's always a group that wants to use no-code, low-code type tools and just buy from a vendor, which for some of the organizations in our agency that have the funding, maybe that's easy to do, because they can spend that kind of money on a tool such as Alteryx, which costs us $5,000 a user per year. But my organization, with little funds, we need to be able to do things with coding, so it's definitely just a matter of keeping the conversation going and showing the value of open-source technologies and other resources rather than some of these no-code, low-code type tools. It's not easy, but it can be done.
I see, Jamie, you have your hand raised. You want to jump in? Sure, yeah, just a brief note, two minutes to add to that. Also coming from a public setting, I'm in Toronto with a hospital group. We've recently been exploring some more container-based tech stacks, largely for this reason, to give sort of the flexibility that some of our engineers and scientists may want, although at other expenses too; if everyone's using their own language and own frameworks, it can become unmaintainable as well. Another quick note: I've been kind of surprised, in my own hiring, by how much a tech stack and a job description can influence what kind of candidates you receive. I was trying really hard to hire a developer at one point, and just changing the tech stack to something that was more appealing to industry at the time completely changed what kind of candidates we had.
That's a very good point. Hiring IT people is hard, and when you are in the public sector, it's just harder, for sure.
Hardest recurring problems in data work
I see there are a few other questions coming through on Slido as well. And Ivonne, what are some of the hardest problems you find yourself trying to solve over and over again, where it feels like we don't have any good solutions available yet?
Oh, that's an interesting question. I don't know; something that I've seen a lot is that a lot of agencies needed that data pipeline so they could consolidate all the information in one place, and while I know that a lot of agencies have moved on to the next stage, that's something that was very repetitive. I think the challenge was exactly what we were talking about: how you can build this data pipeline with those limitations. So, for example, sometimes the client doesn't want to go to the cloud, so you need to work with that, but they want it somehow to be optimized, and maybe not automated, but the process needs to be optimized and some of the parts automated. That was very challenging, and it's not an isolated case; that happens a lot. But at the end of the day, there is a solution for that, we can just work around it.
Dealing with technical debt
Thank you very much. I see, Sam, you asked a question in the chat, I think it has a star there, so I'll read it: have you worked on projects where you inherited technical debt, and how have you tackled that to obtain a more future-proof solution?
Yeah, sometimes there are projects that are just too good to pass on, but somebody else started it, so you are inheriting it, and then when you see the code, you are going to see a lot of to-dos in there, and it's like, okay, good, it makes sense. Also, we have done work that needed to be done from scratch, and given the circumstances, there are some unknowns, so I understand that part. But yeah, I think the challenge sometimes is to know why they made certain decisions, why they built this tool, why those to-dos are there. Sometimes, they don't make sense with the final infrastructure or the whole solution, and yeah, I guess you just tackle them one by one, because there's always going to be that technical debt.
I think one of the contracts that I remember was very successful at that, because we had some overlap with the previous contractor, so they were able to tell us exactly what tech debt was there: these are the things that we didn't have time to do, or that, because of priorities, we never got to. We were able to put all of those in tickets and then prioritize them. From what they shared, this is what we need to work on right now before we jump into the actual work, and the rest is going to be just little by little. Some of those tickets are not going to apply anymore sometimes, because maybe we changed to another technology that we found was more cost-effective. Sometimes, we are inheriting projects that are maybe not super old, but yeah, maybe ten years, and ten years ago the technology was very different, so there will obviously be more, or better, tools that we can just change to, so some of the tech debt is not going to apply anymore. But I think if we have the opportunity to ask the previous contractor, we will take it and then prioritize. If not, then at the beginning, when we try to host or replicate the infrastructure, is when we are going to see the biggest things that we need to tackle. The hidden ones are those to-dos in the code, because you don't see them right away, but one thing that you could do is just try to find them. Just look at all the to-dos in the code and see if you can identify the ones that are really big or important. In some cases, maybe most of them, you're not going to be able to tackle all of them; sometimes one is really huge, but yeah, I guess just prioritize and focus on the big ones.
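One lightweight way to surface those hidden to-dos is a small script that walks the inherited codebase and collects every TODO or FIXME comment with its location. This is a sketch in Python; the file extensions and comment markers are assumptions, so adjust them to whatever the repo actually uses:

```python
import os
import re

def find_todos(root, exts=(".py", ".R", ".sql")):
    """Walk a source tree and collect TODO/FIXME comments with their locations."""
    pattern = re.compile(r"(TODO|FIXME)[:\s](.*)")
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(exts):
                continue  # skip files outside the source extensions we care about
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, start=1):
                    match = pattern.search(line)
                    if match:
                        found.append((path, lineno, match.group(2).strip()))
    return found
```

Running something like this over an inherited repo gives you a quick list to turn into tickets and prioritize, alongside the bigger debt that surfaces when you replicate the infrastructure.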
Ivonne's journey from generalist to data engineer
Thank you. I'd love to hear a little bit more about your journey as well, and your jump from generalist software engineer to specialized data engineer, and what that was like.
Yeah, so I graduated as a computer systems engineer, and at the beginning, I didn't know what to do. I didn't know what the industry was offering to begin with, so I was just saying yes to any offer, like whatever life was throwing at me, and so I had the opportunity to have very different jobs. At the beginning, I was a front-end developer, migrating an application. I will sound very old, but yeah, that was a standalone application on a mainframe, with green screens, and I was migrating that to the web. So I learned some web development that later on didn't help me a lot, because I was using technology that is not used anymore, but I was using Java, and Java really helped me out later. From that, I jumped to being a tester, doing software validation for aircraft engines, so a whole new thing, and then I jumped into the retail sector doing back-end work for an operating system. There, I was using Java as well, and some C or C++, so it was more like low-level programming, and that helped me a lot with my programming skills. Because it was a very low level, like programming the signals that the keyboard sends to the operating system, things like that, you need to use a lot of linked lists, arrays, vectors, all those data structures that I thought, I'm not going to use this when I work. Well, I was using all of those, and that helped me with my coding skills.
Then, all of that was in Mexico, and I also was a data engineer in Mexico; that was my last job there. Then I moved to the United States, and at that point, I knew I wanted to work with data, because I really liked that. It wasn't as low-level as the operating system, but I still got to use those coding skills to parse data, so I really liked that. But when I was here, I knew that I needed to take any opportunity that I had, and I learned some web development frameworks, like Ruby on Rails, Python with Django, PHP with Drupal, and Go, different stacks there. So I felt that I was moving around. I was also on the infrastructure team for a while, so I was learning a lot, and that was helping me get put onto contracts, because I was able to do a lot of stuff. Maybe I wasn't very knowledgeable in anything, but I knew a lot; it was very wide, not deep. But I really wanted to, one, go back to data, and two, start specializing in something. I felt like, okay, I enjoyed learning all those skills, but I was seeing other people that really knew their areas deeply, and I wanted to be like that. To be on data contracts, because I wanted to keep working for the federal government, you need some clearances, especially when you are seeing data, and for that, I needed to be a US resident. So, finally, I became one, and I was able to move to the data field again, and that's when I started specializing more. And yeah, I really like that, because there's just too much to learn across all these fields, like infrastructure, front-end, back-end, and it can get crazy sometimes, and just in data there's a lot to learn as well, but at least I'm more focused now on what I want to do, or where I want to go.
Roadmap to data science
I see, Victor, you had asked a question in the chat. It was, what should be the roadmap to data science based on your experience? And Victor, I'm curious what perspective that's coming from, like just starting to learn data science, or which role you're in now?
I don't know if I can really answer that, but I will tell any person that wants to go into data science: well, first, find a mentor that can really help you or guide you in that part. But as a data engineer, I will encourage you to use cloud tools to learn that part. I know that data engineering may not be the career path that you want, but you will always need some data engineering skills, especially if you want to show something or prototype something, and you can do that yourself. Just small projects; you don't have to start with big data. Obviously, if the scope of the project is huge, you will need a lot of teams, like data engineers and infrastructure engineers, but just knowing what all the teams do, I think that's going to be very helpful when you want to collaborate with others. So it will help you to collaborate, and also to do your research or analysis yourself more easily, because you won't depend on anyone. You can start using some tools; for example, if you want to use AWS, you can start using S3 for storage and maybe start playing around with Athena or with Glue for transforming the data. Then, when you want to show a proof of concept, maybe someone can take that and scale it. And yeah, I think as a data scientist, you still need to clean your data anyway, so that's a part I can talk about. I just really encourage you to grow your data engineering skills as well. That will make you a more solid data scientist. Maybe at the beginning, you need to focus on other skills, but if you want to jump into the industry, you will always need to do something at a small scale.
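As a sketch of what such a small self-contained project might look like, here is the extract-transform-load pattern in plain Python. The column names and data are invented for illustration; on AWS, the extract and load steps would roughly map to S3 objects, and the transform to something like a Glue job or an Athena query:

```python
import csv
import io

def extract(raw_csv):
    """Extract: read raw rows from a CSV source (here a string; on AWS, an S3 object)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: clean the data -- drop incomplete rows, normalize text and types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append({"region": row["region"].strip().lower(),
                        "amount": float(row["amount"])})
    return cleaned

def load(rows):
    """Load: aggregate into the shape an analyst would query downstream."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

# Invented sample data: messy capitalization, stray whitespace, a missing value.
raw = "region,amount\nEast ,10.5\nWest,\neast,4.5\n"
print(load(transform(extract(raw))))  # {'east': 15.0}
```

The point is not the code itself but the shape: ingest, clean, aggregate. Doing this at a small scale by hand is exactly the experience that later transfers to the cloud tooling.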
Learning new programming languages
Yeah, for sure. Hi, everybody. My question is around the languages. You talked a lot about a variety of languages that you've worked with, and languages are similar and dissimilar in different ways. Do you have a framework when jumping between the different languages, and if so, what is the first thing that you look at when you're looking at a new language that you're going to use, or a new framework within AWS, for example? What are you looking for?
Yeah, that's a really good question. When I need to learn a new language, the first thing I look at is the data types. They can change slightly between languages, but that's the first thing you work with, variables. If you have text, is it called a string or something else? If you have a boolean, is it bool or boolean? Just know the data types, and then the control flow, the syntax for those flow statements: if, for, while, switch/case, and so on. Those also change between languages, with slight differences. In one it's for, then the name of your temporary variable, then a colon and the list; in another it's between parentheses, or written the other way around. Just knowing that lets you start. At the beginning my code is super basic, it has a lot of ifs and fors and maybe that's all, but little by little I learn ways of optimizing the code. That's the very, very first thing I do or try to learn. One of the other things you need to do, especially in data, is open a file: open a CSV, open an Excel file, open a JSON file. That's what I'll do next. How am I going to read a JSON file? What is the code I will use? In Python, for example, you would do it with pandas; it has a very easy function for that. And that brings me to another thing.
Sometimes the language is not specialized for handling data. Compare Python with R: R is specifically built for statistics and data, so it's very easy to use for that, but Python wasn't created for data. So that's another thing you will need to find out: what are the libraries used for data analysis? If it's Python, you'll say, okay, there's NumPy, there's pandas, and then look at the documentation and what you can do with the data. That would maybe be the last thing I do, and then it's just practice, just practice and trying to get better at it.
Tech stack and data visualization preferences
Thank you. I see there are a few Slido questions that touch upon a similar idea, and one was asking about what your tech stack looks like, and then the other is around preferences for data visualization and why.
Sure. For the tech stack, it depends on the contract I'm working on, so I don't have one specific stack. At Bixal, we try to push for AWS, and on the infrastructure side we've built an environment we can reproduce using Terraform, which is why we like it. If we know we're going to need some extra buckets, maybe a way to transform the data with AWS Glue or Lambda, and then another storage layer, we can deploy that infrastructure very quickly because we already have it in Terraform. It also meets the requirements some federal contracts have, security compliance like FedRAMP, for example. So if we're free to use any tech stack, we're going to use AWS for sure. For the ETL, the extract, transform, load process, we tend to use Python, just because it's easier to deploy to AWS and it's always easier to find people who know Python. At some point we were leaning more into R, but it was hard to find people, and especially two years ago it was harder to use in the cloud. Right now it's getting better.
For data visualization tools, well, I like Shiny. I've worked with Shiny in RStudio, and it's just very fun to work with, making your charts and visualizations prettier and interactive. I really enjoyed working with it, so I recommend it, because it's fun, because I really like it, and you can customize it as much as you need. Sometimes when you're using a tool like Power BI or Tableau and the customer says, oh, I want my branding and my theming, it's not as easy to customize it as much as the client wants. But it also depends on the time and resources you have. If you have limited time, sometimes Tableau is the easiest way to go, and it's also the tool clients know best. If we could do it our way and had unlimited time, we would have to decide between doing it with Shiny or maybe a React application; it also depends on who is going to build it. But my favorites would be Shiny and Tableau.
Sharing data internally and externally
Thank you very much. One other follow-up question someone asked was around sharing APIs inside and outside of the organization. Is there a certain tool that you use for that? I'm not sure I understand the question.
Let me double-check. AJ, I see you asked that. Do you want to jump in? Yeah, so a lot of times you have great tables, logic built into tables, and your suppliers or people cross-functionally want to get their hands on that data, but they're huge files, right? They're tables that get refreshed every so often. So is there a way you prefer sharing those big data files internally or externally, like through API tools? I don't know if AWS has that technology.
Yeah, that's also a good question. We've been on projects where it's easy to just push that data to GitHub or GitLab, but that's not really best practice. Maybe that's something you can do in the prototyping phase, but if it is private data that you cannot share, you should never put it on GitHub. We have used two options. One is to upload the data to S3 and give access to the people who need it. S3 is a good tool because it is reliable and scalable, so you can upload as much data as you want, and you can also restrict the data depending on what people can see: you assign roles to people, and they will only see what they need to see. So that's a good solution.
Another one we have used, when less technical people are going to be participating, is SharePoint, because we use a lot of Microsoft tools, Teams and Outlook, so a lot of people within Bixal are already familiar with SharePoint, and you can also restrict the data there. I think we'd prefer S3, but depending on who we're going to be collaborating with, maybe we'll use SharePoint. Then, when you're setting up your environment, you know where to go to download the data. If it's S3, you can just call it from your code: if you have access, you just need your AWS keys and you can download the data programmatically. If it's SharePoint, I think there's a way to do it too; we haven't used it, but that's something to explore. We also use a lot of web applications. We use Drupal, a PHP framework for content management, and if we need to download data from there, or use content that lives there, and the web application is within the company, so Bixal developers are maintaining it, we can ask them to create an API for us. Drupal has the ability to enable an API, and then we can use it, or we can request exactly the data we need and they will implement that endpoint for us. So those are the three options we use when we need to share information internally.
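The "just call it from your code" option for S3 can be sketched as follows. This is a hedged example, assuming the boto3 AWS SDK is installed and your AWS keys are configured locally; the bucket and key names are hypothetical:

```python
def download_shared_file(bucket: str, key: str, dest: str) -> None:
    """Download one shared object from S3 to a local path.

    Assumes AWS credentials are configured (environment variables or
    ~/.aws/credentials) and that your role grants read access to the bucket.
    """
    import boto3  # AWS SDK for Python, assumed installed

    s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest)

# Hypothetical usage (needs real credentials and a real bucket):
# download_shared_file("team-data-bucket", "exports/latest.csv", "latest.csv")
```

Access restriction then lives entirely in IAM: whoever lacks the right role simply gets a permission error from the same call.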
Data engineering experience requirements when hiring
And Jorge, I see you asked a question in the Zoom chat around data engineering positions. Do you want to jump in and ask it? Yeah, sure. So the question is this: when I was transitioning from developer to working in machine learning, I came across a lot of positions in data engineering, and oftentimes, as soon as a recruiter heard that I didn't have experience with Spark or any of the other distributed computing systems, MapReduce, all of those, I had only covered them in grad school and hadn't used them in any job, they just immediately lost interest. They said, no, we need people with experience in those, and that was it, right? So, do you find experience in those frameworks necessary to be a data engineer?
So, you will learn that. When I am hiring people for a data engineer position, I focus more on the programming skills. If you are a good engineer, you will learn Spark or MapReduce in no time. Seriously, I wouldn't want to lose a good engineer because of one skill. It is necessary in the sense that you will end up using those tools, but in my opinion, I wouldn't say no to a person I know is good when I can teach them or mentor them in those frameworks. When I started my first job as a data engineer, I didn't know any of those. I didn't know what big data was, or MapReduce, or Spark. I had a lot of experience with Java, and that's why they hired me: they thought, okay, she knows Java, she knows how to program, and that's more important. I agree with that, and it's the philosophy I still follow to this day. Programming skills are the basics; you really need to know how to write code, and if it's a programming language that is more common in data, like Java or Scala, Python or R, that's going to be better. If your programming language is more like PHP or Ruby, those are good. I