Resources

Richard Vogg | Examples of simulated datasets that bring value to a data-driven company | RStudio

Full title: How I became a Data Composer – examples of simulated datasets that bring value to a data-driven company

How can I get the buy-in from business partners to use more advanced techniques? What can I do to make a data project involving several teams more efficient? And how can I train analysts who do not (yet) have access to sensitive data? A good data composer is skilled at creating suitable data quickly and efficiently. R has many functions and packages that help with simulating independent variables and composing those in a meaningful way. In this talk, I will share how I started creating data and how this skill helped me with solving some of the issues described above. Showing a few examples – of small, medium-sized, and large data composition – I want to encourage attendees to simulate data and enrich their data skillset.

About Richard: Richard Vogg studied mathematics at TU Kaiserslautern, Germany, where he focused on statistics and obtained a Master’s degree. He worked as a Senior Business Analyst at Evalueserve in Chile for the last few years, analyzing data for a major US bank. At the end of 2020, he moved back to Germany. Richard is a fan of applied statistics and storytelling with data. Outside of R, he enjoys playing the ukulele, trumpet, and didgeridoo.

Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Richard Vogg and I'm happy to be here and talk about how I became a data composer. This ability to compose data, to create data from scratch, became increasingly helpful for me at my last job. I was working for a data analytics company in Chile for the last few years, and the client that we were supporting with analytics was one of the largest banks in the United States.

Similar to a music composer who assembles musical notes and chords into melodies and songs, I started to play around with datasets. Using R, it is very easy to get started, and it is fun to get started, as many of you might also have experienced. For example, it is very easy to create a thousand random normally distributed values, and we can just as easily create variables from other distributions, like the exponential distribution, or create categorical variables by using the sample function. We can combine all those variables in a data frame, and then we have our first, in this case nonsense, dataset ready.
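The steps described here can be sketched in a few lines of base R. This is only an illustration: the column names, sample size, and distribution parameters are assumptions chosen for the example, not the values used in the talk.

```r
# Fix the seed so the composed data is reproducible.
set.seed(42)

n <- 1000

# Three independent variables, one per kind of distribution.
# All names and parameters here are hypothetical.
composed <- data.frame(
  score     = rnorm(n, mean = 0, sd = 1),   # normally distributed
  wait_time = rexp(n, rate = 1 / 5),        # exponential with mean 5
  segment   = sample(c("A", "B", "C"),      # categorical via sample()
                     size = n, replace = TRUE,
                     prob = c(0.5, 0.3, 0.2))
)

head(composed)
```

Because the three columns are drawn independently, there is no relationship between them yet, which is why the result is a "nonsense" dataset.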

And then we can go and take a look at the data and we can check if what we composed is actually what we wanted to compose, if it makes sense.
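One possible way to do this check, assuming a small composed data frame with a numeric and a categorical column (both hypothetical), is to compare simple summaries against what was intended:

```r
set.seed(1)

# A small composed dataset; names and parameters are assumptions.
composed <- data.frame(
  score   = rnorm(1000, mean = 0, sd = 1),
  segment = sample(c("A", "B", "C"), size = 1000, replace = TRUE,
                   prob = c(0.5, 0.3, 0.2))
)

# Structure: column types and the first few values.
str(composed)

# Numeric column: the mean should be close to the intended 0.
summary(composed$score)

# Categorical column: observed shares should be close to 0.5 / 0.3 / 0.2.
segment_share <- prop.table(table(composed$segment))
print(round(segment_share, 2))
```

A quick plot, for example `hist(composed$score)`, serves the same purpose visually.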

Using data composition in business

So I started to create small melodies, small data melodies in the beginning because it was fun and later because it was useful. I used this skill in the business world to improve the communication with non-technical business partners. So I created small and easy datasets and visualizations about topics that the business partners felt comfortable with and used them to explain concepts, explain methods and in general to create a common ground for discussion and for questions.

When you compose a song for your band, it is not enough to have a beautiful melody for each one of the instruments; you also want them to sound good together. So in data words, we do not only want independent variables, we also want them to make some sort of sense together. I trained that skill and used it a lot, because we were working with sensitive data: you could not just send data to someone else with a question, because not everyone had the same level of access to the data. And instead I quickly assembled a suitable dataset for each occasion and was able to ask questions but also to give advice without having to see the real data or having to touch the real data.
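A minimal sketch of variables that "make sense together" is to derive each new variable from the previous ones plus some noise. The customer attributes and all coefficients below are invented for illustration and are not taken from the talk.

```r
set.seed(7)

n <- 500

# Start from one independent variable.
age <- sample(18:80, size = n, replace = TRUE)

# Income rises with age on average, with random noise on top
# (hypothetical coefficients).
income <- 20000 + 600 * age + rnorm(n, mean = 0, sd = 8000)

# The chance of holding a premium product increases with income;
# plogis() maps the scaled income to a probability between 0 and 1.
p_premium <- plogis((income - 50000) / 15000)
premium   <- rbinom(n, size = 1, prob = p_premium)

customers <- data.frame(age, income, premium)

# The built-in dependencies show up in simple summaries.
cor(customers$age, customers$income)
tapply(customers$income, customers$premium, mean)
```

A dataset like this contains no real customer, so it can be shared with a question attached even when the real data cannot.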

Another application emerged when we had more and more agile projects. We had this aim of having a working deliverable in each iteration, and this was sometimes hard to reach because we had to wait a long time for the data: we had to talk to the data owner, sometimes request access, and we had to bring the data into the right format. By using a composed dataset that looked like the expected final dataset, we could already start to work on the output, on the report, while we were still waiting for the real data. In the end we simply replaced that composed dataset with the real dataset, and this improved the whole process a lot.
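The idea can be sketched as follows, assuming the expected final dataset has the columns `month`, `region`, and `balance`. That schema and the `build_report()` helper are hypothetical and only serve to show the swap.

```r
set.seed(3)

# Composed placeholder with the schema we expect from the data owner
# (hypothetical columns and values).
placeholder <- data.frame(
  month   = rep(month.abb[1:6], each = 50),
  region  = sample(c("North", "South"), size = 300, replace = TRUE),
  balance = round(rlnorm(300, meanlog = 8, sdlog = 1), 2)
)

# The report logic only relies on the column names, not on where
# the data comes from.
build_report <- function(data) {
  stopifnot(all(c("month", "region", "balance") %in% names(data)))
  aggregate(balance ~ month + region, data = data, FUN = mean)
}

# Develop and review the report against the placeholder first.
report <- build_report(placeholder)
print(report)
```

Once the real data arrives in the same format, only the input changes, for example `build_report(real_data)`, while the report code stays as it is.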

Composing full databases

When you want to write a symphony, you're not just aligning a few instruments; you actually have blocks or groups of instruments that have to sound good together. In data words, that could be a database where you have several tables that are all connected with each other. It is not easy to create a complete database from scratch, but we still did it to improve our internal courses, because we wanted to make those courses closer to what we actually work with. We created a client table, a transaction table, and account tables, combined all of them, and made this whole database available to everyone in the company, for the internal courses and also for small prototypes and projects.
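A much smaller version of such a database might look like the sketch below. The table layouts, key names, and sizes are assumptions made for this example; the point is only that every foreign key is sampled from the keys of its parent table, so the tables stay connected.

```r
set.seed(11)

# Client table: one row per client (hypothetical columns).
n_clients <- 100
clients <- data.frame(
  client_id = seq_len(n_clients),
  segment   = sample(c("retail", "business"), size = n_clients,
                     replace = TRUE, prob = c(0.8, 0.2))
)

# Account table: each client holds between one and three accounts.
accounts_per_client <- sample(1:3, size = n_clients, replace = TRUE)
n_accounts <- sum(accounts_per_client)
accounts <- data.frame(
  account_id = seq_len(n_accounts),
  client_id  = rep(clients$client_id, times = accounts_per_client),
  type       = sample(c("checking", "savings"), size = n_accounts,
                      replace = TRUE)
)

# Transaction table: every transaction points to an existing account.
n_tx <- 2000
transactions <- data.frame(
  transaction_id = seq_len(n_tx),
  account_id     = sample(accounts$account_id, size = n_tx, replace = TRUE),
  amount         = round(rexp(n_tx, rate = 1 / 80), 2)
)

# Joining all three tables keeps every transaction, which confirms
# that the keys line up.
full <- merge(merge(transactions, accounts, by = "account_id"),
              clients, by = "client_id")
nrow(full)
```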

How to compose with R

All this was about why data composition is important. Now, how to compose with R. There are of course the distributions; I already showed in the beginning how easy it is to work with them. There are also a lot of functions that can help you to correlate your simulated data, to create those connections between the variables. There are functions that help you to create the relations between tables. And of course there are a lot of packages out there, just waiting to be explored, that can support you in your data composition process.
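As one example of correlating simulated data, two normal variables with a chosen correlation can be built in base R with a Cholesky factor. This is a common technique and not necessarily the one the speaker has in mind; the target correlation of 0.7 and the sample size are arbitrary choices for the sketch.

```r
set.seed(2021)

n <- 5000

# Target correlation matrix for two standard normal variables.
target <- matrix(c(1.0, 0.7,
                   0.7, 1.0), nrow = 2)

# Independent standard normal draws, one column per variable.
z <- matrix(rnorm(n * 2), ncol = 2)

# chol() returns an upper triangular matrix R with t(R) %*% R == target,
# so the columns of z %*% R have covariance (and correlation) `target`.
x <- z %*% chol(target)

observed <- cor(x[, 1], x[, 2])
observed
```

The observed correlation will not be exactly 0.7, but it gets closer as the sample size grows.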

And as I just have a few seconds left: I wrote a blog post for each one of those four points. I invite you to visit my blog, where there are a lot more details about those topics, and I hope they might be helpful for you to become a data composer too. Thank you very much for your attention.