
Santiago Rodriguez | Intro to functional data analysis | RStudio Meetup

Energy Meetup - Intro to functional data analysis
Presented by Santiago Rodriguez

Timestamps:
- 2:49 - Start of talk
- 12:15 - First Q&A point
- 23:23 - Second Q&A point
- 31:56 - Third Q&A point
- 40:44 - Final Q&A point

Abstract: The focus of this talk is to introduce functional data analysis (FDA) and to showcase some of its applications in the utility space. A primary source of data in a utility is meter reads. This data is periodic and appears discrete, but energy consumption is continuous, which makes meter reads a perfect use case for FDA. The talk will highlight two applications: load profiles and segmentation. The talk will be non-technical - no math or code - because the goal of the talk is to persuade you to investigate FDA on your own.

Speaker Bio: Santiago is a data scientist in the marketing department at Consumers Energy, a Michigan-based public utility. He focuses on data engineering, data science, and MLOps. Santiago has about a decade of experience working in the analytics space across energy, aviation, automotive, and contact centers. He has a bachelor's in finance from Florida State University and a master's in statistics from Texas A&M.

Resources shared in the chat:
- FDA descriptive statistics blog post: https://lnkd.in/gZjauTdt
- Intro to FDA blog post: https://lnkd.in/gM5NCbHe
- The refund package is introduced in "Introduction to Functional Data Analysis" by Kokoszka and Reimherr: https://lnkd.in/gd5Bt3tE
- "Gaussian Process Regression Analysis for Functional Data" by Shi & Choi: https://lnkd.in/gAig-AZa
- GAMs in R, a free course by Noam Ross: https://lnkd.in/gput2QbK
- Jiguo Cao - Intro to FDA on YouTube: https://lnkd.in/ga4qfGDr
- Feedback: rstd.io/meetup-feedback
- Talk submission: https://lnkd.in/gJ7EUSCk
- If you'd like to find out about upcoming events, you can also add this calendar: rstd.io/community-events

Packages shared in the chat:
- CRAN Task View: https://lnkd.in/gBqtuV2b
- fda: https://lnkd.in/ge85i6UE
- refund: https://lnkd.in/gXGcT79t
- fda.usc: https://lnkd.in/ggHdUaEe

Mar 24, 2022
1h 0min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Awesome. Well, let's get started here, and we can let people come in from the waiting room as they join as well. But hi, everybody. Thank you so much for joining us today. Welcome to the RStudio Enterprise Community Meetup, our energy meetup today. I'm Rachel, calling in from Boston. I'm actually at the RStudio office today in the Seaport.

If you just joined now, feel free to introduce yourselves through the chat window and say hello, maybe where you're calling in from. I like to let people know at the beginning that if you want to turn on live transcription for the meetup, you can do so in the Zoom bar below if you just press More.

But to go through a brief agenda: we'll have some short introductions of the meetup and welcome everyone here, then our introduction to functional data analysis with Santiago Rodriguez, and then lots of time for questions and open discussion at the end as well. Just a reminder, this meetup will be recorded, so it will be shared to the RStudio YouTube.

To ask questions, if you don't want to be part of the recording, you can always use the Slido link that I'll share in just a moment in the chat. Yeah, you can put your name in there too if you want and I can call on you to ask the question live or you could ask anonymously.

And for anybody joining for the first time, welcome. This is a friendly and open meetup environment for teams to share the work they're doing within their organizations, teach lessons learned, network with each other, really just allow us all to learn from each other. So thank you all so much for making this a welcoming community too. We really want this to be a space where everybody can participate, we can hear from everyone, regardless of your level of experience or the industry that you work in.

With that, thank you all again for joining us. I would love to turn it over to our speaker, Santiago Rodriguez. Santiago and I are friends from LinkedIn. That's actually how we first met. Santiago is a data scientist in the marketing department at Consumers Energy, a Michigan-based public utility.

Introduction

Hi, Rachel. Hi, everybody. And that's exactly right. I reached out to Rachel one day because she had a really interesting post on some functions in R that I hadn't heard of before. I thought they were great. So I just sent a, hey, cool post and just ended up here. Maybe that's inspirational for somebody. If you have something you want to share, reach out. There's a community.

Okay, so let's get started. Our agenda today is going to cover some introductions: I'm going to introduce myself, the topic, and a little bit about Consumers Energy, because I'm using some of their data, so I want to give a shout out; then a definition of what functional data analysis is. The meat of the presentation is really examples and applications of functional data analysis, this method of analysis. And then we'll wrap up with some resources in case anybody's interested in how I got started.

All right, allow me to introduce myself a little bit further. I was born in Ecuador, South America. I saw somebody in the chat was from Colombia; hello, neighbor. I grew up in South Florida, and I've lived in Dallas, Texas for seven years or so. I have a bachelor's from Florida State University and a master's in statistics from Texas A&M. I currently work as a data scientist in marketing, and I've had the pleasure of working across different industries, primarily because I'm a learner and I love learning new things in different industries and different functions.

And then when I'm not at work, I'm primarily reading, usually stats books. And if it's not stats books, it's fiction, nonfiction. I like to travel. This year, my wife and I have dedicated time to traveling. And I love to fish. I grew up in South Florida. There's a body of water around every corner.

Allow me to introduce Consumers Energy. They're the sponsor of the data for this presentation. It's a public utility founded in Michigan. They serve the majority of Michigan's residents and have a generation capacity of about six gigawatts.

And then about today. My talk will be primarily descriptive, non-technical. We won't get into math or code, really. It's all about what is functional data analysis and how can you use it, what are applications. And my goal today is to introduce this relatively young branch of statistics and then show you that it has value, that it can add value, that it's worth your time to explore and maybe learn as well.

And I wanted to quickly share how I got started with this. I'm by no means an expert on functional data analysis; I've been playing around with this stuff for a few months, probably close to a year. In the utility space, meter reads are our primary data source for a lot of things, and the time series nature of meter reads allows you to do a couple of different interesting things. If you're a time series person, you have decompositions and other more traditional stuff. I found that functional data analysis with meter reads was such a perfect combination, and I'll show you some of these examples in a bit.

As far as logistics, I've added little break points into the presentation, partly to give me a break, get a glass of water if I need. And if you have any questions, you can ask a question in that section. That way, you don't forget or I don't forget what the heck I was talking about.

What is functional data analysis?

All right, first up, this is an academic section, sort of: a definition, what can we do with functional data analysis? It's the only math formula you'll see today, and it's useful up front, before we get into applications, to define what functional data analysis is. FDA, that's the acronym, functional data analysis, is the analysis of information on curves and functions. For our purposes today, we're going to highlight curves, but if you work with, let's say, spatial data, you can use functions as well.

Functional data analysis is essentially a nonparametric, flexible regression technique that is used to approximate a curve or a function via a linear combination of basis functions. It looks like a regression formula: you have coefficients and data points. In this case, there's a slight tweak. Using these red dots, our data, in this case meter reads, we're going to fit a curve, right, we're going to approximate a curve using basis functions. That's the phi function in the formula. It's a nonparametric function, and it has the form of the bottom-right graph; it looks like lines.
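
The slide's formula isn't reproduced in the transcript, but in standard basis-expansion notation it would read:

$$
x(t) \;\approx\; \sum_{k=1}^{K} c_k \, \phi_k(t)
$$

where the $c_k$ are the coefficients estimated from the data and the $\phi_k$ are the basis functions he's calling the "phi function."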

And that's essentially it; I don't want to get too into the weeds here. You have a formula, you have coefficients that you estimate, and you can do your usual stuff, like least squares. And then the key here is this phi function that creates these curves, and then you fit your data with those.
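
No code is shown in the talk, but a minimal sketch of that fit in R with the fda package, on made-up hourly reads, could look like this:

```r
library(fda)

# Toy data: 24 hourly meter reads for one day -- the "red dots"
# (values are made up for illustration)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * (hours - 6) / 24) + rnorm(24, sd = 10)

# The phi functions: a Fourier basis over the 24-hour range
basis <- create.fourier.basis(rangeval = c(0, 24), nbasis = 7, period = 24)

# Estimate the coefficients by least squares and fit the curve
fit <- smooth.basis(argvals = hours, y = reads, fdParobj = basis)

plot(fit$fd)                       # the approximated curve
points(hours, reads, col = "red")  # the original meter reads
```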

Naturally, the question is: all right, we fit these curves, now what can we do with them? And it turns out you can do quite a bit, actually. You can do descriptive stats: max, min, median, variance, confidence intervals. You can do interpolation, things such as connecting the dots; if you have meter reads and you want to show a line, instead of using the default plotting method, you could use functional data analysis and connect the dots a little more smoothly. You can do extrapolation; functionally, that looks like functional regression, if you want to do some inferential-type work. If you're doing some predictive work, you can use GAMs; if you've ever used those, you've been using functional data analysis on the back end, since they use cubic splines. A relatively newer approach is clustering: you can cluster the curves. And if you're doing time series analysis, you can use some of these techniques as part of that as well.

And here's the pitch, right? I think it's worthwhile to explore functional data analysis; I know I found it useful. It's another tool in your toolbox, and if a question comes along from a stakeholder or something, you potentially have a better-fitting tool for the job. And if you've ever worked with your hands, or you've ever built anything, you know that the right tool makes the job so much easier.

As far as fitting the curve, as far as the method goes, it's quite flexible. It allows you to fit a curve via that phi function, right, that nonparametric function, in a number of different ways. Fourier bases are really good for periodic data, and in this case, meter reads tend to be periodic or semi-periodic; there's some kind of cyclicality to things, so it's good for that. B-splines, a common one, cubic splines, wavelets, etc. And then it's not only about fitting the curve, it's about fitting the right curve: you want to make sure you've accounted for residuals, that you're not overfitting and not underfitting. And there are a couple ways to fit the optimal curve: you have least squares methodology, as well as a penalization technique.
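
A rough sketch of those choices with the fda package follows; the toy data and the lambda value are arbitrary, picked just for illustration:

```r
library(fda)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * hours / 24) + rnorm(24, sd = 10)

# Fourier basis for periodic data; B-splines (cubic at order 4) otherwise
fourier_basis <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
bspline_basis <- create.bspline.basis(c(0, 24), nbasis = 12, norder = 4)

# Penalization: penalize curvature (the 2nd derivative) with smoothing parameter lambda
fd_par <- fdPar(bspline_basis, Lfdobj = int2Lfd(2), lambda = 0.1)
fit    <- smooth.basis(hours, reads, fd_par)
fit$gcv  # generalized cross-validation score, one way to judge the fit
```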

And then, arguably, the coolest thing about functional data analysis and the curves that you fit is that those curves are differentiable. At first, I didn't realize how useful that could be, and I kind of glossed over it. I've recently gone back, learned more, and explored that area further, because it opens up a treasure chest of information. I think some areas have experience working with derivatives: for example, engineers might look at rates of change, and physicists use derivatives often, right? But I've worked primarily in commercial functions, marketing, finance, operations, and I hadn't had a need to look at derivatives. It really is interesting, and I have an example of that in a bit.

First Q&A

There's a question in the chat here about overfitting. That's definitely possible. If you were just connecting the dots, you probably wouldn't care too much about the fit. But if you're doing anything inferential, or you want to account for overfitting or underfitting, I would suggest splitting the data, your typical hyperparameter tuning: try different numbers of basis functions and see which one has the best sum-of-squares error. You could also use the penalization technique; I found that to be probably the best approach. In summary, there are ways to account for and address underfitting and overfitting.
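
One way that tuning might look in code, sketched with the GCV score that fda's smooth.basis reports (toy data again; in practice you could also hold out data and compare sum-of-squares error, as he suggests):

```r
library(fda)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * hours / 24) + rnorm(24, sd = 10)

# Grid-search the penalty; a smaller GCV score suggests a better
# balance between underfitting and overfitting
basis   <- create.bspline.basis(c(0, 24), nbasis = 12)
lambdas <- 10^seq(-4, 2, by = 0.5)
gcvs    <- sapply(lambdas, function(lam) {
  smooth.basis(hours, reads, fdPar(basis, int2Lfd(2), lam))$gcv
})
best_lambda <- lambdas[which.min(gcvs)]
```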

There was an anonymous question: what is the sample size you need to fit the curves?

Good question. Let's see. I am working on stuff on the side, so everything I'll share after this section are snippets of applications I'm working on as proofs of concept at work. And I've worked with millions of records, because, as you'd figure, time series data gets so big so fast. Or you can work with just 24 points, right? That's actually a really good question; it's a perfect segue into the next section. Let's say I have a year's worth of meter reads and I summarize them by day, right? So I take the mean by hour, over the 24 hours, and I'll end up with 24 points. That's like the picture we saw earlier: those red dots are 24 red dots, one for each of the hours. That's all you need to fit the curve.

If you wanted to do a little bit more, maybe some inferential work like building confidence intervals, then just having those 24 summarized points won't do you any good, because you have no way to construct the standard deviation; you don't have that sigma, the covariance matrix. So, and you'll see this in a second, you can work with the summarized information, just the 24 points in this example, or you can work with all the data, and then you have access to more information and can do a little bit more. But depending on your machine, depending on your hardware, that gets pretty out of control quickly.
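
A hedged sketch of that summarization step with dplyr; the meter_reads table and its column names are hypothetical stand-ins:

```r
library(dplyr)
library(lubridate)

# meter_reads: one row per read, with columns `timestamp` and `kwh`
# (table and column names are hypothetical -- adjust to your schema)
daily_profile <- meter_reads %>%
  mutate(hour = hour(timestamp)) %>%
  group_by(hour) %>%
  summarise(mean_kwh = mean(kwh, na.rm = TRUE), .groups = "drop")
# -> 24 rows, one mean read per hour: the "24 red dots"
```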

Load profiles

Okay, so we talked about summarizing the time series data, and in the utility space, I found that this is probably the most common way to do that. It's called the daily load profile. It's a way to summarize the consumption information so you can easily see trends. And there are a couple different ways to do this; load has different definitions, consumption and demand. The functional data analysis technique works for however you choose to summarize the data. I'm in marketing, so I tend to use the mean or some kind of centrality-type measure, mean or median, to summarize. But you could use the max if you're more focused on electrical generation capacity, what's the maximum load the system can handle.

And once you have these load profiles, you can connect the dots with functional data analysis, like we talked about. That was one of the questions: how many points do you need? As few as whatever you have, 24 in this case, or as many as you have. You can build confidence intervals, and I find that's useful: instead of just providing a point estimate, you provide a little more clarity as to what the variance looks like. And the daily load profile is the most classic decomposition here, but there are others; you can decompose it any other way.

This is our first example. This is time series information; this is an actual customer's meter reads. This is only two weeks; I'm working with about two years' worth of data on the back end, and that would look very messy if I plotted the whole thing. But this is essentially the underlying data structure. It's a time series. You can see it's not really upward-trending or anything, but there's some cyclicality to it, there are ups and downs. So if you've studied time series, you can look at this and extract information from it. But it's useful to summarize and decompose this information, and that's where the daily load profile, load profiles in general, come in.

That's step one: we're going to decompose that time series. And what you're looking at here is two years' worth of meter reads. On the x-axis, you have the hour of the day, the 24 hours, and your measure of consumption on the y-axis. Each dot represents a meter read for a particular day and hour, right? So you have about 700 days plotted here.

Okay, step one, step two: we're going to apply functional data analysis. And I won't show how to do that or anything; this is just to showcase what you can do with it. And this one always gives me trouble because it looks like static on a screen, kind of artsy. If you've ever seen those memes about deep learning art and what that looks like on the back end, it kind of looks like this. It's funny. What we've done is fit a functional curve to every day, right? So for the last two years' worth of meter reads, we fit a curve to every one of those days. And you can see a couple things from here; I won't get too into that. But it looks like the bottom portion of consumption, between zero and 400, is where the majority of the consumption for this particular customer over the last few years sits. And there are some outliers, I guess, out beyond that 400 range.

Step three is to summarize, right? Because this is useful for a couple different things we talked about. We can summarize all these curves into a single curve, and we can build confidence intervals. I would never present the previous plot to a stakeholder, because they would look at you like you're crazy, but this is much more readable, right? So we've summarized all those curves to produce a point-estimate curve. We fit the curve to these 24 dots, the meter reads, and we extracted the standard deviations, the variance component; because we had all those curves, we were able to build confidence intervals around that. And that allows us to answer questions, depending on what they may be. You can say something like: on average, we are 95% confident that consumption at this point in time is between A and B. A little bit more of a complete picture.
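
Putting steps two and three together, a sketch with fda; reads_mat (24 hourly reads per column, one column per day) is a hypothetical input, and the 95% band is a rough pointwise construction:

```r
library(fda)

# reads_mat: a 24 x n_days matrix, one column of hourly reads per day (hypothetical)
hours <- 0:23
basis <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
curves <- smooth.basis(hours, reads_mat, basis)$fd   # one curve per day

mean_curve <- mean.fd(curves)   # the point-estimate curve
sd_curve   <- sd.fd(curves)     # pointwise standard deviation across days

# Rough pointwise 95% band around the mean profile
grid <- seq(0, 24, length.out = 200)
m <- eval.fd(grid, mean_curve)
s <- eval.fd(grid, sd_curve)
n <- ncol(reads_mat)
plot(grid, m, type = "l", ylim = range(m - 2 * s, m + 2 * s))
lines(grid, m + 1.96 * s / sqrt(n), lty = 2)
lines(grid, m - 1.96 * s / sqrt(n), lty = 2)
```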

And that's the daily load profile, right? But there are other ways to decompose a time series: you can do day of week, day of month, month of year. Be creative; there might be seasonal components other than that 1-to-24-hour period. And those three steps are the same across any of these decompositions: you choose your decomposition, you fit the functional data analysis, and then you summarize.

And this is in part hosted by RStudio, so as I was putting this together, I thought, well, one thing you could do is decompose the time series in a couple of different ways, collect those decompositions, and put them together in some Shiny app dashboard. And you would very quickly have a rough idea of what the average behavior for a customer is. For example, the first one, you see a frown on the top left. I probably wouldn't use a line for this, maybe a bar chart would be more appropriate, but you see that for this customer, a business customer, consumption is highest between Monday and Friday, and there's an interaction effect on the weekends, right, consumption drops on the weekends. On the bottom right, you see the month, and consumption tends to be higher on average in the winter months versus the summer. It's kind of useful to see, experimenting with different things, playing around. It's a very flexible technique.

Second Q&A

One was: often with meter readings you can have a lot of noise. Do you need to clean the data in advance to remove outliers and other noise in your data?

Good question. There are, I guess, two parts. First, the raw meter reads are cleaned up in a production system by our engineers, so I'm extracting this from an already-cleaned state in a database somewhere. I don't do too much outside of that; whatever they've done, I just take it all in, because I'm summarizing that information. If there are outliers, they're kind of taken care of by averaging, right? If you're afraid that there are too many outliers, instead of summarizing by the mean, you can do the median. Or you can do a pre-processing step; there are so many different ways to define an outlier. However you choose, you can pre-process and eliminate those data points and then just work with what remains. But I didn't do any of that here. I used the whole two years.

Yeah, there's one other question that was, can the FDA approach be used to estimate deviations from the norm, like theft of electricity?

I would imagine. It's a pretty robust technique. I haven't done any sort of anomaly detection, but I would imagine you could. For example, if you had an estimate of what normal consumption is, and then all of a sudden there's a spike, you would see that, right? Because if you built a confidence interval around what you thought to be true consumption, true behavior, and built that confidence interval to be super wide, 99%, right? Then if something did come in way outside of that band, it might be worthwhile to flag it and investigate it further.

Derivatives

Derivatives. I found this to be really interesting. Outside of calculus, I didn't think I would use derivatives at work, but it turns out to be really useful. One of the things I'm working on now is sort of a feature engineering technique, building features for machine learning, predictive-type work. I'll show you a little bit of what I mean by that. It's just opened up a new realm of analysis for me that I've been fascinated by.

And we're going to revisit this: we fit this curve to this customer, and now we're going to differentiate it. We're going to look at the velocity and acceleration for this particular customer, for this average-behavior curve. And there's a couple of interesting things here. If you look at the max velocity and the minimum, right, these global extrema, you're almost able to describe some of the behavior of the customer. You can say something like, perhaps on-peak hours for this particular customer are between 8am and 5pm. Now, knowing that this is a business customer, you're like, yeah, that makes sense. And if you wanted to, for fun, if you were doing calculus, you can do the second derivative test, right: where the first derivative is zero, look at the second derivative, is it greater than zero or less than zero? That tells you whether that time is a max or a minimum. Kind of cool.
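
A sketch of evaluating those derivatives with fda's eval.fd; mean_curve here is a stand-in for the fitted average-behavior curve:

```r
library(fda)

# mean_curve: a fitted fd object, e.g. the average daily profile from earlier
grid <- seq(0, 24, by = 0.1)
velocity     <- eval.fd(grid, mean_curve, Lfdobj = 1)  # first derivative
acceleration <- eval.fd(grid, mean_curve, Lfdobj = 2)  # second derivative

grid[which.max(velocity)]  # hour of fastest ramp-up   (max velocity)
grid[which.min(velocity)]  # hour of fastest ramp-down (min velocity)

# Second-derivative test: where velocity crosses zero,
# acceleration < 0 suggests a local max, > 0 a local min
```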

Next, this is a classic, though it was new to me, this is a classic way of looking at derivative information in engineering and physics. It's the same information we just looked at, but presented in a slightly different manner. It's really good for cyclical-type data; in this case, it works really nicely with consumption and meter-read-type data. And what you see is what we said earlier: all the way to the right, on the positive side of the x-axis, you have velocity, and you see 8am is the max velocity. All the way to the left, we see 5pm, the minimum velocity. And then you see these cycles, almost these circles, right: you have this big circle to the right in the positive-velocity section, you have a small circle centered around zero, maybe a plateau in consumption, and then you have two periods of ramping down in consumption, one bigger than the other.

And that's essentially what's happening here, right? Between the hours of 4 and 10am, the consumption for this particular customer, on average, starts to ramp up and peaks at 8am. Then somewhere in the midday it plateaus. Then you have two periods of ramping down: the primary ramping-down period is between about 2pm and 8pm, and then you have this final ramping down for the day. It's almost like they're closing up shop for the end of the day. That's pretty interesting. I personally have never looked at it this way; I find it really interesting to describe customer behavior in this way. And it's another descriptive feature to add information and context to any analysis you're doing.
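
Continuing from the velocity and acceleration evaluated in the snippet above, the phase-plane view he's describing might be drawn like this:

```r
# Phase-plane view: velocity on the x-axis, acceleration on the y-axis,
# using the values evaluated in the previous snippet
plot(velocity, acceleration, type = "l",
     xlab = "Velocity (first derivative)",
     ylab = "Acceleration (second derivative)")
abline(h = 0, v = 0, lty = 3)

# Label a few hours along the loop to see the daily cycle
idx <- seq(1, length(grid), by = 20)
text(velocity[idx], acceleration[idx], labels = round(grid[idx], 1), cex = 0.7)
# (the fda package also ships a phaseplanePlot() helper along these lines)
```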

And we can translate this back to the original data, right? We have those four periods: the first period of the day is ramping up, you have this plateau in consumption, that primary ramp-down, and then you have that closing-up-shop period. And then the cycle repeats: ramp up, plateau, ramp down, ramp down.

But I found this really fascinating. I thought this was really cool. One of the things I am working on, as a proof of concept at work, is to extract those global extrema, using the first derivative, as individual on-peak periods. For example, right now at Consumers Energy, let's say we have a predefined window for what we call on-peak hours for residential customers. I don't remember how we came up with those, but let's say they're from A to B. Well, that's very broad, and we ask customers to fit into that mold. But using something like derivatives, you can define what on-peak hours are for each customer, so you can be hyper-personalized, much more customer-centric. So I'm building a data set for all our customers now to define those personal on-peak hours, to potentially use and explore as features in other types of work.
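
A hedged sketch of that per-customer idea; profiles (24 mean hourly reads per column, one column per customer) is a hypothetical input:

```r
library(fda)

# profiles: a 24 x n_customers matrix of mean hourly reads (hypothetical)
hours  <- 0:23
basis  <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
curves <- smooth.basis(hours, profiles, basis)$fd

grid <- seq(0, 24, by = 0.1)
vel  <- eval.fd(grid, curves, Lfdobj = 1)  # one velocity column per customer

# Personalized on-peak markers: steepest ramp-up and ramp-down per customer
on_peak_start <- grid[apply(vel, 2, which.max)]
on_peak_end   <- grid[apply(vel, 2, which.min)]
```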

Third Q&A

Would you want to jump in and ask it live? Okay, I'll ask the question. It is: what are the approaches to account for, oh my gosh, I'm not gonna say this word right, heterogeneity, other than fitting separate models to investigate interactions with some x variable like location?

Good question. I haven't done too much of that. There's functional regression, which would help you with that. If I understood you correctly, if you had other components and you wanted to do some sort of functional analysis with them, you could use functional regression. It's very similar to what you're used to, just with a layer of abstraction, with functions or curves as opposed to individual features.

Hi, I really liked the idea. Can you hear me? Yes, I can. I really liked the idea about using the derivative to find the peak period. I was thinking: if I have meter reads from, let's say, 100 customers, and I wanted to find which customers have the same peak periods. I could see roughly how I could do it for one and then do it for all of them. My general idea was that I could write code which would give me, okay, these are the serial numbers of the meters of customers who have peak periods between eight and five, these are customers which have peak periods between six and 10, and so on. That's what I was thinking. Is that something where you think this can be applied?

That's a good question, and it's very prescient. That's actually exactly what I'm working on right now. It's almost two steps, right? Once you've defined these on-peak periods for all customers, you can aggregate that information to construct group-level on-peak periods. So let's say, in your example, I have 100 meter reads and I've defined what the on-peak periods are using derivatives. Then, for your purpose, you would probably just cluster. You could cluster that information and see which customers have similar on-peak periods. That's probably how I would approach that problem. And I'm working on an internal white paper that does that very thing, actually. It clusters about 60,000 customers' worth of information and defines on-peak periods for each cluster.

Yes, yes. Because the type of data I work with is renewable energy. So we wanted to know which people would be the best fit for solar energy, because their on-peak period is in the daytime, and which people would be best for other types of energy, because their on-peak period is different.

Yeah, so that's where I was heading. The questions have been great; that's a perfect segue into the next section. There's been some recent research about clustering these curves. So you could, for example, fit a curve to each of your 100 customers and then cluster the curves themselves, as opposed to treating the time series measurements, let's say you had 24 hours, as individual features that are highly correlated. And funny enough, that is the next section.

Curve clustering

As I've been learning this technique and applying it, the fitting of the curves, kind of like connecting the dots, then derivatives, and then clustering has been my progression through this learning process. The first thing was just getting comfortable fitting the curves, trying different ways to fit them: Fourier bases, cubic splines, wavelets, you name it. Then getting the derivative information, seeing how that's useful, building use cases, kind of showing the utility. And then the question came up, just like yours: well, what else can we do with that? And it just occurred to me, well, I'm in marketing. A lot of what we do is segmentation and clustering, and we want to be more targeted in our messaging and all that. Can we cluster this information? And it turns out you can.

And for this, I'm using the Hello World example of functional data analysis. There is a package in R called fda, and in it there's this data set, if you want to get in and play with it. It's averaged Canadian weather temperatures for, I believe, 35 weather stations in Canada, over a year. So they've averaged this. It looks almost like a load profile, but instead of hours, you're looking at days of the year. And they've averaged the temperatures; the red line there is the average of the group. And before I proceed, take a look at this and ask yourselves: if you had two crayons or colored pencils, whatever, and I asked you to color the lines based on how you would group these into two different groups, how would you do that?

Keep that in mind. How we've done it traditionally at work has been to treat the features, in this case days of the year, in my case it's been the 24 hours of the day, in this example it's 1 through 365, as features in a traditional clustering process. The features are highly correlated, right? They're dependent on each other. So you use some kind of dimension reduction technique, principal component analysis, for example. In this case, I extracted two PCs; that was like 95 or 97 percent of the variance. And this is what comes out. And it looks like it makes sense, right? There aren't really any questions; it looks pretty straightforward. And this is your traditional clustering: PCA, dimension reduction if you need it, and you do need it here, and then k-means.
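
That traditional route is easy to reproduce on the same example data; a sketch using two PCs and two clusters, as in the talk (the exact settings he used aren't stated):

```r
library(fda)
data(CanadianWeather)

# 365 daily average temperatures for 35 Canadian stations
temp <- CanadianWeather$dailyAv[, , "Temperature.C"]

# Treat the 365 days as (highly correlated) features:
# PCA for dimension reduction, then k-means
pcs <- prcomp(t(temp), scale. = TRUE)
km  <- kmeans(pcs$x[, 1:2], centers = 2, nstart = 25)
km$cluster  # cluster assignment per station
```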

And then this is relatively newer. This is clustering the curves themselves via a library in R called funHDDC. I think the first paper describing this came out in 2017, maybe 2015. I'm not sure, but it's recent. And it does something very similar, but from a functional perspective. And the clustering outcome looks very different from the traditional clustering result. So this is why, early on, I said it's another tool in your toolbox.
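
And the functional route he's contrasting it with, sketched with funHDDC's defaults; the basis size here is an arbitrary choice:

```r
library(fda)
library(funHDDC)
data(CanadianWeather)

temp   <- CanadianWeather$dailyAv[, , "Temperature.C"]
basis  <- create.fourier.basis(c(1, 365), nbasis = 21, period = 365)
curves <- smooth.basis(1:365, temp, basis)$fd

# Cluster the curves themselves rather than the raw daily features
set.seed(1)
res <- funHDDC(curves, K = 2)
res$class  # cluster assignment per station
```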

Different results; it depends on your use case. Maybe this better aligns with what your needs are. To me, when I looked at those black and white lines originally, this aligned more with what I thought, funny enough. But I've asked others in preparation for the presentation, and they said the traditional result lined up more with what they thought. That's kind of interesting.

That is it for that section. I haven't explored this too in-depth. I do have another project I'm working on with a colleague to build this out in more detail, comparing the different approaches to see if this curve clustering idea is something we want to adopt and replace, or have in addition to, our traditional clustering processes.

Final Q&A and resources

Awesome. I know on the topic of packages, because you just mentioned FDA, someone else had asked, are there other packages that you could recommend to us for functional data analysis?

Yeah, absolutely. The fda package is the classic. Matthew mentions the refund package; that's really good. There's a supplemental package to fda called fda.usc, I think, and it gives you a little bit more functionality. For example, in fda you can average the curves through the mean; with this supplemental package, you can do the median and a bunch of other stuff. There are a couple clustering packages too. Actually, wait a second, I have resources. I would recommend visiting the CRAN Task View. This is really neat. They have a running list of all available functional data analysis libraries in R. That's how I discovered the clustering portion; there's a small snippet there that describes the different techniques for curve clustering.

I would say just visit that and explore. I got started on this journey in school, actually. We were learning about splines and GAMs, generalized additive models. I was really interested in the idea, so I asked my professor where I could get more information about functional data analysis, and he pointed me towards work by Professor Ramsay. I purchased these two books here. Functional Data Analysis is theoretical; it's the math behind things. It's really good for understanding what the heck is going on in the background. For an applied perspective, Functional Data Analysis with R and MATLAB is a great resource. I actually have it right here; I was doing something with it this morning. I reference it all the time.

There are also really great public posts. Joseph, I believe, is an RStudio employee. He has three or four posts on R Views that are really good. There's an online course that's in-depth and quite lengthy; I'm sure it would cover your questions. And if you're interested in GAMs specifically, there's a great course by Noam Ross on that.

There are other books too that are really good that I haven't purchased. I've only stuck with these two. In summary, the R ecosystem for functional data analysis is really rich. If you want to do something, it's there. Then I do know there are a few resources in Python as well.

Thank you so much, Santiago. That was awesome.

Yeah. One of Joseph's posts on R Views, the R Views blog, shows a ... it looks like ... shoot, I'm drawing a blank on the name, but it's a map of how the different packages relate to each other. At the center is the refund package, which is getting a lot of love in the chat section, but also fda. The fda package and the refund package are pretty much the source of all the newer work. That's really cool. I would say check that out. That's a really neat image.

Yeah. There's a lot out there. The ecosystem's so rich. Explore if you're interested.

The question was: on the weather time series analysis, based on what criteria do we assign color? I think I get it. In the traditional clustering paradigm, since our features are so related, you're dealing with collinearity, and dimension reduction is almost a necessity. Using PCA on the features, treating them like independent features, and extracting some PCs is the way to go. I think that answers the question from a traditional perspective. From a curve clustering perspective, it's doing pretty much the same thing. It's doing, like you've seen in the chat, probably a functional PCA on the back end too, and extracting the PCs based on some kind of variance metric, maybe.

I see someone else said: I know the GAMs course, or G-A-M. What is the difference between GAMs and FDA? Are they the same thing? Sort of. Yeah, I mean, yes, I guess in short, yes. GAMs use cubic splines, so if you remember that earlier math formula, that linear combination: if you think of the regression component, you have your intercept and then your coefficients times some data. That phi function builds out the data component, the nonparametric data component, right? There are different ways to fit that in GAMs. I don't know if you can switch that; I've only ever seen it with cubic splines, in which case that is FDA, that is functional data analysis.

Let's see, you can generalize further and do maybe a functional regression on your own using the functional data analysis techniques, and then define the phi function as whatever you want: the Fourier series, wavelets, monotone, polynomial, whatever. But I think GAMs specifically use cubic splines.
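
For the GAM connection, a minimal mgcv sketch on toy data; the bs = "cr" argument requests a cubic regression spline basis (mgcv does in fact let you swap bases via bs, though cubic splines are what he has in mind here):

```r
library(mgcv)

# Toy daily profile: hourly consumption with a smooth daily cycle plus noise
toy <- data.frame(hour = 0:23)
toy$kwh <- 200 + 150 * sin(2 * pi * toy$hour / 24) + rnorm(24, sd = 10)

# s(hour) is a linear combination of spline basis functions --
# the same basis-expansion idea as the FDA formula from earlier
fit <- gam(kwh ~ s(hour, bs = "cr", k = 10), data = toy)
plot(fit)  # the fitted smooth
```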

The question was: curious to know if anyone has also tried using wavelets for curve modeling.

I haven't yet. In one of the textbooks, Professor Ramsay touts it as being pretty good. I haven't really worked with it too much. I've mostly stuck with the Fourier series, the Fourier basis expansion, and cubic splines.

For meter reads, I found the Fourier basis functions work really well, except in rare cases, and I've seen this more with business customers, where consumption can be, let me collect my thoughts. The Fourier series works really well for periodic data, right, because it's built from sines and cosines, but it doesn't do very well when things are jagged. It doesn't adjust to, there's a term for it, but if your consumption was like this, right, you go from nothing straight up, then back down, these very jagged rises and peaks, the Fourier series won't approximate that very well. Cubic splines would fit that much better, but there are downsides to that too, right, because you have to account for differentiability if that's what you're interested in, or the endpoints with cubic splines can be kind of wonky, so if you're doing something there, that might get weird.

The anonymous question was: they need to know how to change the data class for all variables from double to the original class, without doing it manually, when they import a CSV file. Their further question was they want to change everything, like 50 variables, at once, and how to do that. Yeah, it's actually pretty easy, all thanks to the tidyverse, funny enough, so shout out RStudio. Using stuff like the across function, dplyr's across, you select your fields. If it was all fields, it'd be super easy. If you need a certain set of fields, you can do something like dplyr across with a selection helper, contains, for example, if you have a set prefix or suffix, and then the function would be whatever you need to convert by.
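
A sketch of what he's describing; the file name, target classes, and column prefix are hypothetical:

```r
library(readr)
library(dplyr)

df <- read_csv("meter_data.csv")  # hypothetical file; columns come in as double

# Convert every double column at once...
df <- df %>% mutate(across(where(is.double), as.integer))

# ...or only columns matching a prefix/suffix
df <- df %>% mutate(across(starts_with("meter_"), as.character))
```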

So as part of this, since I'm using actual customer information, I have to add noise to meet compliance standards and all that stuff, and I do that very thing. I read in the raw data, and then, using the tidyverse, I do an across function, apply noise, and that just takes care of everything automatically. There are probably other ways to do it, but the tidyverse would probably be my go-to. It's easy.

And then another question was, how often do you need to refit the curves or clustering?

It depends. If I were to generalize that a little bit further: depending on what you're doing, if you're doing a one-off analysis, then only once, right? If you're putting this into production to do something like predicting something, or clustering groups of customers, then you're entering a broader conversation about when you retune models in general. So, if you're doing something in production, you should have some sort of data-drift-type pipeline that assesses whether your data has changed. There was a previous comment about outliers. If your input data has tended to be one way and then all of a sudden behavior changes, let's say due to COVID or something else, then it's probably time to retune, right? You should have a flag that says, hey, just a heads up, or even halts your procedure to say your input data is way different than it used to be.

That's kind of a broader topic. There's no, I don't think there's a set defined period, but there's a lot of research out there to help guide that conversation as to when, how often, and it probably depends on the nature of your data, too. If you're doing something at the daily level, it's probably more frequent. If you're working on aggregated yearly meter reads, then once or twice a year would probably suffice.

Hi, yes, I was still thinking about the clustering, and I was thinking about what if there's sort of a bias in the data, in terms of, for my hundred customers, I don't have the same amount of data. For one customer, I only have data for one month, or maybe just a couple of days in a month, and for some customers I have every day for six months. Would that variation affect the clustering?

It shouldn't. The functional data analysis techniques are really robust. In my case, I'll prep the data, right? I will look at only customers who have had at least six months of meter reads, and I'll eliminate anywhere consumption is zero, because some people might just activate a meter and never use it, right, get rid of those. I've used, what do you call it, equally spaced basis functions; the points are called knots, and the knots can be equally spaced. In my case, in the presentation today, there are 24, right, because I'm working with 24 hours in the day: I define a knot at each hour of the day. But let's say in your case you had month one, month two, month five, month seven, month 12. Then you wouldn't use equally spaced knots, because you don't have them. What you might do is fit curves individually to each customer using whatever data you do have, whatever periods you have, right, whatever knots you have, define those for each customer, and then work with the respective curves.

For example, with my work, since I don't have that issue, I can work with all the customers at once in that clustering. Actually, in that white paper I was telling you about, I'm working with about 60,000 customers, so it's a 24-by-60,000 matrix, not the other way around, right? My features are really wide. I can just treat them all at once, because the data is all the same shape.