
Santiago Rodriguez | Intro to functional data analysis | RStudio Meetup

Energy Meetup - Intro to functional data analysis
Presented by Santiago Rodriguez

Timestamps:
- 2:49 - Start of talk
- 12:15 - First Q&A point
- 23:23 - Second Q&A point
- 31:56 - Third Q&A point
- 40:44 - Final Q&A point

Abstract: The focus of this talk is to introduce functional data analysis (FDA) and to showcase some of its applications in the utility space. A primary source of data in a utility is meter reads. This data is periodic and appears discrete, but energy consumption is continuous, which makes meter reads a perfect use case for FDA. The talk will highlight two applications: load profiles and segmentation. The talk will be non-technical - no math or code - because the goal of the talk is to persuade you to investigate FDA on your own.

Speaker Bio: Santiago is a data scientist in the marketing department at Consumers Energy, a Michigan-based public utility. He focuses on data engineering, data science, and MLOps. Santiago has about a decade of experience working in the analytics space across energy, aviation, automotive, and contact centers. He has a bachelor's in finance from Florida State University and a master's in statistics from Texas A&M.

Resources shared in the chat:
- FDA descriptive statistics blog post: https://lnkd.in/gZjauTdt
- Intro to FDA blog post: https://lnkd.in/gM5NCbHe
- The refund package is introduced in "Introduction to Functional Data Analysis" by Kokoszka and Reimherr: https://lnkd.in/gd5Bt3tE
- "Gaussian Process Regression Analysis for Functional Data" by Shi & Choi: https://lnkd.in/gAig-AZa
- GAMs in R, a free course by Noam Ross: https://lnkd.in/gput2QbK
- Jiguo Cao - Intro to FDA on YouTube: https://lnkd.in/ga4qfGDr
- Feedback: rstd.io/meetup-feedback
- Talk submission: https://lnkd.in/gJ7EUSCk
- If you'd like to find out about upcoming events, you can also add this calendar: rstd.io/community-events

Packages shared in the chat:
- CRAN Task View: https://lnkd.in/gBqtuV2b
- fda: https://lnkd.in/ge85i6UE
- refund: https://lnkd.in/gXGcT79t
- fda.usc: https://lnkd.in/ggHdUaEe

Mar 24, 2022
1h 0min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Awesome. Well, let's get started here, and we can let people come in from the waiting room as they join as well. But hi, everybody. Thank you so much for joining us today. Welcome to the RStudio Enterprise Community Meetup, our energy meetup today. I'm Rachel, calling in from Boston. I'm actually at the RStudio office today in the Seaport.

If you just joined now, feel free to introduce yourselves through the chat window and say hello, maybe where you're calling in from. I like to let people know at the beginning that if you want to turn on live transcription for the meetup, you can do so in the Zoom bar below if you just press More.

But to go through a brief agenda: we'll have some short introductions of the meetup and welcome everyone here, then our introduction to functional data analysis with Santiago Rodriguez, and then lots of time for questions and open discussion at the end as well. Just a reminder, this meetup will be recorded, so it will be shared to the RStudio YouTube.

To ask questions, if you don't want to be part of the recording, you can always use the Slido link that I'll share in just a moment in the chat. Yeah, you can put your name in there too if you want and I can call on you to ask the question live or you could ask anonymously.

And for anybody joining for the first time, welcome. This is a friendly and open meetup environment for teams to share the work they're doing within their organizations, teach lessons learned, network with each other, really just allow us all to learn from each other. So thank you all so much for making this a welcoming community too. We really want this to be a space where everybody can participate, we can hear from everyone, regardless of your level of experience or the industry that you work in.

With that, thank you all again for joining us. I would love to turn it over to our speaker, Santiago Rodriguez. Santiago and I are friends from LinkedIn. That's actually how we first met. Santiago is a data scientist in the marketing department at Consumers Energy, a Michigan-based public utility.

Introduction

Hi, Rachel. Hi, everybody. And that's exactly right. I reached out to Rachel one day because she had a really interesting post on some functions in R that I hadn't heard of before. I thought they were great. So I just sent a, hey, cool post and just ended up here. Maybe that's inspirational for somebody. If you have something you want to share, reach out. There's a community.

Okay, so let's get started. Our agenda today is going to cover some introductions: I'm going to introduce myself, the topic, and a little bit about Consumers Energy, because I'm using some of their data, so I want to give a shout out; then a definition of what functional data analysis is. The meat of the presentation is really examples and applications of functional data analysis, this method of analysis. And then we'll wrap up with some resources in case anybody's interested in how I got started.

All right, allow me to introduce myself a little bit further. I was born in Ecuador, South America. I saw somebody in the chat was from Colombia; hello, neighbor. I grew up in South Florida, and I've lived in Dallas, Texas for seven years or so. I have a bachelor's from Florida State University and a master's in statistics from Texas A&M. I currently work as a data scientist in marketing, and I've had the pleasure of working across different industries, primarily because I'm a learner and I love learning new things in different industries and different functions.

And then when I'm not at work, I'm primarily reading, usually stats books. And if it's not stats books, it's fiction, nonfiction. I like to travel. This year, my wife and I have dedicated time to traveling. And I love to fish. I grew up in South Florida. There's a body of water around every corner.

Allow me to introduce Consumers Energy. They're the sponsor of the data for this presentation. It's a public utility founded in Michigan. They serve the majority of Michigan's residents and have a generation capacity of about six gigawatts.

And then about today. My talk will be primarily descriptive, non-technical. We won't get into math or code, really. It's all about what is functional data analysis and how can you use it, what are applications. And my goal today is to introduce this relatively young branch of statistics and then show you that it has value, that it can add value, that it's worth your time to explore and maybe learn as well.

And I wanted to quickly share how I got started with this. I'm by no means an expert on functional data analysis; I've been playing around with this stuff for a few months, probably close to a year. In the utility space, meter reads are our primary data source for a lot of things, and the time series nature of meter reads allows you to do a couple of different interesting things. If you're a time series person, you have decompositions and other more traditional stuff. I found that functional data analysis with meter reads was such a perfect combination, and I'll show you some of these examples in a bit.

As far as logistics, I've added little break points into the presentation, partly to give me a break, get a glass of water if I need. And if you have any questions, you can ask a question in that section. That way, you don't forget or I don't forget what the heck I was talking about.

What is functional data analysis?

All right, first up, this is an academic section, sort of: a definition, what can we do with functional data analysis? It's the only math formula you'll see today, and it's useful up front, before we get into applications, to define what functional data analysis is. FDA, that's the acronym, functional data analysis, is the analysis of information on curves and functions. For our purposes today, we're going to highlight curves, but if you work with, let's say, spatial data, you can use functions as well.

Functional data analysis is essentially a nonparametric, flexible regression technique that is used to approximate a curve or a function via a linear combination of basis functions. It looks like a regression formula: you have coefficients and data points. In this case, there's a slight tweak. Using these red dots, our data, in this case meter reads, we're going to fit a curve, right, we're going to approximate a curve using basis functions. That's the phi function in the formula. It's a nonparametric function, and it has the form of the bottom-right graph; it looks like lines.
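
The slide's formula isn't reproduced in the transcript, but in standard basis-expansion notation it would read:

$$
x(t) \;\approx\; \sum_{k=1}^{K} c_k \, \phi_k(t)
$$

where the $c_k$ are the coefficients estimated from the data and the $\phi_k$ are the basis functions he's calling the "phi function."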

And that's essentially it; I don't want to get too into the weeds here. You have a formula, you have coefficients that you estimate, and you can do your usual stuff, like least squares. And then the key here is this phi function that creates these curves, and then you fit your data with those.
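
No code is shown in the talk, but a minimal sketch of that fit in R with the fda package, on made-up hourly reads, could look like this:

```r
library(fda)

# Toy data: 24 hourly meter reads for one day -- the "red dots"
# (values are made up for illustration)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * (hours - 6) / 24) + rnorm(24, sd = 10)

# The phi functions: a Fourier basis over the 24-hour range
basis <- create.fourier.basis(rangeval = c(0, 24), nbasis = 7, period = 24)

# Estimate the coefficients by least squares and fit the curve
fit <- smooth.basis(argvals = hours, y = reads, fdParobj = basis)

plot(fit$fd)                       # the approximated curve
points(hours, reads, col = "red")  # the original meter reads
```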

Naturally, the question is: all right, we fit these curves, now what can we do with them? And it turns out you can do quite a bit, actually. You can do descriptive stats: max, min, median, variance, confidence intervals. You can do interpolation, things such as connecting the dots; if you have meter reads and you want to show a line, instead of using the default plotting method, you could use functional data analysis and connect the dots a little more smoothly. You can do extrapolation; functionally, that looks like functional regression, if you want to do some inferential-type work. If you're doing some predictive work, you can use GAMs; if you've ever used those, you've been using functional data analysis on the back end, since they use cubic splines. A relatively newer approach is clustering: you can cluster the curves. And if you're doing time series analysis, you can use some of these techniques as part of that as well.

And here's the pitch, right? I think it's worthwhile to explore functional data analysis; I know I found it useful. It's another tool in your toolbox, and if a question comes along from a stakeholder or something, you potentially have a better-fitting tool for the job. And if you've ever worked with your hands, or you've ever built anything, you know that the right tool makes the job so much easier.

As far as fitting the curve, as far as the method goes, it's quite flexible. It allows you to fit a curve via that phi function, right, that nonparametric function, in a number of different ways. Fourier bases are really good for periodic data, and in this case, meter reads tend to be periodic or semi-periodic; there's some kind of cyclicality to things, so it's good for that. B-splines, a common one, cubic splines, wavelets, etc. And then it's not only about fitting the curve, it's about fitting the right curve: you want to make sure you've accounted for residuals, that you're not overfitting and not underfitting. And there are a couple ways to fit the optimal curve: you have least squares methodology, as well as a penalization technique.
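
A rough sketch of those choices with the fda package follows; the toy data and the lambda value are arbitrary, picked just for illustration:

```r
library(fda)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * hours / 24) + rnorm(24, sd = 10)

# Fourier basis for periodic data; B-splines (cubic at order 4) otherwise
fourier_basis <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
bspline_basis <- create.bspline.basis(c(0, 24), nbasis = 12, norder = 4)

# Penalization: penalize curvature (the 2nd derivative) with smoothing parameter lambda
fd_par <- fdPar(bspline_basis, Lfdobj = int2Lfd(2), lambda = 0.1)
fit    <- smooth.basis(hours, reads, fd_par)
fit$gcv  # generalized cross-validation score, one way to judge the fit
```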

And then, arguably, the coolest thing about functional data analysis and the curves that you fit is that those curves are differentiable. At first, I didn't realize how useful that could be, and I kind of glossed over it. I've recently gone back, learned more, and explored that area further, because it opens up a treasure chest of information. I think some areas have experience working with derivatives: for example, engineers might look at rates of change, and physicists use derivatives often, right? But I've worked primarily in commercial functions, marketing, finance, operations, and I hadn't had a need to look at derivatives. It really is interesting, and I have an example of that in a bit.

First Q&A

There's a question in the chat here about overfitting. That's definitely possible. If you were just connecting the dots, you probably wouldn't care too much about the fit. But if you're doing anything inferential, or you want to account for overfitting or underfitting, I would suggest splitting the data, your typical hyperparameter tuning: try different numbers of basis functions and see which one has the best sum-of-squares error. You could also use the penalization technique; I found that to be probably the best approach. In summary, there are ways to account for and address underfitting and overfitting.
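
One way that tuning might look in code, sketched with the GCV score that fda's smooth.basis reports (toy data again; in practice you could also hold out data and compare sum-of-squares error, as he suggests):

```r
library(fda)
hours <- 0:23
reads <- 200 + 150 * sin(2 * pi * hours / 24) + rnorm(24, sd = 10)

# Grid-search the penalty; a smaller GCV score suggests a better
# balance between underfitting and overfitting
basis   <- create.bspline.basis(c(0, 24), nbasis = 12)
lambdas <- 10^seq(-4, 2, by = 0.5)
gcvs    <- sapply(lambdas, function(lam) {
  smooth.basis(hours, reads, fdPar(basis, int2Lfd(2), lam))$gcv
})
best_lambda <- lambdas[which.min(gcvs)]
```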

There was an anonymous question: what is the sample size you need to fit the curves?

Good question. Let's see. I am working on stuff on the side, so everything I'll share after this section are snippets of applications I'm working on as proofs of concept at work. And I've worked with millions of records, because, as you'd figure, time series data gets so big so fast. Or you can work with just 24 points, right? That's actually a really good question; it's a perfect segue into the next section. Let's say I have a year's worth of meter reads and I summarize them by day, right? So I take the mean by hour, over the 24 hours, and I'll end up with 24 points. That's like the picture we saw earlier: those red dots are 24 red dots, one for each of the hours. That's all you need to fit the curve.

If you wanted to do a little bit more, maybe some inferential work like building confidence intervals, then just having those 24 summarized points won't do you any good, because you have no way to construct the standard deviation; you don't have that sigma, the covariance matrix. So, and you'll see this in a second, you can work with the summarized information, just the 24 points in this example, or you can work with all the data, and then you have access to more information and can do a little bit more. But depending on your machine, depending on your hardware, that gets pretty out of control quickly.
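
A hedged sketch of that summarization step with dplyr; the meter_reads table and its column names are hypothetical stand-ins:

```r
library(dplyr)
library(lubridate)

# meter_reads: one row per read, with columns `timestamp` and `kwh`
# (table and column names are hypothetical -- adjust to your schema)
daily_profile <- meter_reads %>%
  mutate(hour = hour(timestamp)) %>%
  group_by(hour) %>%
  summarise(mean_kwh = mean(kwh, na.rm = TRUE), .groups = "drop")
# -> 24 rows, one mean read per hour: the "24 red dots"
```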

Load profiles

Okay, so we talked about summarizing the time series data, and in the utility space, I found that this is probably the most common way to do that. It's called the daily load profile. It's a way to summarize the consumption information so you can easily see trends. And there are a couple different ways to do this; load has different definitions, consumption and demand. The functional data analysis technique works for however you choose to summarize the data. I'm in marketing, so I tend to use the mean or some kind of centrality-type measure, mean or median, to summarize. But you could use the max if you're more focused on electrical generation capacity, what's the maximum load the system can handle.

And once you have these load profiles, you can connect the dots with functional data analysis, like we talked about. That was one of the questions: how many points do you need? As few as whatever you have, 24 in this case, or as many as you have. You can build confidence intervals, and I find that's useful: instead of just providing a point estimate, you provide a little more clarity as to what the variance looks like. And the daily load profile is the most classic decomposition here, but there are others; you can decompose it any other way.

This is our first example. This is time series information; this is an actual customer's meter reads. This is only two weeks; I'm working with about two years' worth of data on the back end, and that would look very messy if I plotted the whole thing. But this is essentially the underlying data structure. It's a time series. You can see it's not really upward-trending or anything, but there's some cyclicality to it, there are ups and downs. So if you've studied time series, you can look at this and extract information from it. But it's useful to summarize and decompose this information, and that's where the daily load profile, load profiles in general, come in.

That's step one: we're going to decompose that time series. And what you're looking at here is two years' worth of meter reads. On the x-axis, you have the hour of the day, the 24 hours, and your measure of consumption on the y-axis. Each dot represents a meter read for a particular day and hour, right? So you have about 700 days plotted here.

Okay, step one, step two: we're going to apply functional data analysis. And I won't show how to do that or anything; this is just to showcase what you can do with it. And this one always gives me trouble because it looks like static on a screen, kind of artsy. If you've ever seen those memes about deep learning art and what that looks like on the back end, it kind of looks like this. It's funny. What we've done is fit a functional curve to every day, right? So for the last two years' worth of meter reads, we fit a curve to every one of those days. And you can see a couple things from here; I won't get too into that. But it looks like the bottom portion of consumption, between zero and 400, is where the majority of the consumption for this particular customer over the last few years sits. And there are some outliers, I guess, out beyond that 400 range.

Step three is to summarize, right? Because this is useful for a couple different things we talked about. We can summarize all these curves into a single curve, and we can build confidence intervals. I would never present the previous plot to a stakeholder, because they would look at you like you're crazy, but this is much more readable, right? So we've summarized all those curves to produce a point-estimate curve. We fit the curve to these 24 dots, the meter reads, and we extracted the standard deviations, the variance component; because we had all those curves, we were able to build confidence intervals around that. And that allows us to answer questions, depending on what they may be. You can say something like: on average, we are 95% confident that consumption at this point in time is between A and B. A little bit more of a complete picture.
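
Putting steps two and three together, a sketch with fda; reads_mat (24 hourly reads per column, one column per day) is a hypothetical input, and the 95% band is a rough pointwise construction:

```r
library(fda)

# reads_mat: a 24 x n_days matrix, one column of hourly reads per day (hypothetical)
hours <- 0:23
basis <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
curves <- smooth.basis(hours, reads_mat, basis)$fd   # one curve per day

mean_curve <- mean.fd(curves)   # the point-estimate curve
sd_curve   <- sd.fd(curves)     # pointwise standard deviation across days

# Rough pointwise 95% band around the mean profile
grid <- seq(0, 24, length.out = 200)
m <- eval.fd(grid, mean_curve)
s <- eval.fd(grid, sd_curve)
n <- ncol(reads_mat)
plot(grid, m, type = "l", ylim = range(m - 2 * s, m + 2 * s))
lines(grid, m + 1.96 * s / sqrt(n), lty = 2)
lines(grid, m - 1.96 * s / sqrt(n), lty = 2)
```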

And that's the daily load profile, right? But there are other ways to decompose a time series: you can do day of week, day of month, month of year. Be creative; there might be seasonal components other than that 1-to-24-hour period. And those three steps are the same across any of these decompositions: you choose your decomposition, you fit the functional data analysis, and then you summarize.

And this is in part hosted by RStudio, so as I was putting this together, I thought, well, one thing you could do is decompose the time series in a couple of different ways, collect those decompositions, and put them together in some Shiny app dashboard. And you would very quickly have a rough idea of what the average behavior for a customer is. For example, the first one, you see a frown on the top left. I probably wouldn't use a line for this, maybe a bar chart would be more appropriate, but you see that for this customer, a business customer, consumption is highest between Monday and Friday, and there's an interaction effect on the weekends, right, consumption drops on the weekends. On the bottom right, you see the month, and consumption tends to be higher on average in the winter months versus the summer. It's kind of useful to see, experimenting with different things, playing around. It's a very flexible technique.

Second Q&A

One was: often with meter readings you can have a lot of noise. Do you need to clean the data in advance to remove outliers and other noise in your data?

Good question. There are, I guess, two parts. First, the raw meter reads are cleaned up in a production system by our engineers, so I'm extracting this from an already-cleaned state in a database somewhere. I don't do too much outside of that; whatever they've done, I just take it all in, because I'm summarizing that information. If there are outliers, they're kind of taken care of by averaging, right? If you're afraid that there are too many outliers, instead of summarizing by the mean, you can do the median. Or you can do a pre-processing step; there are so many different ways to define an outlier. However you choose, you can pre-process and eliminate those data points and then just work with what remains. But I didn't do any of that here. I used the whole two years.

Yeah, there's one other question that was, can the FDA approach be used to estimate deviations from the norm, like theft of electricity?

I would imagine. It's a pretty robust technique. I haven't done any sort of anomaly detection, but I would imagine you could. For example, if you had an estimate of what normal consumption is, and then all of a sudden there's a spike, you would see that, right? Because if you built a confidence interval around what you thought to be true consumption, true behavior, and built that confidence interval to be super wide, 99%, right? Then if something did come in way outside of that band, it might be worthwhile to flag it and investigate it further.

Derivatives

Derivatives. I found this to be really interesting. Outside of calculus, I didn't think I would use derivatives at work, but it turns out to be really useful. One of the things I'm working on now is sort of a feature engineering technique, building features for machine learning, predictive-type work. I'll show you a little bit of what I mean by that. It's just opened up a new realm of analysis for me that I've been fascinated by.

And we're going to revisit this: we fit this curve to this customer, and now we're going to differentiate it. We're going to look at the velocity and acceleration for this particular customer, for this average-behavior curve. And there's a couple of interesting things here. If you look at the max velocity and the minimum, right, these global extrema, you're almost able to describe some of the behavior of the customer. You can say something like, perhaps on-peak hours for this particular customer are between 8am and 5pm. Now, knowing that this is a business customer, you're like, yeah, that makes sense. And if you wanted to, for fun, if you were doing calculus, you can do the second derivative test, right: where the first derivative is zero, look at the second derivative, is it greater than zero or less than zero? That tells you whether that time is a max or a minimum. Kind of cool.
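
A sketch of evaluating those derivatives with fda's eval.fd; mean_curve here is a stand-in for the fitted average-behavior curve:

```r
library(fda)

# mean_curve: a fitted fd object, e.g. the average daily profile from earlier
grid <- seq(0, 24, by = 0.1)
velocity     <- eval.fd(grid, mean_curve, Lfdobj = 1)  # first derivative
acceleration <- eval.fd(grid, mean_curve, Lfdobj = 2)  # second derivative

grid[which.max(velocity)]  # hour of fastest ramp-up   (max velocity)
grid[which.min(velocity)]  # hour of fastest ramp-down (min velocity)

# Second-derivative test: where velocity crosses zero,
# acceleration < 0 suggests a local max, > 0 a local min
```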

Next, this is a classic, though it was new to me, this is a classic way of looking at derivative information in engineering and physics. It's the same information we just looked at, but presented in a slightly different manner. It's really good for cyclical-type data; in this case, it works really nicely with consumption and meter-read-type data. And what you see is what we said earlier: all the way to the right, on the positive side of the x-axis, you have velocity, and you see 8am is the max velocity. All the way to the left, we see 5pm, the minimum velocity. And then you see these cycles, almost these circles, right: you have this big circle to the right in the positive-velocity section, you have a small circle centered around zero, maybe a plateau in consumption, and then you have two periods of ramping down in consumption, one bigger than the other.

And that's essentially what's happening here, right? Between the hours of 4 and 10am, the consumption for this particular customer, on average, starts to ramp up and peaks at 8am. Then somewhere in the midday it plateaus. Then you have two periods of ramping down: the primary ramping-down period is between about 2pm and 8pm, and then you have this final ramping down for the day. It's almost like they're closing up shop for the end of the day. That's pretty interesting. I personally have never looked at it this way; I find it really interesting to describe customer behavior in this way. And it's another descriptive feature to add information and context to any analysis you're doing.
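
Continuing from the velocity and acceleration evaluated in the snippet above, the phase-plane view he's describing might be drawn like this:

```r
# Phase-plane view: velocity on the x-axis, acceleration on the y-axis,
# using the values evaluated in the previous snippet
plot(velocity, acceleration, type = "l",
     xlab = "Velocity (first derivative)",
     ylab = "Acceleration (second derivative)")
abline(h = 0, v = 0, lty = 3)

# Label a few hours along the loop to see the daily cycle
idx <- seq(1, length(grid), by = 20)
text(velocity[idx], acceleration[idx], labels = round(grid[idx], 1), cex = 0.7)
# (the fda package also ships a phaseplanePlot() helper along these lines)
```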

And we can translate this back to the original data, right? We have those four periods: the first period of the day is ramping up, you have this plateau in consumption, that primary ramp-down, and then you have that closing-up-shop period. And then the cycle repeats: ramp up, plateau, ramp down, ramp down.

But I found this really fascinating. I thought this was really cool. One of the things I am working on, as a proof of concept at work, is to extract those global extrema, using the first derivative, as individual on-peak periods. For example, right now at Consumers Energy, let's say we have a predefined window for what we call on-peak hours for residential customers. I don't remember how we came up with those, but let's say they're from A to B. Well, that's very broad, and we ask customers to fit into that mold. But using something like derivatives, you can define what on-peak hours are for each customer, so you can be hyper-personalized, much more customer-centric. So I'm building a data set for all our customers now to define those personal on-peak hours, to potentially use and explore as features in other types of work.
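
A hedged sketch of that per-customer idea; profiles (24 mean hourly reads per column, one column per customer) is a hypothetical input:

```r
library(fda)

# profiles: a 24 x n_customers matrix of mean hourly reads (hypothetical)
hours  <- 0:23
basis  <- create.fourier.basis(c(0, 24), nbasis = 9, period = 24)
curves <- smooth.basis(hours, profiles, basis)$fd

grid <- seq(0, 24, by = 0.1)
vel  <- eval.fd(grid, curves, Lfdobj = 1)  # one velocity column per customer

# Personalized on-peak markers: steepest ramp-up and ramp-down per customer
on_peak_start <- grid[apply(vel, 2, which.max)]
on_peak_end   <- grid[apply(vel, 2, which.min)]
```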

Third Q&A

Would you want to jump in and ask it live? Okay, I'll ask the question. It is: what are the approaches to account for, oh my gosh, I'm not gonna say this word right, heterogeneity, other than fitting separate models to investigate interactions with some x variable like location?

Good question. I haven't done too much of that. There's functional regression, which would help you with that. If I understood you correctly, if you had other components and you wanted to do some sort of functional analysis with them, you could use functional regression. It's very similar to what you're used to, just with a layer of abstraction, with functions or curves as opposed to individual features.

Hi, I really liked the idea. Can you hear me? Yes, I can. I really liked the idea about using the derivative to find the peak period. I was thinking: if I have meter reads from, let's say, 100 customers, and I wanted to find which customers have the same peak periods. I could see roughly how I could do it for one and then do it for all of them. My general idea was that I could write code which would give me, okay, these are the serial numbers of the meters of customers who have peak periods between eight and five, these are customers which have peak periods between six and 10, and so on. That's what I was thinking. Is that something where you think this can be applied?

That's a good question, and it's very prescient. That's actually exactly what I'm working on right now. It's almost two steps, right? Once you've defined these on-peak periods for all customers, you can aggregate that information to construct group-level on-peak periods. So let's say, in your example, I have 100 meter reads and I've defined what the on-peak periods are using derivatives. Then, for your purpose, you would probably just cluster. You could cluster that information and see which customers have similar on-peak periods. That's probably how I would approach that problem. And I'm working on an internal white paper that does that very thing, actually. It clusters about 60,000 customers' worth of information and defines on-peak periods for each cluster.

Yes, yes. Because the type of data I work with is renewable energy. So we wanted to know which people would be the best fit for solar energy, because their on-peak period is in the daytime, and which people would be best for other types of energy, because their on-peak period is different.

Yeah, so that's where I was heading. The questions have been great; that's a perfect segue into the next section. There's been some recent research about clustering these curves. So you could, for example, fit a curve to each of your 100 customers and then cluster the curves themselves, as opposed to treating the time series measurements, let's say you had 24 hours, as individual features that are highly correlated. And funny enough, that is the next section.

Curve clustering

As I've been learning this technique and applying it, the fitting of the curves, kind of like connecting the dots, then derivatives, and then clustering has been my progression through this learning process. The first thing was just getting comfortable fitting the curves, trying different ways to fit them: Fourier bases, cubic splines, wavelets, you name it. Then getting the derivative information, seeing how that's useful, building use cases, kind of showing the utility. And then the question came up, just like yours: well, what else can we do with that? And it just occurred to me, well, I'm in marketing. A lot of what we do is segmentation and clustering, and we want to be more targeted in our messaging and all that. Can we cluster this information? And it turns out you can.

And for this, I'm using the Hello World example of functional data analysis. There is a package in R called fda, and in it there's this data set, if you want to get in and play with it. It's averaged Canadian weather temperatures for, I believe, 35 weather stations in Canada, over a year. So they've averaged this. It looks almost like a load profile, but instead of hours, you're looking at days of the year. And they've averaged the temperatures; the red line there is the average of the group. And before I proceed, take a look at this and ask yourselves: if you had two crayons or colored pencils, whatever, and I asked you to color the lines based on how you would group these into two different groups, how would you do that?

Keep that in mind. How we've done it traditionally at work has been to treat the features, in this case days of the year, in my case it's been the 24 hours of the day, in this example it's 1 through 365, as features in a traditional clustering process. The features are highly correlated, right? They're dependent on each other. So you use some kind of dimension reduction technique, principal component analysis, for example. In this case, I extracted two PCs; that was like 95 or 97 percent of the variance. And this is what comes out. And it looks like it makes sense, right? There aren't really any questions; it looks pretty straightforward. And this is your traditional clustering: PCA, dimension reduction if you need it, and you do need it here, and then k-means.
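
That traditional route is easy to reproduce on the same example data; a sketch using two PCs and two clusters, as in the talk (the exact settings he used aren't stated):

```r
library(fda)
data(CanadianWeather)

# 365 daily average temperatures for 35 Canadian stations
temp <- CanadianWeather$dailyAv[, , "Temperature.C"]

# Treat the 365 days as (highly correlated) features:
# PCA for dimension reduction, then k-means
pcs <- prcomp(t(temp), scale. = TRUE)
km  <- kmeans(pcs$x[, 1:2], centers = 2, nstart = 25)
km$cluster  # cluster assignment per station
```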

And then this is relatively newer. This is clustering the curves themselves via a library in R called funHDDC. I think the first paper describing this came out in 2017, maybe 2015. I'm not sure, but it's recent. And it does something very similar, but from a functional perspective. And the clustering outcome looks very different from the traditional clustering result. So this is why, early on, I said it's another tool in your toolbox.
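
And the functional route he's contrasting it with, sketched with funHDDC's defaults; the basis size here is an arbitrary choice:

```r
library(fda)
library(funHDDC)
data(CanadianWeather)

temp   <- CanadianWeather$dailyAv[, , "Temperature.C"]
basis  <- create.fourier.basis(c(1, 365), nbasis = 21, period = 365)
curves <- smooth.basis(1:365, temp, basis)$fd

# Cluster the curves themselves rather than the raw daily features
set.seed(1)
res <- funHDDC(curves, K = 2)
res$class  # cluster assignment per station
```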

Different results; it depends on your use case. Maybe this better aligns with what your needs are. To me, when I looked at those black and white lines originally, this aligned more with what I thought, funny enough. But I've asked others in preparation for the presentation, and they said the traditional result lined up more with what they thought. That's kind of interesting.

That is it for that section. I haven't explored this too in-depth. I do have another project I'm working on with a colleague to build this out in more detail, comparing the different approaches to see if this curve clustering idea is something we want to adopt and replace, or have in addition to, our traditional clustering processes.

Final Q&A and resources

Awesome. I know on the topic of packages, because you just mentioned FDA, someone else had asked, are there other packages that you could recommend to us for functional data analysis?

Yeah, absolutely. The fda package is the classic. Matthew mentions the refund package; that's really good. There's a supplemental package to fda called fda.usc, I think, and it gives you a little bit more functionality. For example, in fda you can average the curves through the mean; with this supplemental package, you can do the median and a bunch of other stuff. There are a couple clustering packages too. Actually, wait a second, I have resources. I would recommend visiting the CRAN Task View. This is really neat. They have a running list of all available functional data analysis libraries in R. That's how I discovered the clustering portion; there's a small snippet there that describes the different techniques for curve clustering.

I would say just visit that and explore. I got started on this journey in school, actually. We were learning about splines and GAMs, generalized additive models. I was really interested in the idea, so I asked my professor where I could get more information about functional data analysis, and he pointed me towards work by Professor Ramsay. I purchased these two books here. Functional Data Analysis is theoretical; it's the math behind things. It's really good for understanding what the heck is going on in the background. For an applied perspective, Functional Data Analysis with R and MATLAB is a great resource. I actually have it right here; I was doing something with it this morning. I reference it all the time.

There are also really great public posts. Joseph, I believe, is an RStudio employee. He has three or four posts on R Views that are really good. There's an online course that's in-depth and quite lengthy; I'm sure it would cover your questions. And if you're interested in GAMs specifically, there's a great course by Noam Ross on that.

There are other books too that are really good that I haven't purchased. I've only stuck with these two. In summary, the R ecosystem for functional data analysis is really rich. If you want to do something, it's there. Then I do know there are a few resources in Python as well.

Thank you so much, Santiago. That was awesome.

Yeah. One of Joseph's posts on R Views, the R Views blog, shows a ... it looks like ... shoot, I'm drawing a blank on the name, but it's a map of how the different packages relate to each other. At the center is the refund package, which is getting a lot of love in the chat section, but also fda. The fda package and the refund package are pretty much the source of all the newer work. That's really cool. I would say check that out. That's a really neat image.

Yeah. There's a lot out there. The ecosystem's so rich. Explore if you're interested.

The question was: on the weather time series analysis, based on what criteria do we assign color? I think I get it. In the traditional clustering paradigm, since our features are so related, you're dealing with collinearity, and dimension reduction is almost a necessity. Using PCA on the features, treating them like independent features, and extracting some PCs is the way to go. I think that answers the question from a traditional perspective. From a curve clustering perspective, it's doing pretty much the same thing. It's doing, like you've seen in the chat, probably a functional PCA on the back end too, and extracting the PCs based on some kind of variance metric, maybe.

I see someone else said: I know the GAMs course, or G-A-M. What is the difference between GAMs and FDA? Are they the same thing? Sort of. Yeah, I mean, yes, I guess in short, yes. GAMs use cubic splines, so if you remember that earlier math formula, that linear combination: if you think of the regression component, you have your intercept and then your coefficients times some data. That phi function builds out the data component, the nonparametric data component, right? There are different ways to fit that in GAMs. I don't know if you can switch that; I've only ever seen it with cubic splines, in which case that is FDA, that is functional data analysis.

Let's see, you can generalize further and do maybe a functional regression on your own using the functional data analysis techniques, and then define the phi function as whatever you want: the Fourier series, wavelets, monotone, polynomial, whatever. But I think GAMs specifically use cubic splines.
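
For the GAM connection, a minimal mgcv sketch on toy data; the bs = "cr" argument requests a cubic regression spline basis (mgcv does in fact let you swap bases via bs, though cubic splines are what he has in mind here):

```r
library(mgcv)

# Toy daily profile: hourly consumption with a smooth daily cycle plus noise
toy <- data.frame(hour = 0:23)
toy$kwh <- 200 + 150 * sin(2 * pi * toy$hour / 24) + rnorm(24, sd = 10)

# s(hour) is a linear combination of spline basis functions --
# the same basis-expansion idea as the FDA formula from earlier
fit <- gam(kwh ~ s(hour, bs = "cr", k = 10), data = toy)
plot(fit)  # the fitted smooth
```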

The question was: curious to know if anyone has also tried using wavelets for curve modeling.

I haven't yet. In one of the textbooks, Professor Ramsay touts it as being pretty good. I haven't really worked with it too much. I've mostly stuck with the Fourier series, the Fourier basis expansion, and cubic splines.

For meter reads, I found the Fourier basis functions work really well, except in rare cases, and I've seen this more with business customers, where consumption can be, let me collect my thoughts. The Fourier series works really well for periodic data, right, because it's built from sines and cosines, but it doesn't do very well when things are jagged. It doesn't adjust to, there's a term for it, but if your consumption was like this, right, you go from nothing straight up, then back down, these very jagged rises and peaks, the Fourier series won't approximate that very well. Cubic splines would fit that much better, but there are downsides to that too, right, because you have to account for differentiability if that's what you're interested in, or the endpoints with cubic splines can be kind of wonky, so if you're doing something there, that might get weird.

The anonymous question was: they need to know how to change the data class for all variables from double to the original class, without doing it manually, when they import a CSV file. Their further question was they want to change everything, like 50 variables, at once, and how to do that. Yeah, it's actually pretty easy, all thanks to the tidyverse, funny enough, so shout out RStudio. Using stuff like the across function, dplyr's across, you select your fields. If it was all fields, it'd be super easy. If you need a certain set of fields, you can do something like dplyr across with a selection helper, contains, for example, if you have a set prefix or suffix, and then the function would be whatever you need to convert by.
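
A sketch of what he's describing; the file name, target classes, and column prefix are hypothetical:

```r
library(readr)
library(dplyr)

df <- read_csv("meter_data.csv")  # hypothetical file; columns come in as double

# Convert every double column at once...
df <- df %>% mutate(across(where(is.double), as.integer))

# ...or only columns matching a prefix/suffix
df <- df %>% mutate(across(starts_with("meter_"), as.character))
```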

So as part of this, since I'm using actual customer information, I have to add noise to meet compliance standards and all that stuff, and I do that very thing. I read in the raw data, and then, using the tidyverse, I do an across function, apply noise, and that just takes care of everything automatically. There are probably other ways to do it, but the tidyverse would probably be my go-to. It's easy.

And then another question was, how often do you need to refit the curves or clustering?

It depends. If I were to generalize that a little bit further: depending on what you're doing, if you're doing a one-off analysis, then only once, right? If you're putting this into production to do something like predicting something, or clustering groups of customers, then you're entering a broader conversation about when you retune models in general. So, if you're doing something in production, you should have some sort of data-drift-type pipeline that assesses whether your data has changed. There was a previous comment about outliers. If your input data has tended to be one way and then all of a sudden behavior changes, let's say due to COVID or something else, then it's probably time to retune, right? You should have a flag that says, hey, just a heads up, or even halts your procedure to say your input data is way different than it used to be.

That's kind of a broader topic. There's no, I don't think there's a set defined period, but there's a lot of research out there to help guide that conversation as to when, how often, and it probably depends on the nature of your data, too. If you're doing something at the daily level, it's probably more frequent. If you're working on aggregated yearly meter reads, then once or twice a year would probably suffice.

Hi, yes, I was still thinking about the clustering, and I was thinking about what if there's sort of a bias in the data, in terms of, for my hundred customers, I don't have the same amount of data. For one customer, I only have data for one month, or maybe just a couple of days in a month, and for some customers I have every day for six months. Would that variation affect the clustering?

It shouldn't. The functional data analysis techniques are really robust. In my case, I'll prep the data, right? I will look at only customers who have had at least six months of meter reads, and I'll eliminate anywhere consumption is zero, because some people might just activate a meter and never use it, right, get rid of those. I've used, what do you call it, equally spaced basis functions; the points are called knots, and the knots can be equally spaced. In my case, in the presentation today, there are 24, right, because I'm working with 24 hours in the day: I define a knot at each hour of the day. But let's say in your case you had month one, month two, month five, month seven, month 12. Then you wouldn't use equally spaced knots, because you don't have them. What you might do is fit curves individually to each customer using whatever data you do have, whatever periods you have, right, whatever knots you have, define those for each customer, and then work with the respective curves.

For example, with my work, since I don't have that issue, I can work with all the customers at once in that clustering. Actually, in that white paper I was telling you about, I'm working with about 60,000 customers, so it's a 24-by-60,000 matrix, not the other way around, right? My features are really wide. I can just treat them all at once, because the data is all the same shape.