Resources

Machine Learning with R and TensorFlow

J.J. Allaire's keynote at rstudio::conf 2018 on the R interface to TensorFlow (https://tensorflow.rstudio.com), a suite of packages that provide high-level interfaces to deep learning models (Keras) and standard regression and classification models (Estimators), as well as tools for cloud training, experiment management, and production deployment. The talk also discusses deep learning more broadly: what it is, how it works, and where it might be relevant to users of R in the years ahead.

Slides: https://beta.rstudioconnect.com/ml-with-tensorflow-and-r/
J.J. Allaire on GitHub: https://github.com/jjallaire
Twitter: @fly_upside_down (https://twitter.com/fly_upside_down)
Related blog post: https://blog.rstudio.com/2018/02/06/tensorflow-for-r/

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you very much, Hadley, and thank all of you for coming to the conference. I hope you've had a great first few days and a great night last night. It's a great privilege for us at RStudio to be here with all of you this week.

Today I'm going to talk about machine learning with TensorFlow and R. For those of you who don't know, TensorFlow is an open source project from Google. It actually came out just a little bit over two years ago. And from the day it came out, I have been excited about what we could do with TensorFlow from R. For about the last 18 months, some others at RStudio and I have been working on R interfaces to TensorFlow, and I'm really excited today to share a lot of that work with all of you.

So I'm going to start by giving an introduction to what TensorFlow is, what its core constituent parts are, and how it works at a low level. The principal application, but by no means the exclusive application, of TensorFlow is deep learning. So I want to talk a little bit conceptually about what deep learning is, how it works, and what it is and isn't useful for. And then I want to get into the work we've done in R to provide interfaces to TensorFlow for doing both deep learning and other types of machine learning.

What is TensorFlow?

So let's start with what is TensorFlow. And you might offhand just say, oh, TensorFlow, that's a deep learning library. It is that, but it's actually considerably more than that. TensorFlow is actually a very general purpose numerical computing library. And in R, we have a long history of providing interfaces to numerical computing libraries. One of the original motivations for the S language was to provide interfaces to Fortran numerical computing libraries.

And today in R, we wrap that kind of code: we wrap BLAS code, we wrap code from the Eigen C++ library, the Armadillo C++ library. In R, we love numerical computing libraries, and we love creating lovely interfaces to them from R. So this is a new numerical computing library, and it has some interesting attributes. First of all, it's open source, as I said before. Second of all, it's actually hardware independent. So a TensorFlow program can run equally well on a CPU and take advantage of all the cores on a CPU, and that's actually using the Eigen and BLAS libraries that we already use from R. But it can also run on a GPU or multiple GPUs. And there's even a thing that Google has created called a TPU, a tensor processing unit, which is hardware that is designed to run TensorFlow programs.

So hardware independence is a really cool attribute of TensorFlow. Another really cool attribute of it for numerical computing is that it supports automatic differentiation. And that's used extensively in the deep learning parts of TensorFlow. The other thing is it was built from the ground up for deployment, and scalable deployment, so it supports distributed execution, it supports very, very large datasets. So it's a really cool numerical computing library that we can do a lot of things with in R.

And that's kind of the first bullet point of why our users should care. New numerical computing library, what can we do with it? One of the other cool things about it is that there are many built-in optimization algorithms that don't require that all the data is in RAM. So typically, the optimization algorithms we use require that we manage to get the whole dataset in RAM. But now we can have a 10 gigabyte dataset, and we actually feed data to models just a small batch at a time, and the optimizer will work correctly. So it lets us do modeling and machine learning on very large datasets without having to have a huge amount of RAM.

Another piece of TensorFlow that's interesting: typically, when we think about building a model in R and then we want to deploy it, we actually need to bring R code along with it in the deployment. But the whole design of TensorFlow is that when you build a TensorFlow model, you can deploy it entirely separately. You don't need R or Python or any other code; it's just a C++ runtime. So this is actually really cool, too.

Tensors and the TensorFlow graph

So some of the basics of TensorFlow: where does that name TensorFlow come from? What are tensors? What's flowing? Just to give you a baseline of what we're talking about here before we get into talking about deep learning. So tensors: actually, everyone here already works with tensors pretty much all the time. Tensors are just multidimensional arrays. So in R, the core data types, vectors and matrices, are tensors. A one-dimensional tensor is just an R vector, a two-dimensional tensor is an R matrix, and then you get into 3D and 4D arrays, which R also supports.

A 0D tensor is a scalar. R doesn't have a scalar data type, but you can think of one conceptually as a vector that's always of length one. So you're going to be dealing with tensors, and the good news is we already deal with tensors all the time. So there's really nothing new to learn there.

I want to give some examples of tensor data, kind of how it plays out in TensorFlow. And one of the things to remember, if you think about a data frame, each row is an observation. So there's always a dimension that's dedicated to observations or samples. So in the case of a 2D tensor, I can just take a data frame, turn it into a matrix, and now I've got a 2D tensor. That's something we're all familiar with in R.
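As a tiny base R illustration (the data frame and its column names here are made up):

```r
# A data frame where each row is an observation (sample)
df <- data.frame(height = c(1.7, 1.8, 1.6),
                 weight = c(65, 80, 55))

# Converting it to a matrix gives a 2D tensor:
# dimension 1 = samples, dimension 2 = features
x <- as.matrix(df)
dim(x)  # 3 2
```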

Time series data, or sequence data, is an example of a 3D tensor, even though sometimes that's represented in 2D. You can think of a time series object in R. If you think about a time series, you're actually considering not just the observations, but what happens to the observations over time. So it's really a 3D entity that you're analyzing. It's the features, the time steps, and then groupings of those. Image data is an example of a 4D tensor. Again, you might think of an image as a 3D tensor. You'd say, well, it has height, and width, and depth, which is color channels, like red, green, and blue. That's 3D. But then if you consider samples, again, you end up with a 4D tensor. And similarly, a video would be like a 5D tensor, because it's multiple frames, multiple images.
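In base R terms, those higher-rank tensors are just arrays with a leading samples dimension. A quick sketch (the shapes here are illustrative, not from the talk's slides):

```r
# 3D tensor for sequence data: (samples, time steps, features)
ts_batch <- array(0, dim = c(32, 24, 6))       # 32 series, 24 steps, 6 features

# 4D tensor for image data: (samples, height, width, channels)
img_batch <- array(0, dim = c(32, 28, 28, 3))  # 32 RGB images of 28x28 pixels

# 5D tensor for video: (samples, frames, height, width, channels)
vid_batch <- array(0, dim = c(4, 16, 28, 28, 3))

dim(img_batch)  # 32 28 28 3
```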

So what is this flow business that we're talking about? TensorFlow programs are not just like an R script that executes. You actually build up a data flow graph, and those tensors, the data, flow through the graph. The nodes of the graph are operations. So an example of an operation inside a TensorFlow graph is a matrix multiplication, or the addition of a bias term, or taking gradients, or applying some kind of optimizer. Those are like functions that operate on the data, and it's all put together inside a graph.

When you work with TensorFlow, you don't usually program the graph directly. You can program the graph explicitly; there are interfaces to do that. But typically, you actually write much, much higher level code. And I'm going to explain this R code in detail shortly. But this is an example of a model that I've written in R, and that's, on the right, what it actually looks like as a TensorFlow graph. So the graph, when you're using high level APIs to TensorFlow, is basically generated, and you don't have to reason about your model at the level of the graph. But the graph is still there.

Well, we can actually think of other examples that we have in R of these sorts of intermediate representations. You can think of a Shiny application: when you build inputs, and outputs, and reactives, you're actually creating a graph, and then Shiny executes your graph in a way that's optimal and efficient. So the fact that Shiny knows the structure of your program allows it to run your program in a way that's really optimal and straightforward. When you write dplyr code that works against a database, it generates SQL. SQL is this intermediate language that's then fed to a SQL query optimizer, which executes the query in an efficient way. So similarly, what Google wanted to do with TensorFlow was: if we can get the representation of your model or your program into a graph, then we can run it really fast.

We can run it in parallel. We can run it distributed. We can look at the operations in your graph and fuse them together when possible. And then the other thing is we can run this without R or Python, just with C++. So really, the benefits of the graph are all about this idea of portability, performance, and scalability.


So what are people using TensorFlow for? I'm actually going to talk about all of these in turn. They're all listed on our TensorFlow for R website. A lot of people are using it for deep learning, but people are also using it for a lot of classical machine learning. These are some of the examples written up, and these are all long-form blog posts that get into things people are doing. But you can see some text classification stuff, some examples of trying to do predictions on really noisy data. There's computer vision applications, time series applications.

And beyond these examples, I'm actually hopeful and confident that the R community, when given access to this library, is going to do some additional really interesting things that surprise us. I have one example: a project called Greta, which is similar in aims to BUGS or Stan. The idea is to write statistical models and fit them with MCMC. And the way Greta works is I write a statistical model in R, not in some kind of separate specialized language. And then that model is compiled to a TensorFlow graph, and that actually uses the R tensorflow package. And the benefit of this is, A, I'm writing in R instead of a specialized language. But B, because it's using TensorFlow, I can train it on really large data sets. I can train it with GPUs. I can actually deploy the model. So this is an example of something that really has nothing to do with deep learning that people have already done with the R interface to TensorFlow.

What is deep learning?

OK, so now let's get into deep learning a little bit, because it's probably the main thing people do with TensorFlow. And I think it's important to understand what it is, what it's useful for, what it's not useful for, how much we in the R community should care about it, and how it works.

So at a really, really high level, deep learning is taking some input and transforming it into some output via successive layers of representation. And I know it's really abstract to say input and output; what I mean there is observations, or x data, and predictions. So it's x to y. But how do we do that? We do that by taking the original input, in this case a grayscale image of a handwritten number 4, and successively transforming it until we get close to the output: a prediction about what the digit is, the number 4. That's the basic mechanic of how deep learning models work.

So I'm talking about layers. I showed four layers there. What exactly is a layer? I'm not going to talk about the mechanics, which maybe some of you have seen before, of neurons and how they talk to other neurons and all that stuff. All you really need to consider a layer as is a function: a data transformation function. It does a geometric transformation of data, and it's parameterized by a set of weights and a bias, just like a linear equation. So you can think of each one of these as just a successive transformation of data. That's all a deep learning model is: taking some data and then chaining together a bunch of transformations of the data until we actually get an output, or in this case, a prediction.
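Viewed this way, a dense layer is just a function of its input, parameterized by a weight matrix and a bias vector. Here's a minimal base R sketch (not Keras code; the weights are random stand-ins for learned values):

```r
set.seed(1)

relu <- function(x) pmax(x, 0)

# A "layer": activation(input %*% W + b), parameterized by W and b.
# sweep() adds the bias to each row of the result.
dense_layer <- function(input, W, b, activation = identity) {
  activation(sweep(input %*% W, 2, b, "+"))
}

# 2 input features -> 3 output units
W <- matrix(rnorm(2 * 3), nrow = 2, ncol = 3)
b <- rnorm(3)

x <- matrix(c(1.0, -0.5, 0.3, 2.0), nrow = 2)  # two observations, 2 features each
h <- dense_layer(x, W, b, activation = relu)
dim(h)  # 2 3
```

Stacking several of these functions, each with its own weights, is the whole model.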

So when I say representations, what am I talking about? We really want to take our data and transform it in a way that's closer to the domain of prediction that we're looking for. So this is a really simple example. I've got raw data, a bunch of points, and I want to be able to predict whether a point is black or white. If I change the coordinate system, all of a sudden, the problem becomes really simple, because everything with x greater than zero is a black point, and everything with x less than zero is a white point.
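You can sketch that coordinate-change idea in a few lines of base R. This is a hypothetical version of the example: points labeled by which side of a diagonal line they fall on become separable on the sign of a single coordinate after a 45 degree rotation:

```r
set.seed(42)

# Points labeled by which side of the line y = x they fall on
pts <- matrix(runif(200, -1, 1), ncol = 2)
label <- ifelse(pts[, 1] > pts[, 2], "black", "white")

# Change the coordinate system: rotate by -45 degrees
theta <- -pi / 4
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
new_pts <- pts %*% R

# In the new coordinates, x > 0 means black, x < 0 means white
predicted <- ifelse(new_pts[, 1] > 0, "black", "white")
mean(predicted == label)  # 1 (perfectly separated)
```

A deep learning model learns transformations like this rotation from the data, rather than having them hand coded.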

So the idea is we want to transform our data so it's easier and easier to get close to the actual output that we're looking for, the prediction that we're looking for. If you work with conventional machine learning models, you'll know this as sort of feature engineering, where we're trying to transform our data into a form that works better for the prediction task. And in deep learning, the feature engineering is actually done in the layers. The feature engineering is actually learned rather than hand coded.

So back to this example of the handwritten digits, and I'll explain a little bit later how this works. But basically, we're taking raw data, and these layers are trying to filter out irrelevant information, like how the image actually looks. And it's trying to find other things, like where are edges, what are the angle of the edges, and get closer and closer to the domain of output or prediction. So you can think of it as sort of an information distillation pipeline.

And so this gives us an intuition about where the word deep comes from. Why do we call this deep learning? It's not because it gives us deeper insight or deeper models. It's really just talking about the fact that there are multiple layers of representation. A traditional machine learning model might have some feature engineering and then one or two layers. A deep learning model could have 20 layers, or 30 layers, or 100 layers. So maybe a more accurate term for deep learning would be layered representation learning, or hierarchical representation learning. The word deep maybe implies more than it should, but it's really just talking about the stacking up of layers.

And so what has this method achieved? You can see from the list here, it's done really, really well on a lot of perceptual tasks, like speech recognition and image classification. And then those, in turn, have been composed together to do things like trying to build autonomous driving systems, to build reinforcement learning systems that play games. So it's achieved a lot, this fairly straightforward mechanism.

And to give you an intuition for why it might be able to do that, if you think about a paper ball that's crumpled up, and you think about trying to write a linear equation for how to uncrumple the paper ball, it would be really, really difficult to do that. But if you decompose it into a set of simpler geometric transformations, i.e. layers, then it's actually straightforward. So a human being might just take that paper ball, and in 30 or so really simple operations, uncrumple it. And that's really how deep learning models work, is that they learn these simple geometric transformations that compose together can do very, very complex transformations, and therefore solve very complex functions, or fit very complex functions.

Deep learning and the R community

So why should we care about this? I actually think the domains where deep learning has proven to perform well are ones that are not often of interest to R users. These perceptual tasks, like vision and speech recognition, and these reinforcement learning applications, are not things that most R users are concerned with. So it may be that as an R user, you think it's cool that you can now do state of the art computer vision from R. But it's probably more interesting to ask: does deep learning provide improvements on the techniques we have for our traditional domains?

And is there data that we analyze that has very complex sequence or spatial dependencies that's hard to model with traditional modeling techniques? Is there data that requires a huge amount of feature engineering that's potentially brittle to model effectively? So I think it's more interesting to think about, is deep learning applicable, and how, and when, to the things that we traditionally do? And I'll get into some examples of that in a little bit. But I think it's important to note that it's definitely proven to be effective at these perceptual tasks. But it's not yet proven that it's of widespread benefit in other domains, although certainly people are working hard at it.

How deep learning models are trained

So I want to talk a little bit about the mechanism of how these models are trained, what happens, and try to convince you that this is actually a pretty simple and straightforward mechanism. And it's actually kind of surprising that it's able to solve the kind of problems that it solves. So I'll start with a little bit of the basics of how machine learning algorithms work, and then give you an example of kind of how the training loop for a deep learning model works.

So just to define some terms: the way machine learning algorithms typically work is that you start out with a bunch of data. You have x, and you have a known y. And you basically just feed batches of that data into the model. The model incrementally improves its coefficients by examining each prediction, asking how close the prediction was to the actual value, and then adjusting the weights. So the training process is iterative, taking advantage of a loss function to evaluate the model and then tweak it over time.

There are significant differences in orientation between statistical modeling and machine learning. I'm not going to go into a lot of depth about that; I have some links here for you to take a look at. But one you'll notice right away is that oftentimes statistics is focused on explanation and understanding; it's focused on inferring the process by which data is generated. And machine learning, in a lot of cases, is exclusively concerned with efficacy of prediction. Can we predict things? We don't need to explain or understand the phenomena; we just want to predict them. So as you're going to see, these deep learning models are total black box models. They do nothing to help you explain or understand phenomena. But they work well for prediction.

So let's take a look at a model. Again, I showed this before, and I'll get into more detail on it later. Here's the definition of the layers of a model in R. There are different types of layers. I'm going to compose them together, and I'm going to hope that I can train this model to recognize handwritten digits. And there's an example of what the model is going to learn along the way: it's going to learn these filters that help go from an image to a prediction.

So how does that actually happen? So when a deep learning model starts, it has all these different layers. The layers have weights. The weights, interestingly, are randomly initialized. So in the model at inception, it literally has random weights. So the predictions that it makes are garbage, because the weights are random. And so what happens is that input is fed into the model, and predictions come out that initially are wrong, badly wrong. But what happens is that the predictions are measured against the true known targets with a loss function. So we get this assessment of how good the predictions were. Again, initially, they're quite bad. And what then happens is that loss function is used to update the weights. And that is the job of an optimizer.

And again, as I mentioned before, these optimizers can work with just little batches of data, 128 elements at a time. They don't require that all the data is in memory. So we feed the data into the model. We find out how bad the prediction is. We use that to tweak the weights. And then we repeat that thousands and thousands of times until we have a model that performs in a satisfactory way. So the actual mathematical mechanisms at work here, and the basic mechanics of the whole thing, are really straightforward.
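Here's that loop in miniature, in plain base R: mini-batch gradient descent fitting a simple linear model. This isn't a neural network and isn't Keras or TensorFlow code, just a sketch of the mechanism described above: random initial weights, small batches fed in, a loss measured against known targets, weights nudged, repeat.

```r
set.seed(123)

# Simulated data: y = 3*x1 - 2*x2 + noise
n <- 1000
X <- cbind(rnorm(n), rnorm(n))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.1)

# Randomly initialized weights, as in a deep learning model
w <- rnorm(2)
lr <- 0.1           # learning rate
batch_size <- 128

for (step in 1:500) {
  idx <- sample(n, batch_size)          # one small batch at a time
  Xb <- X[idx, ]; yb <- y[idx]

  pred <- Xb %*% w                      # forward pass: make predictions
  err  <- as.vector(pred) - yb         # compare to the known targets

  grad <- t(Xb) %*% err / batch_size    # gradient of the mean squared error loss
  w <- w - lr * as.vector(grad)         # "optimizer": nudge the weights
}

round(w, 2)  # close to the true values 3 and -2
```

Note that the loop never needs all the data at once: each update sees only a 128-row batch, which is why this style of optimization scales to data that doesn't fit in RAM.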

And this tweet is actually from the creator of the Keras library. And he's making the point that there's really nothing complicated going on here. But what happens is that once we take this simple mechanism and we scale it up, it ends up looking like magic. And we return to this geometric interpretation that we had earlier. It's basically a mathematical mechanism or machine for taking really complex manifolds of high dimensional data and uncrumpling them.


And the ideas behind this actually originated 30 years ago. We've known about them for a long, long time, but until about 2012, this actually didn't work very well in practice. The thing that changed is that it turned out we needed really, really large models with lots of big layers, trained on lots and lots of data. GPUs allowed us to train much, much larger models, and the internet allowed us to collect a lot more data. That's what changed and made deep learning actually work, even though it was invented conceptually 30 years ago.

So what do I mean by sufficiently large parametric models? The model I just showed you, the grayscale digit recognizer, has about 1.2 million coefficients, or weights, or parameters. And that's actually a pretty modest model; grayscale digit recognition is a lot easier than a lot of computer vision tasks. So this one has 1.2 million parameters. A bigger model that can recognize all kinds of everyday objects in color images has about 138 million coefficients or parameters. So when we say sufficiently large, that's what we're talking about: potentially tens of millions of parameters that need to be learned. And you can see why these models have very little explanatory power, very little ability to help with your understanding, because you're going to get 138 million parameters, and how do you interpret that?
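Where do parameter counts like these come from? For a dense layer it's just (inputs x units) plus one bias per unit. Here's the arithmetic for a hypothetical small MNIST-style stack; this is an illustration, not the exact architecture from the slide (the 1.2 million figure in the talk comes from a larger model):

```r
# Parameters in a dense layer: one weight per input-output pair, plus one bias per unit
dense_params <- function(n_in, n_out) n_in * n_out + n_out

# A hypothetical stack: 784 pixel inputs -> 256 units -> 128 units -> 10 classes
total <- dense_params(784, 256) + dense_params(256, 128) + dense_params(128, 10)
total  # 235146
```

Even this toy stack has about a quarter of a million learned numbers, which is why layer sizes and counts blow up so quickly.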

Frontiers of deep learning

So computer vision is kind of the poster child for what deep learning has accomplished. There's a competition that ran from 2010 to 2017 called the ImageNet Challenge, where machine learning researchers tried to build models that would predict what images are. And the images were like dog, cat, table, chair. So when they started the competition in 2010, no entrants used deep learning, and they had 71.8% prediction accuracy. In 2012, the very first team that used deep learning entered the competition, and that team beat the rest of the field by 11%. And that basically ignited everybody to start looking at deep learning for computer vision. And by the time we got to 2017, in a period of seven years, the accuracy in computer vision went from 71.8% to 97.3%, which is phenomenal. That, more than anything, got people excited about deep learning.

And actually, the competition is not going to exist anymore after 2017, because this particular task is basically considered solved. So clearly, it works really well in computer vision and image classification, and you see a lot of applications for it there.

People are also trying to apply it to natural language processing, and it's not yet clear how well these techniques work. There's a lot of research going on. The plot on the right shows the percentage of papers at computational linguistics conferences that use deep learning; over roughly the same time frame, it went from around 35% all the way up to 70%. So again, it could be that it's a fad, it's trendy, everyone's trying to see what they can find out. But if you read the paper I linked to, and these slides are going to be available after the talk, you'll see that people are making progress in natural language processing on different things. Clearly, a lot of progress has been made in language translation.

Similarly, people are looking at really complex time series forecasting problems. This is a paper that actually uses convolutional neural networks, which are the same kind of neural networks that are used for images. It's looking for spatial dependencies that are invariant across the whole time series. So they're trying to use techniques from computer vision on time series. And in this case, with the data sets they had in this paper, they were able to beat some benchmarks of conventional modeling. But I have not seen, in time series, that it's all of a sudden revolutionizing the field and that you should always use deep learning for time series. On the contrary, I think there are certain types of time series it might work a lot better for. But that's really a matter of exploration and research.

There's a lot of work going on in biomedical applications. And this could be everything from patient treatment, patient classification, to actual fundamental biological processes. This paper just came out this month. And it's kind of a roundup of the work that's going on there. And I think they conclude similarly. There's not evidence we're revolutionizing this field or these fields. But there are promising advances on the state of the art worth paying attention to.

This is a very, very interesting example. This paper just got published this month. It was a study done by Google, Stanford, the University of Chicago School of Medicine, and the University of California, San Francisco. They wanted to see if they could provide better predictions about patient outcomes based on electronic health records. And what they pointed out is that electronic health records are really messy. They're uneven. There are different features for different patients. The data is collected in inconsistent ways. So typically, a huge amount of effort is taken to select out a bunch of features, clean up and normalize those features, and then do predictions using conventional machine learning or statistical models.

But what they point out in the paper is that when you do that, you're actually discarding a huge amount of data. All the features you excluded because they weren't coded consistently, things like the handwritten notes of a physician, are all thrown out. So they said: what if we built a deep learning model that considered the entire medical record of every patient, including the physician's notes and the fields that aren't normalized against each other, and see if we could beat conventional methods of prediction? And in this case, they did find that they could beat state of the art statistical models in a bunch of categories.

I don't think it was necessarily easy to get this result. If you look at the paper, Jeff Dean is on the paper. He's like the inventor of TensorFlow. He worked on this. So I don't know that anybody could just walk in and get these kind of results. But it's interesting that it's possible. It's a very, very different approach to prediction. It's basically saying that the feature normalization and feature extraction is going to be done by the layers of the deep learning model, not by kind of human reasoning about the data and human composition of like a sensible model. We're just going to learn what the model is.

There are quite a few problems with deep learning models. I've alluded to some of these already. The fact that they're black boxes you can't interpret rules them out for a whole bunch of things that we want to do with statistics and machine learning in R. They can be brittle. So if you look up adversarial examples, there's a really classic one where you see two pictures of a panda. To the human eye, they look completely the same. But one panda has been tweaked in really subtle ways to force the model to make a totally different prediction. So people are trying to overcome adversarial examples with various techniques.

They typically need a large amount of data to perform well. Like I said, we've known about deep neural networks for 30 years. And they didn't really work very well until now. So they typically need a lot of data. Although there are ways to transfer knowledge from a model built on larger amounts of data to a model that's trained on smaller amounts of data. So there are ways to use deep learning for smaller data sets. But they typically need a lot of data. And they're very computationally expensive to train as well.

So this creates another problem, which I think is going to get worse and be annoying to all of us, which is that there's a lot of hype about deep learning. There's actually a lot of hype about AI, which as far as I can tell means I have an if-then statement. So it's AI. And unfortunately, the current tools for doing deep learning models have become so good that software engineers or even laypeople with no training in modeling or statistics can actually build these models. And they'll actually think that they've built a model that's really great. They'll say, oh, look, I've got 84% accuracy. I built this model. I know nothing about modeling. I know nothing about statistics. I know nothing about probability. But here's a model. And it's deep. So it's obviously better than whatever anybody else has.

But typically, you can't actually outperform traditional statistical modeling techniques without a lot of effort. So usually these models are going to be bad compared to the models that we build. And so we're going to have to deal with this. We're going to have to patiently explain why, well, maybe that model isn't the best one for us to use. I don't think this is reason for us to throw up our hands and say, we don't want anything to do with this. I think what we need to do is promote a more balanced and knowledgeable and nuanced dialogue about what these things are good for and what they're not good for.

R interfaces to TensorFlow

So kind of coming back to this idea of frontiers: many fields have a deep learning frontier that has not yet been reached, or even well approached, and in most of these cases the traditional statistical modeling and machine learning methods are cheaper and more accurate. So once you get out of computer vision and language translation, a lot of this stuff is in the research stage, the speculative stage.

In R, we actually have a bunch of different interfaces, high level and low level. We have some tools that help you be more productive with your workflow and managing experiments, ways for you to easily use GPUs for training, and then hopefully an adequate amount of learning resources so that you can better understand how to use these tools in your own work. So at the top level, we actually have three different APIs for TensorFlow. One is the Keras API, which I'm going to talk quite a bit about. That's a very high level interface for defining neural networks; the R code that I've shown you so far has been from the Keras API. There's another API called the Estimator API, which provides more classic classifiers, regressors, support vector machines, random forests, kind of classic statistical machine learning models for TensorFlow. And then the Core API is what Greta used; this gives you full access to the entire computational graph.

This maps out into a suite of, today, seven different R packages: three for the interfaces to TensorFlow, a package for working with large data sets called tfdatasets, and then, as I said, supporting packages for deployment, for managing experiments, and for interfacing with Google Cloud ML.

The Keras API

So let's talk a little bit about Keras. I wanted to make a little bit of a note about why Keras. At this point, especially over the last six months, Keras is being promoted by Google as kind of the preferred interface to TensorFlow. Keras works great for end user applications. On the left you can see the Google search trends for deep learning frameworks over the last three or four years, and you can see that both TensorFlow and Keras are pulling away from other frameworks. On the right, you can see the citations of deep learning frameworks in research papers. While TensorFlow has the most citations, Keras, even though it's a really great, high level, easy to use API, is also used quite a bit in research. So you can go pretty deep with Keras as well.

I want to walk through what Keras code in R looks like, using that MNIST example again of handwritten digit recognition. The first thing we do is data preprocessing, and this really just has to do with reshaping and scaling data into tensors. In R, a lot of times you're going to be dealing with matrices. You're going to get your data off of disk, or out of a data frame, and you're going to turn it into tensors. So that's the data preprocessing step.
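As a sketch of what that preprocessing might look like with the keras R package (the variable names here are illustrative):

```r
library(keras)

# Load the MNIST dataset that ships with Keras
mnist <- dataset_mnist()
x_train <- mnist$train$x   # 60000 x 28 x 28 integer array
y_train <- mnist$train$y   # 60000 integer labels (0-9)

# Reshape each 28x28 image into a flat vector of 784 pixels,
# and rescale pixel values from [0, 255] to [0, 1]
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_train <- x_train / 255

# One-hot encode the labels into a 60000 x 10 binary matrix
y_train <- to_categorical(y_train, 10)
```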

And then you're going to define your model. So I've shown this before. This is the definition of a model, which basically says, what are my layers, and how are they going to behave? So that's the model definition. And then you're going to compile the model, which basically says, what's the loss function and optimizer that I want to use during training? And what metrics do I want to collect during training? So this is really the heart of making a Keras model. Define the layers and compile it.
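A minimal sketch of defining and compiling such a model with keras (the layer sizes and dropout rates here are just for illustration):

```r
library(keras)

# Define the layers: a stack of dense layers with dropout,
# ending in a 10-way softmax over the 10 digit classes
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile: choose the loss function, optimizer, and metrics
# to collect during training
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)
```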

I want to make a quick note about this second statement here, the call to compile. You'll notice I do not assign the result of that expression back to the model. I'm actually modifying the model in place. That's not conventional in R. Typically, R objects have value semantics: you modify the object, and you get back a copy of it that has the modifications. But Keras models are these acyclic graphs of layers. They are stateful. They get updated during training. I haven't shown this yet, but a layer can actually be repeated at multiple places within the model. So Keras models are by-reference objects, and in the code you'll see, when we mutate them, when we compile them or fit them, it's all done in place.

So now training. The training process is done through the fit function. We basically say, here's our data, and in this case, we're going to feed 128 samples at a time to the model. We might have millions of samples; we'll just do 128 at a time. We want to traverse the input data set 10 times. That's what epochs is. Since we're going to iteratively learn on batches, it's not enough to see the data once; we're going to see it multiple times. And then as we're fitting, we're going to hold out 20% of the data and use that to validate that we're not just overfitting to our data set. A deep learning model can actually just memorize the data, which gives you a function that really isn't useful, because all it has done is memorize the data set. So we hold out data and test against it to make sure we're not merely memorizing.
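In code, that training step might look like this (assuming the model and preprocessed data from above):

```r
# Train for 10 passes over the data, 128 samples per gradient
# update, holding out 20% of the training data for validation
history <- model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 10,
  validation_split = 0.2
)
```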

So that's what training looks like. When we're done with training, you can see we assign back to a history object, and we can plot the history to see how the training proceeded and how our accuracy and loss improved or didn't improve. Later, we can evaluate the model, which takes yet another set of data that we've held out, making sure the model has never seen this data, and tests how good the predictions are. Did the model really just overfit the data, or does it generalize well? And then we can use predict or predict classes to actually generate predictions from the model.
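A sketch of those last steps, assuming a held-out test set (`x_test`, `y_test`) prepared the same way as the training data:

```r
# Visualize how loss and accuracy evolved across the epochs
plot(history)

# Evaluate on data the model has never seen before
model %>% evaluate(x_test, y_test)

# Generate class predictions for new observations
model %>% predict_classes(x_test)
```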

So I actually want to show you a little bit of what that looks like in RStudio. This is a complete end-to-end script that I just showed you in slides. So again, when we start, we'll load the data. The MNIST data set's actually built into Keras. So I get the raw data in. And then I'm going to turn that data into some matrices that I'm going to be able to feed into the model. You can see on the side now I've reshaped the data into a bunch of matrices. And then I define the model. And then I compile the model. I can print a summary of the model to see how many parameters I have, what the layers are, and how they work. And finally, I fit the model.

And as the model's being fit, we're going to actually show you the training metrics in real time in RStudio. And this is actually extremely useful because a lot of times when you fit a model, you're overfitting, your accuracy is not improving, the model doesn't work. And it's really good to be able to visualize that as you fit it so then you can stop training and say, nope, that architecture doesn't work, those hyperparameters don't work, I need to try to do something else.

I can also plot the history. And you can actually get at the data frame behind that if you want to do various types of custom visualizations of your training history. Evaluate to see what my accuracy looks like on data I haven't seen before, and then generate predictions. So that's just the basic mechanic. Models can get more complicated, but some Keras scripts are really this simple. This is all you do. Not to say that the task of actually designing and building and getting the model to work is simple. A lot of times, that takes a huge amount of iteration.

Layer types in Keras

So layers are kind of the core construct in Keras. There are actually 65 different layers available, and you can create your own layers. This is just a sampling of some of them. I'm going to talk categorically about some of the types of layers that there are.

There's dense layers, which are kind of the staple of neural networks. These work the same way as the traditional neural networks that were written about and used in the 70s, and they're a staple of every deep learning model. Even if you're using more sophisticated layers, at the end of the model you'll usually have these dense layers. And dense layers are really just a bunch of weights and biases that are applied and cascade through other dense layers.

Convolutional layers are used most commonly for computer vision. They're basically trying to find spatial features, in such a way that when they learn, for example, what an edge is or what an eyeball is, that knowledge is transferable to other parts of the image. So if I see an eye in a certain quadrant of the image, and later on I see an eye in another quadrant, I can still recognize that as an eye. The reason it's convolutional is that we're building up these filters that recognize patterns, and then we're convolving, sliding the filter across the entire image.
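A sketch of a small convolutional stack in keras (the filter counts and kernel sizes here are illustrative, sized for 28x28 grayscale images like MNIST):

```r
library(keras)

# Convolutional layers learn spatial filters that slide across
# the image; pooling downsamples, and a final dense softmax
# layer classifies the extracted features
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")
```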

Recurrent layers are basically layers that have some memory or state. So if you're dealing with sequence data, like time series data or text, where it matters not only what I'm seeing now but what I've seen in the past, recurrent layers are able to maintain state. They're oftentimes used in these sequence-oriented applications. And then embedding layers, for those of you who do natural language processing, are a way of vectorizing text. One way to vectorize text would be to say, I've got an instance of the word cat and an instance of the word dog, and they're just classes: cat versus dog. Another way would be, I've got a vector that represents semantic relationships between words. So in one of the dimensions of that vector, one of its axes, I've got a thing that says, oh, cat and dog are both animals. Embedding layers do this kind of vectorization.

And you can actually learn the embedding jointly with building the model, or you can load a conventional pre-trained word embedding, like word2vec, into an embedding layer. So those are just a few of the layer types, but again, there are 65 layers. And a lot of what doing deep learning is, is trying to figure out the right composition and behavior of those layers to get a model that works well.
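A text model might combine the last two ideas, feeding an embedding layer into a recurrent layer. A sketch (the vocabulary size and dimensions here are illustrative, for a binary task like sentiment classification):

```r
library(keras)

# Map each of 10,000 word indices to a dense 16-dimensional
# vector, then let an LSTM read the sequence while maintaining
# state, ending in a binary sigmoid output
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 16) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")
```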

Workflow tools and GPU training

Kind of first and foremost is using GPUs. Some deep learning models work great on CPUs. Ones that use convolution or recurrent neural networks tend to be very slow on CPUs. So if you go to our TensorFlow for R website, there's a bunch of resources about different ways you can use GPUs. One is if you have a high-end NVIDIA graphics card, you can use a GPU on your local system. There's cloud services that let you do batch jobs that use GPUs. You can set up a cloud server with RStudio Server on Google Compute or Amazon or Azure. And there's even ways to actually have a virtual cloud Linux desktop that has a GPU.

We have another package called tfruns, which stands for TensorFlow runs. The idea here is that because a huge amount of experimentation is required with deep learning models, you're going to run a ton of training jobs. Instead of using the source function to run those jobs, you use a function called training_run. What training_run does is record everything about that job: the source code, the output, the performance of the model, the hyperparameters you used for the model. And then you basically have a data frame of all the model runs you've ever done, and you can compute on that data frame to try to understand better what's working well and what isn't.
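A sketch of that workflow with tfruns (the script name here is hypothetical):

```r
library(tfruns)

# Run the training script as a tracked run instead of source();
# the code, output, metrics, and flags are all recorded
training_run("mnist_mlp.R")

# Get a data frame of all recorded runs and compute on it
runs <- ls_runs()
head(runs)

# Compare the two most recent runs: source diff plus metrics
compare_runs()
```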

It also gives you a report on each run that summarizes, again, the code, the model, the metrics, the performance. Every single time you do a training run, one of these reports is generated. And then you can compare runs. So say I've examined my data frame and found my two best performing models: what was different about them? You can see in this case it's showing a source code diff saying I changed two things in the source code, and that accounts for the difference in performance.

And there's another piece called flags, which basically says: if there are important aspects of my model, it's good to externalize those, get them out of my source code, so that I can pass them to my script as parameters. You can see I'm defining a set of flags here, then using the flags when I define a layer, and then passing one of the flags when I train the model. They're basically command line parameters for your script that let you easily vary things.
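A sketch of how flags might be defined and used inside a training script (the flag names and defaults here are illustrative):

```r
library(keras)
library(tfruns)

# Externalize hyperparameters as flags with defaults; they can
# be overridden from the command line or a tuning run
FLAGS <- flags(
  flag_integer("units1", 256),
  flag_numeric("dropout1", 0.4)
)

# Use the flags when defining the model
model <- keras_model_sequential() %>%
  layer_dense(units = FLAGS$units1, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = FLAGS$dropout1) %>%
  layer_dense(units = 10, activation = "softmax")
```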

And then that in turn enables this construct called a tuning run that says, okay, I've got two types of dropout. In this case, I'm gonna vary dropout from 0.2 to 0.4. And I want you to try every combination of those two hyperparameters and then tell me what the best model is. And if the number of combinations gets very, very large, you know, into the thousands, then you can actually sample from that and only run 50 training jobs rather than thousands of them to try to evaluate what the best model is. But these sort of tools end up being critical to getting deep learning models that perform well because you have to try so many different things.
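A sketch of a tuning run over two dropout flags (the script name and flag values here are illustrative):

```r
library(tfruns)

# Try every combination of these flag values (9 runs here);
# for a large grid, `sample` would run only a random fraction
tuning_run("mnist_mlp.R", flags = list(
  dropout1 = c(0.2, 0.3, 0.4),
  dropout2 = c(0.2, 0.3, 0.4)
))
```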

The cloudml package is our interface to Google Cloud ML. Google Cloud ML basically lets you train models by submitting batch jobs, and then the model trains in the cloud. What's really cool about it is that they have GPUs, some really, really fast GPUs, and even machines with multiple fast GPUs available. They also have a service for doing the hyperparameter tuning I just talked about. So the idea is you work from the comfort of your laptop with smaller, down-sampled data, and when you actually want to do the training that's going to take three hours, you submit it to Cloud ML, and then it takes 20 minutes, hopefully, instead of three hours.
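Submitting a training script with the cloudml package might look like this (the script name and machine type shown are illustrative):

```r
library(cloudml)

# Submit the script as a batch training job on a GPU machine;
# results and saved models are collected when the job completes
cloudml_train("train.R", master_type = "standard_gpu")
```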

Deployment

I talked a little bit earlier about how TensorFlow models can be deployed without an R runtime dependency. The tfdeploy package has tools that facilitate doing that. The idea is you take a model, export it, and then you can, for example, serve that model over a REST HTTP API. You can serve it using an open source project from Google called TensorFlow Serving, from RStudio Connect, or from Cloud ML. So the nice thing is these models can be served from a whole wide variety of different sources, and again, there's no R runtime dependency for serving the model.
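A sketch of that export-and-serve flow, assuming a trained Keras model like the one from earlier:

```r
library(keras)
library(tfdeploy)

# Export the model as a language-independent TensorFlow
# SavedModel (a graph plus weights on disk)
export_savedmodel(model, "savedmodel")

# Serve predictions locally over a REST HTTP API for testing;
# the same artifact can go to TensorFlow Serving or Cloud ML
serve_savedmodel("savedmodel")
```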

It's very easy to embed these models inside Shiny applications. Couple examples of that on the gallery. And then what's really cool is you can actually deploy these models not just to servers, but you can deploy them to embedded devices. You can deploy them to mobile phones. You can actually deploy them into the browser. There's a library called Keras.js and a library called DeepLearn.js that lets you take TensorFlow and Keras models and run them directly in JavaScript. So this idea that a model is this graph that's independent of the programming language used to create it is a really, really powerful one, and I think, hopefully, one that we can take a great advantage of in R.


Takeaways and resources

So what are the big, big takeaways that I hope you internalize a little bit from this talk? One is, think of TensorFlow not just as a library that people use for deep learning. It is a general-purpose computing library, it has lots to offer us, and we should be exploring those things, like what has been done with the Greta project. Deep learning has made a lot of great progress in some fields, a lot of people are working hard at making progress in other fields, and it's likely to increase in importance in those fields. It's hard to say exactly how much, but it's something a lot of people are working on. It's gonna be relevant to our work,