Resources

Paige Bailey | Deep Learning with R | RStudio (2020)

Originally posted to https://rstudio.com/resources/rstudioconf-2020/deep-learning-with-r/

Paige Bailey is the product manager for TensorFlow core as well as Swift for TensorFlow. Prior to her role as a PM in Google's Research and Machine Intelligence org, Paige was a developer advocate for TensorFlow core; a senior software engineer and machine learning engineer in the office of the Microsoft Azure CTO; and a data scientist at Chevron. Her academic research focused on lunar ultraviolet at the Laboratory for Atmospheric and Space Physics (LASP) in Boulder, CO, as well as the Southwest Research Institute (SwRI) in San Antonio, TX.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

This is my first RStudio Conf, though I've been wanting to come, or to come to an RStudio event, for at least the last several years. I also feel a little bit like an imposter, so apologies for that. I'm here mostly to talk about the great work that folks here have done, how it integrates with the TensorFlow ecosystem, and what we have coming down the pipe.

Full disclosure, I am a Python person, though I love R desperately, and prior to going to Google, I did R every single day, working as a machine learning engineer at Chevron and at Microsoft. So, like I said, I'm very excited to be here. And if you have questions, I'm also very happy to answer them, either about the TensorFlow roadmap, about functionality, about what exactly tf.autograph and tf.function are doing, et cetera.

TensorFlow 2 overview

So TensorFlow 2, what is new? What have we been up to for the last year, and what do we have planned for next year? As mentioned in the previous presentation, the biggest release that we had was focused on TensorFlow 2, and TensorFlow 2 was a move to eager execution: as opposed to the very complicated notion of building computation graphs in a static way, you get dynamic computation graphs. So you can do crazy things, like add two numbers together and immediately get a response, right? The same kind of Pythonic way that we like to do things with scikit-learn, or with R and caret, you can start doing with TensorFlow.
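A minimal sketch of what that eager behavior looks like from R, assuming the tensorflow R package (version 2.x) and the underlying TensorFlow library are installed:

```r
library(tensorflow)

# Eager execution is the default in TensorFlow 2: operations run
# immediately and return concrete values -- no session required.
a <- tf$constant(2)
b <- tf$constant(3)
a + b  # evaluates right away to a scalar tensor holding 5
```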

We've also added a number of additional products throughout the TensorFlow ecosystem, many of which are fast accumulating R equivalents. So today, for example, we announced on the TensorFlow blog that there's something nifty called Keras Tuner that does automated hyperparameter tuning. There is an R equivalent as of, I think, yesterday at noon, so you can go and check it out as well. We should be featuring it on the TensorFlow blog in the next couple of days.

The topics that we're going to cover today are fully dynamic layers and models with Keras, as well as model subclassing, the built-in hyperparameter tuning with Keras Tuner, and then also custom training loops.

And I think it's also a little bit important to take a step back. How many folks in the room are doing machine learning today, like for your day job? Lots of hands. And how many of you are using TensorFlow for deep learning within those machine learning jobs? Excellent. I'm assuming that everyone else is using more traditional R packages like forecast or caret, or things of that nature.

And I think that's awesome in the sense that for most sort of data science and machine learning tasks, explainability is super important. And a lot of great business problems can be solved using more traditional methods, and they don't really need kind of the hammer that is deep learning to go about solving them. But for the tasks that deep learning is great for, there really is no comparison.

So you can see a few examples on the screen here. If you haven't taken a look at some of the demos that you can do with things like TensorFlow or PyTorch recently, there are things like object detection, where you can input an image and immediately get sort of a bounding box with a confidence level for different kinds of objects. Or you can sort of do automatic speech transcription, where you have an audio file and text is automatically generated from it. Or you can do text generation, where you have an input sentence, and then suddenly a completely reasonable additional second paragraph is added. So these are things that seem like science fiction, but they're really straightforward to do with existing deep learning models, many of which have been open sourced and are freely available.

TF Hub and the TensorFlow ecosystem

So you saw before an example of TF Hub. TF Hub is a collection of models that we use internally at Google. Those examples that I was telling you about before, with speech transcription, with object detection, with image captioning, where you input an image and it automatically generates a sentence describing what's in the image without you having to do any training whatsoever: all of those are available for free at tfhub.dev. They are all in SavedModel format, and they're all usable within the R ecosystem.
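As a sketch of how one of those pre-trained models could be pulled into R, assuming the tfhub and keras R packages are installed (the module handle below is illustrative; browse tfhub.dev for current models):

```r
library(keras)
library(tfhub)

# Use a pre-trained TF Hub module as an ordinary Keras layer and
# train only a small classification head on top of it.
model <- keras_model_sequential() %>%
  layer_hub(
    handle = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
    trainable = FALSE  # keep the pre-trained weights frozen
  ) %>%
  layer_dense(units = 10, activation = "softmax")
```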

So TensorFlow 2: the official release is available today, TensorFlow 2.1 gives you support for TPUs, and all of it's available within the R ecosystem. So if you install the tensorflow R package, or if you're using Anaconda on Windows, you can run install_tensorflow() and you're off to the races.
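Concretely, the setup she describes is roughly the following (assuming R is already installed, plus Anaconda on Windows):

```r
# One-time setup: the R package, then the underlying Python library.
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()  # installs TensorFlow into an isolated environment

# Quick sanity check -- with eager execution this evaluates immediately.
tf$constant("Hello, TensorFlow!")
```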

I also want to make sure to recognize the great work that Sigrid Keydana has been doing. If you haven't taken a look at the TensorFlow RStudio blog, it is the coolest thing in the world. And it's also really interesting to see: every day a new blog post is released, it gets circulated almost immediately, especially among the TF Probability team, and everybody has excellent feedback about the R documentation and the R blog posts. Often the R documentation is better than the TensorFlow Python documentation.

So, deep learning: TensorFlow 2.1 is a perfect time to get started, especially because of this eager execution business that I was telling you about before, but also because the ecosystem is mind-bogglingly large.

So if you go to github.com/tensorflow, what you'll see is that the TensorFlow organization has a collection of 92 repositories, and each one of those repositories is a project. TensorFlow Probability would be one; TensorFlow Hub is one. You can see things like TF-Agents and the TF Model Garden repo that we're building out. We have Swift for TensorFlow language bindings, TensorBoard, Model Analysis, and Fairness Indicators, so we're building a collection of tools that you can use to make sure that your machine learning models are responsibly and ethically making their assessments and their classifications. As well as Addons, support for TensorFlow.js, pretty much anything that you could possibly imagine.

We use TensorFlow extensively at Google in every single one of our products. So BERT, for example, was just recently rolled out to pretty much everything that I use at least on a daily basis. So Gmail, docs, sheets, et cetera. And so if you want something that's production tested and battle hardened, it would be hard to find an equivalent.

From TensorFlow 1 to TensorFlow 2

So TensorFlow 1, with the Python implementation, used wonky things called sessions. We've learned a lot since then; sessions were a little bit frustrating. I'm not sure how many people in the audience are familiar with Python code, but the way the logic worked was that you would define some variables, but you wouldn't be able to use them immediately. You would have to initialize something called a session, then initialize the variables, then start a queue runner to be able to even understand what your input data would look like, then construct these awkward batches. It just felt like a very wonky experience.
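For the curious, that session-era workflow can still be reproduced from R through the compat module. This sketch (assuming the tensorflow R package) shows why it felt wonky: nothing actually executes until sess$run() is called.

```r
library(tensorflow)

# TensorFlow 1.x style: build a static graph first, execute it later.
tf$compat$v1$disable_eager_execution()

a <- tf$compat$v1$placeholder(tf$float32)  # no value yet, just a symbol
b <- a * 2                                 # still nothing has executed

sess <- tf$compat$v1$Session()
result <- sess$run(b, feed_dict = dict(a = 3))  # only now does it run
sess$close()
```

Compare that with the TensorFlow 2 version of the same computation: `tf$constant(3) * 2`, evaluated immediately, with no session at all.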

The transition from TensorFlow 1 means using Keras as the recommended high-level API. Every single addition that we make to the TensorFlow ecosystem going forward is going to be Keras-focused. I think the Keras text-preprocessing layers were mentioned in the previous talk; we're also going to see Keras integration with something called TensorFlow Extended, so all of the tooling there will be supported as well. And you'll still be able to use low-level operations, so if you need that close-to-the-machine manipulation, you can use that as well.

We've also removed duplicate functionality, and we're trying to make the syntax consistent and intuitive across the APIs, again with Keras as the standard interface, and to create compatibility throughout the entire TensorFlow ecosystem with something called the SavedModel format. You are able to output SavedModels from all of your R TensorFlow code.

TensorFlow and R

So, just like TensorFlow loves Keras, we also love R and RStudio. The majority of the development work that's been done to date on TensorFlow R packages has been done by the great team here, and I am very interested in finding out how we can better collaborate with both our users and the people who are creating those packages. We're starting something this year called TensorFlow for Data Science: building out a collection of products for folks who don't necessarily want to generate novel model architectures from scratch, but who do want to solve the world's problems in an applied machine learning setting. How can we build better tooling for that? That's partially for Python users, but a huge portion of it is for R users as well. So be on the lookout for more information about TensorFlow for Data Science.

And then also Keras: the Python implementation looks quite a lot like the R implementation. You might see a few pipes, but for the most part, it looks pretty similar.

Symbolic vs. imperative API and saved model format

So, technical differences in TensorFlow 2. We have something called the symbolic API as well as the imperative API, and the difference between the two isn't immediately obvious, right? If you're using the imperative, model-subclassing API, your model is essentially Python bytecode, whereas with the symbolic sequential and functional APIs, your model is a graph of layers. Mixing functionality between the two is not at all recommended, so if you are intending to use the Keras API, I strongly suggest picking one style and sticking with it.
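A small sketch of the two styles in R, assuming the keras R package: the symbolic model is a data structure that can be inspected and serialized before any data flows through it, while the subclassed model's forward pass is ordinary code.

```r
library(keras)

# Symbolic style: a graph of layers, fully known up front.
symbolic_model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 10) %>%
  layer_dense(units = 1)

# Imperative (subclassing) style: the forward pass is just a function.
imperative_model <- keras_model_custom(function(self) {
  self$dense1 <- layer_dense(units = 32, activation = "relu")
  self$dense2 <- layer_dense(units = 1)
  function(inputs, mask = NULL, training = FALSE) {
    inputs %>% self$dense1() %>% self$dense2()
  }
})
```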

For training loops, and for understanding the differences between the different Keras implementations: you might have heard of standalone Keras historically. That has since ceased development, so now all development work on Keras targets the TensorFlow backend. There's still the historic implementation that supports things like Theano and CNTK, but for the most part, all development for Keras is going to be focused on TensorFlow going forward. And the R bindings for tf.keras are created by the good folks here at RStudio.

So, building pipelines with TF datasets uses pretty much the same style of interface that you've seen before. The reason TF datasets exist is that if you have high-performance hardware, so GPUs, or TPUs, or TPU pods, that you're attempting to saturate, you can't really do that with the common NumPy interface or the traditional data-loading formats that you see today. So tf.data gives you the ability to build high-performance data input pipelines using just Python, as opposed to having to hand-roll these input streams in C++ or something a little bit more, at least to me, frightening.

So tf.data is a platform-agnostic data input pipeline that you can use with TensorFlow, but also, if you have a different R package that needs a similar pipeline structure for its work, you could use TF datasets for that as well.
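A typical pipeline in that style, sketched with the tfdatasets R package (the dataset and buffer sizes here are illustrative):

```r
library(tfdatasets)
library(tensorflow)

# Shuffle, batch, and prefetch so the accelerator never waits on data.
dataset <- tensor_slices_dataset(mtcars) %>%
  dataset_shuffle(buffer_size = 32) %>%
  dataset_batch(8) %>%
  dataset_prefetch(buffer_size = tf$data$experimental$AUTOTUNE)
```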

Moving along: compatibility with the TensorFlow ecosystem. So I mentioned before that we're trying to standardize the interface for TensorFlow, and we're really focusing on something called the SavedModel format. SavedModel gives you the ability to have your model, generated in R or in Python, packaged up into a friendly format: a defined directory structure that's also pretty human-readable. There's a proto, some weights, and some additional metadata associated with the model, and you can take that and put it on a mobile device, on an M-series embedded device, in a browser as a web application, or pretty much anywhere else. We've also started work on improving our C++ API for TensorFlow. You'll be seeing an announcement about that in a couple of months.

That will cover the roadmap for the C++ inference API, as well as the C++ API for training. And the idea is that all of it is focused around this concept of a SavedModel. So again, with your R code, you can generate this as an output and deploy your models anywhere that you could deploy a TensorFlow Python model.
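From R, exporting and reloading a SavedModel is a one-liner each way with the keras package. This sketch assumes `model` is an already-trained Keras model:

```r
library(keras)

# Write the model out in the SavedModel directory layout
# (protobuf graph + weights + metadata).
save_model_tf(model, "my_saved_model/")

# Reload it later -- or hand the same directory to TF Serving,
# TF Lite, TensorFlow.js, or a C++ application.
reloaded <- load_model_tf("my_saved_model/")
```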

We've also integrated TensorBoard into Jupyter notebooks. Jupyter notebooks have support for R, of course, if you want to use an R kernel, or you can also use TensorBoard within RStudio, I believe, with built-in performance profiling, both for GPU and for TPU. And then, under the hood, there are autograph and tf.function. I mentioned this a little bit earlier, but even though we're giving you the ability to do eager computation (you add numbers, you get a response, that sort of thing), autograph and tf.function take whatever arbitrary Python code you've created and build a graph out of it. So that means that your models, even though they're debuggable and easily understandable when you need them to be, are also hyper-performant when you need to run them on accelerated hardware or at scale using tf.distribute.
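In R, that tracing step is exposed as tf_function(); a minimal sketch, assuming the tensorflow R package:

```r
library(tensorflow)

# tf_function() asks AutoGraph to trace the body into a graph:
# eager-style ergonomics while debugging, graph speed in production.
scaled_add <- tf_function(function(x, y) {
  x * 2 + y
})

scaled_add(tf$constant(3), tf$constant(1))  # traced on first call, then reused
```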

Learning resources and closing

So, learning more: like I said, I strongly recommend the tutorials and the blog posts on rstudio.com and tensorflow.rstudio.com. The canonical book is the reimplementation of François Chollet's original book with J.J. Allaire, Deep Learning with R. And if you have an idea for a TensorFlow R project that you would like to see funded, and you're a university student or you know of a university student who wants to get paid for doing open source work, let me know and we can get it added to the TensorFlow Google Summer of Code project list.

So I think that is the question that I have for all of you today: how can our team better support your needs? If you want additional documentation, or if you would like a better C++ API to provide more efficient bindings, just let us know and we can discuss. My email is Paige, spelled like my name, at google.com, because I hear web pages do pretty well at Google. And that is all. So thank you so much.

Q&A

So the most popular question probably is not going to be a surprise: does Google hire R data scientists? Oh, that's cool. So yes, absolutely. When I got to Google, the first thing I did was find out where R code was being used. Google has this thing called Code Search, which lets you search our massive monorepo, so you can access the code that's used to create, like I said, Gmail, Hangouts, whatever. So if you want to find all of the Easter eggs in Hangouts chat, you can find them, or create your own. But yeah, there's a ton of R being used at Google, and many of the teams that are using R are especially interested in transitioning from more traditional machine learning approaches to using deep learning. So they definitely hire R data scientists all around Google. Also statisticians.

And also, like, I know that Kaggle recently had a data scientist who has since migrated to go work on really nifty natural language processing tools. But she was one of the best sort of R aficionados that I have met in my career. So yes. That was a long-winded answer to a very short question. But an interesting one.

The ability to interpret models has been in the news a lot. Oh, absolutely. So there's a question: are there any plans to integrate deep learning model interpretability methods, like integrated gradients, DeepLIFT, and/or SmoothGrad, as functions in base TensorFlow? So I'm not sure about base TensorFlow, but I do know that explainability is one of the things that we are most focused on, both for Google as a company and for TensorFlow as a product. Every machine learning model that we roll out at Alphabet has to go through an extensive review process, both for fairness and for explainability, and we have a collection of tools available to help with this.

So there's PAIR, the People + AI Research team, led by Fernanda Viégas, and they have a number of open source tools available for model interpretability and for explaining fairness. There's this great thing called the What-If Tool that allows you to understand what data is actually impacting your model results, even if you don't necessarily notice that it could be biased. And the nice thing, too, is that it's integrated into TensorBoard, so if you want to use it with the R models that you've created, you'd be able to do that, both by doing a Facets Dive deep dive, as well as by comparing models and comparing data points.

So I guess the TL;DR is: I don't know if we're going to put it in base TensorFlow. If anything, we're trying to get stuff out of base TensorFlow so it's more performant, less expansive, and easier to learn. But all of these things are certainly available as additional packages that you can use in conjunction with TensorFlow.