Earlier this week, Google released TensorFlow, an open source library for numerical computation. Given the general frothiness around machine learning, we thought folks might appreciate a simple, straightshootin’ take from indico’s Machine Learning team. Unlike a random person on the Internet, we deal with this stuff daily, and can hopefully shed some light on how this works in an environment where machine learning code is built and shipped constantly.
First, the code (check it out on GitHub)
We’re big fans of Theano, and folks seem to enjoy Alec’s tutorials showing how to use it to implement machine learning models, from simple linear regression to convolutional neural networks. Since we were curious to see how TensorFlow compares, Nathan reimplemented the same model progression using TensorFlow. Check it out, and let us know how you like it!
So, why is there so much buzz about TensorFlow?
- Because Google released it, without much forewarning. By most accounts, Google is on the frontier of machine learning in terms of creative research, infrastructure, and ML-based products. It’s cool to peek under the hood.
- Because the hype machine for “deep learning” is at full throttle. All evidence suggests TensorFlow is exactly what the page says: “a library for numerical computation”. Nevertheless, it seems that people cannot help but make exaggerated claims on the Internet. One particularly egregious example from Slate claims, “Google built a shiny new brain to make all our decisions for us…”. In case you’re wondering, TensorFlow is, in fact, not that.
- Because it looks like a really nice implementation; the kind of output that happens when a good engineering team takes the time to try a lot of things, learn from those experiments, and re-implement the good stuff.
Let’s dive in.
Which technical features in TensorFlow are most compelling, and why?
The pace of development in machine learning (and especially deep learning) is rapidly increasing, and we are thrilled to see organizations like Google participating in the open source community. When good ideas from academia and high-performance distributed computing are implemented in a promising tool like TensorFlow, it bodes well for the future!
Before we call out some of the features from TensorFlow that are particularly relevant to deep learning, it is worth emphasizing that the most compelling thing about TensorFlow is the usability and architecture of the project. Even if no individual piece were revolutionary, the fact that all of the pieces work together to let us compose, compute, and visualize models is a real differentiating factor. In much the same way, Amazon’s EC2 itself isn’t revolutionary, but the fact that it comes bundled with the suite of AWS niceties makes it an excellent product.
Here are some especially promising features:
Using the abstraction of computation graphs, TensorFlow maps the required computations onto a set of available devices. Graph and queue abstractions are powerful here, and there are many ways to solve the problem of allocating resources to the computation. TensorFlow implements what looks like a pretty sophisticated simulation and greedy allocation algorithm with methods to minimize communication overhead between resources. Other open source libraries, if they even allow you to use more than one compute device, tend to rely on the user to allocate resources statically.
Why is TensorFlow more promising? The obvious thing is scalability across distributed (possibly heterogeneous) resources, as in a datacenter or cloud deployment. The more subtle consequence is that the allocation algorithm frees us from having to manage CPU vs. GPU devices, even on a single workstation…they’re all just resources to be used as greedily as possible.
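To make "greedy allocation" concrete, here is a toy sketch in plain Python. It is not TensorFlow's placement algorithm: the `Op` tuple, the per-device cost estimates, and the fixed `transfer_cost` are all assumptions made up for illustration. Each op, visited in topological order, goes to whichever device would finish it earliest, with a penalty whenever an input has to cross devices.

```python
from collections import namedtuple

# Hypothetical op description: name, estimated cost per device, input op names.
Op = namedtuple("Op", ["name", "cost", "inputs"])


def greedy_place(ops, devices, transfer_cost=1.0):
    """Place each op (given in topological order) on the device where it
    would finish earliest, charging a fixed cost per cross-device input."""
    ready_at = {dev: 0.0 for dev in devices}  # time each device becomes free
    finish = {}      # op name -> estimated finish time
    placement = {}   # op name -> chosen device
    for op in ops:
        best = None
        for dev in devices:
            if dev not in op.cost:
                continue  # this op has no implementation for this device
            # An input produced on another device pays the transfer cost.
            inputs_ready = max(
                (finish[i] + (0.0 if placement[i] == dev else transfer_cost)
                 for i in op.inputs),
                default=0.0,
            )
            end = max(ready_at[dev], inputs_ready) + op.cost[dev]
            if best is None or end < best[0]:
                best = (end, dev)
        if best is None:
            raise ValueError("no device can run op %r" % op.name)
        end, dev = best
        placement[op.name] = dev
        finish[op.name] = end
        ready_at[dev] = end
    return placement


ops = [
    Op("load", {"cpu": 1.0}, []),
    Op("matmul", {"cpu": 8.0, "gpu": 1.0}, ["load"]),
    Op("relu", {"cpu": 1.0, "gpu": 0.5}, ["matmul"]),
]
placement = greedy_place(ops, ["cpu", "gpu"])
```

With these made-up costs, the data loading stays on the CPU while the matrix multiply and the activation land on the GPU, without the caller ever naming a device.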
Queues that allow portions of the graph to execute asynchronously.
This looks particularly useful for pre-fetching the next batch of data while the current batch is still computing. For example, when using Titan X GPUs (indico’s weapon of choice), disk I/O is often the limiting factor for some of our models. Although we work around this limitation using threaded I/O, the TensorFlow approach looks even more robust. In addition to being conceptually simpler, putting data manipulation on the computation graph allows for better device utilization.
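The pre-fetching idea can be sketched with nothing but Python's standard library. This is the general threaded pattern, not TensorFlow's queue API; `prefetching_batches` and `load_batch` are hypothetical names. A background thread fills a bounded queue so the next batch is usually ready by the time the consumer asks for it.

```python
import queue
import threading


def prefetching_batches(load_batch, num_batches, capacity=2):
    """Yield batches in order while a background thread loads the next ones."""
    q = queue.Queue(maxsize=capacity)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks while the queue is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def load_batch(i):
    # Stand-in for slow disk I/O: returns four consecutive integers.
    return [i * 4 + j for j in range(4)]


# The "computation" here is just a sum over each batch.
totals = [sum(batch) for batch in prefetching_batches(load_batch, 3)]
```

The bounded `capacity` matters: it keeps the loader from racing arbitrarily far ahead and filling memory with batches that are not needed yet.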
Visualization with TensorBoard.
As models get more complex, it is all too easy to skimp on model inspection and the practice of validating intuition. We believe visualization is really fundamental to the creative process and our ability to develop better models. So, visualization tools like TensorBoard are a great step in the right direction. We hope this will encourage the machine learning community in general to validate model internals, and drive towards new ways to train models and inspect performance.
Computations expressed as stateful dataflow graphs.
This abstraction allows models to be deployed across heterogeneous resources without rewriting them. On a single workstation, we can exploit both CPUs and GPUs, and the same model definition is easier to deploy to a heterogeneous compute environment (cloud, datacenter, etc.).
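For readers new to the term, a "stateful" dataflow graph is one in which some nodes hold values that persist between runs. The toy classes below are hypothetical and only illustrate the concept; they say nothing about how TensorFlow implements it.

```python
class Const:
    """A node that always evaluates to the same value."""
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


class Variable:
    """A node that holds mutable state across evaluations of the graph."""
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value


class Add:
    """A node that evaluates its two inputs and adds the results."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self):
        return self.a.eval() + self.b.eval()


class Assign:
    """A node that evaluates `source` and stores the result in `target`."""
    def __init__(self, target, source):
        self.target, self.source = target, source

    def eval(self):
        self.target.value = self.source.eval()
        return self.target.value


# Build the graph once, then run it several times; the counter persists.
counter = Variable(0)
increment = Assign(counter, Add(counter, Const(1)))
results = [increment.eval() for _ in range(3)]
```

The graph is described once, and each run updates the variable in place, which is the same shape a training step has when it updates model weights.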
TensorFlow is designed to work on a wide variety of hardware platforms, ranging from high-end multi-GPU rigs to smartphones. This enables developers to build and deploy machine learning applications on mobile devices, so advanced neural network applications such as language translation can run without an internet connection.
Some incremental improvements that look useful:
- Constraints on resource allocation. For example, limiting the execution of particular computations to a subset of resources that have certain GPU hardware.
- Model checkpointing. Not revolutionary, but it’s nice to have fault tolerance on long-running jobs.
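The checkpointing pattern itself is simple enough to sketch generically. The code below uses `pickle` and a made-up training loop, and is not TensorFlow's checkpoint format or API; `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical helpers. State is saved every few steps, and a restarted job resumes from the last save rather than from scratch.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, every=10, stop_after=None):
    state = load_checkpoint(path, {"step": 0, "weight": 0.0})
    while state["step"] < total_steps:
        state["weight"] += 0.5  # stand-in for a real parameter update
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulate the job dying partway through
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
train(ckpt, total_steps=100, stop_after=35)  # dies at step 35; last save was step 30
resumed = train(ckpt, total_steps=100)       # picks up again from step 30
```

Only the five steps since the last save are repeated after the simulated failure, which is the whole appeal on jobs that run for days.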
Features that TensorFlow assimilates from other projects:
- Like Theano, TensorFlow expresses models as directed acyclic graphs (DAGs) that compile into kernels to be executed on CPU or GPU. Unlike with Theano, compilation times are quite fast.
- Like Theano, TensorFlow composes operations (Ops), but unlike Theano, TensorFlow makes a distinction between operations and execution environments. In TensorFlow, a single Op (or function block) has a kernel for each device it can run on. Theano, on the other hand, creates generic Ops that are then replaced with device-specific Ops by the optimizer.
- Like Theano, TensorFlow provides automatic gradient computation.
- Like Theano + blocks, TensorFlow has the concept of monitoring nodes at any arbitrary frequency to emit events into output logs.
- Like Caffe, the TensorFlow core is written in C++ with an eye for resource efficiency.
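The "one Op, one kernel per device" distinction from the list above can be illustrated with a small registry. This is a conceptual sketch in Python with hypothetical names (`register_kernel`, `run_op`, `devices_for`); it is not how TensorFlow registers kernels in its C++ core.

```python
KERNELS = {}  # (op name, device kind) -> implementation


def register_kernel(op_name, device):
    """Decorator that records an implementation of `op_name` for `device`."""
    def decorator(fn):
        KERNELS[(op_name, device)] = fn
        return fn
    return decorator


@register_kernel("add", "cpu")
def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]


@register_kernel("add", "gpu")
def add_gpu(a, b):
    # Stand-in: a real GPU kernel would launch device code here.
    return [x + y for x, y in zip(a, b)]


@register_kernel("to_string", "cpu")
def to_string_cpu(a):
    return [str(x) for x in a]


def devices_for(op_name):
    """List the devices that have a kernel for this op."""
    return sorted(dev for (name, dev) in KERNELS if name == op_name)


def run_op(op_name, device, *args):
    """Look up the kernel for (op, device) and run it."""
    try:
        kernel = KERNELS[(op_name, device)]
    except KeyError:
        raise NotImplementedError(
            "op %r has no kernel for device %r" % (op_name, device))
    return kernel(*args)


result = run_op("add", "gpu", [1, 2], [3, 4])
```

The graph only ever refers to the op name `"add"`; which kernel runs is decided once a device is chosen. An op like `to_string`, with a CPU kernel only, simply constrains where it can be placed.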
Things we’re still looking into:
- The Theano project has spent a lot of effort optimizing memory usage. Since GPU memory is precious, it remains to be seen how the memory footprint of a Theano model compares to that of a similar model implemented in TensorFlow.
- The current open source version of TensorFlow doesn’t include the trace monitoring tool, EEG, but it looks pretty handy for evaluating performance. We look forward to it.
- Distributed functionality. The open source version of TensorFlow does not currently support distributed execution. Follow this GitHub issue for updates.
To sum it up…
TensorFlow looks like a really well-made library. As far as we can tell, it brings some welcome improvements, but it probably won’t forever change the way we do machine learning. It isn’t a new “brain” or general artificial intelligence system, and it doesn’t mean Google will be open sourcing all of its internal resources. But this release and the generally positive reception have led to a lot of interest in adopting new conventions for composing models, visualizing internals, managing computational resources, and so forth. For better or worse, it also overshadowed a couple of more traditional releases from Microsoft and Samsung this week. Like many, we’re still doing some cost-benefit analysis, and we really love seeing these high-quality tools become available to everyone.
These are exciting times!
If you’d like to chat more about the nitty gritty of machine learning, email us at firstname.lastname@example.org.