A survey of six months of rapid evolution (+ tips/hacks and code to fix the ugly stuff)
We’ve been using TensorFlow in daily research and engineering since it was released almost six months ago. We’ve learned a lot of things along the way. Time for an update!
Because there are many subjective articles on TensorFlow and not enough helpful documentation, I’ve sprinkled in examples, tutorials, docs, and code snippets wherever possible.
Community engagement is the most important thing.
When it comes to machine learning, it is easy to focus on the tech (features, capabilities, benchmarks, etc). But good programmers know it is much harder to write code that humans will use, versus code that a machine can compile and execute. So my favorite thing about TensorFlow is the simple fact that everyone in the machine learning community is aware of it, most are open to trying it, and hopefully, enough of us will use it to make useful things. More minds solving problems, more shoulders to stand upon!
A large number of developers and students are now interested in deep learning because they heard about TensorFlow. Google DeepMind recently announced they’ll be migrating from Torch to TensorFlow, so we might see an uptick in TensorFlow reinforcement learning models being released in the near future, too. The future is bright when the community embraces openness, clean APIs, useful modules, and the attitude of being helpful on the internet.
Technical blocking factors have been mostly eliminated.
When we wrote the first post evaluating TensorFlow in November of last year, there were a number of real and potential blocking factors. I’m happy to report that most of these have now been solved.
- Multi-GPU support. It works; the documentation is simple and clear. You’ll still need to figure out how to divide and conquer your problem, but isn’t that part of the fun?
- Training across distributed resources (i.e., cloud). As of v0.8, distributed training is supported.
- Queues for putting operations like data loading and preprocessing on the graph.
- Visualize the graph itself using TensorBoard. When building and debugging new models, it is easy to get lost in the weeds. For me, holding mental context for a new framework and model I’m building to solve a hard problem is already pretty taxing, so it can be really helpful to inspect a totally different representation of a model; the TensorBoard graph visualization is great for this.
- Logging events interactively with TensorBoard. In UNIX/Linux, I like to use tail -f <log_file> to monitor the output of tasks at the command line and do quick sanity checks. Logging events in TensorFlow allows me to do the same thing, by emitting events and summaries from the graph and then monitoring output over time via TensorBoard (e.g., learning rate, loss values, train/test accuracy).
- Model checkpointing. Train a model for a while. Stop to evaluate it. Reload from checkpoint, keep training.
- Performance and GPU memory usage are similar to Theano and everything else that uses cuDNN. Most of the performance complaints in the earlier releases appear to have been due to using cuDNN v2, so TensorFlow v0.8 (using cuDNN v4) is much improved in this regard.
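The checkpointing workflow in the list above (train for a while, stop, reload, keep training) can be sketched without any framework at all. This is a minimal, framework-agnostic illustration using pickle, not TensorFlow's own checkpoint mechanism; the function names, the state dictionary, and the toy "update" rule are all hypothetical stand-ins.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it so a crash never leaves a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, checkpoint_every=10):
    """Resume from the checkpoint at `path` (if any) and train up to `total_steps`."""
    state = load_checkpoint(path) or {"step": 0, "weight": 0.0}
    while state["step"] < total_steps:
        # Hypothetical stand-in for a real gradient update: nudge the weight toward 1.0.
        state["weight"] += 0.1 * (1.0 - state["weight"])
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


# Usage: train for 25 steps, "stop", then resume the same run up to step 50.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "model.ckpt")
first_run = train(checkpoint_path, total_steps=25)
resumed_run = train(checkpoint_path, total_steps=50)
```

The point of the sketch is the shape of the loop: the step counter is saved alongside the parameters, so a resumed run picks up exactly where the previous one stopped.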
Several high-quality metaframeworks.
- Keras wraps both TensorFlow and Theano backends. A good option if you want modularity without diving into the details of TensorFlow (or Theano).
- TensorFlow Slim is a great reference for image models. Even if you prefer to write your own low-level TensorFlow code, the Slim repo can be a good reference for TensorFlow API usage, model design, etc.
- Skflow wraps TensorFlow methods in a scikit-learn-style API. In my hands, it seems a bit awkward compared to just importing and inlining the Python code for various sklearn metrics.
- PrettyTensor provides objects that behave like tensors and have a chainable syntax so you can quickly compose certain kinds of models.
Maintaining a popular open source project is a challenge, especially something with the technical complexity of TensorFlow. Hat tip to the maintainers! We appreciate their strategy of integrating new features and tests first so early adopters can try things before they are documented. Check out the version semantics note if you are interested in the details of what is released and when: https://www.tensorflow.org/versions/r0.8/resources/versions.html.
Tests are great!
Tests are valuable for validating functionality and for templating how things are supposed to work. When you find something in TensorFlow that isn’t working as you expect, or maybe you are learning the quirks of a method or arguments…search GitHub for a test, and see how the test does it!
RNNs are still a bit lacking, compared to Theano.
The Theano team has put a lot of effort over the years into optimizing their implementation of recurrent neural networks. Happily, the gap is quickly closing, and in a few months, TensorFlow may very well be the platform of choice for RNNs. Specifically:
- We haven’t seen an elegant way to handle variable length sequence inputs. Bucketing works, at the cost of extra complexity that most models just don’t need. Patching and padding all sequences to a fixed length works fine in many cases (especially using batches and GPUs), but some might see it as an unsatisfying hack. Dynamic unrolling for RNNs might be a solution, but the implementation of the tensorflow.python.ops.rnn module is new and undocumented. We’re still experimenting.
- Performance and memory usage. Although it is hard to do an exact apples-to-apples comparison here, after implementing many of the same models in both frameworks, our impression is that, for RNNs, Theano is perhaps a bit faster and eats up less memory than TensorFlow on a given GPU, perhaps due to element-wise ops. TensorFlow wins for multi-GPU and “compilation” time.
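To make the padding and bucketing options from the first bullet concrete, here is a minimal pure-Python sketch. It is not TensorFlow code: the helper names (pad_batch, bucket_by_length), the pad value of 0, and the choice to put over-long sequences in the largest bucket and truncate them are all assumptions made for illustration.

```python
def pad_batch(sequences, max_len, pad_value=0):
    """Pad (or truncate) every sequence to max_len and return a matching 0/1 mask."""
    padded, mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_len]          # truncate anything longer than max_len
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


def bucket_by_length(sequences, boundaries):
    """Group sequences by the smallest boundary that fits them.

    Sequences longer than every boundary go in the largest bucket
    (they would be truncated when that bucket is padded).
    """
    ordered = sorted(boundaries)
    buckets = {b: [] for b in ordered}
    for seq in sequences:
        target = next((b for b in ordered if len(seq) <= b), ordered[-1])
        buckets[target].append(seq)
    return buckets


# Usage: bucket first, then pad each bucket only up to its own boundary.
data = [[5], [7, 8, 9], [1, 2, 3, 4, 5, 6]]
buckets = bucket_by_length(data, boundaries=[2, 4])
batches = {b: pad_batch(seqs, max_len=b) for b, seqs in buckets.items()}
```

The mask is what lets a loss function ignore the padded positions. Bucketing only reduces how much padding each batch carries, which is where the extra complexity comes from.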
Lack of authoritative examples for data ingestion.
The TensorFlow docs and examples focus on using several well-known academic datasets to demonstrate various features or functionality. This totally makes sense, and is a good thing to prioritize for general consumption. But real-world problems are rarely drop-in replacements for these kinds of datasets. Working with tensor inputs and shapes can be a real stumbling block when learning a new deep learning framework, so an example or two showing how to work with messy input data (weird shapes, padding, distributions, tokenization, etc.) could save a lot of pain for future developers/engineers.
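As a small taste of what such an example might cover, here is a hedged sketch of one messy-input step: turning raw text into integer IDs that could later be padded into a tensor. It is plain Python rather than anything from the TensorFlow docs, and the whitespace tokenizer, the special tokens, and the function names are assumptions made for illustration.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens; IDs 0 and 1 are reserved for them


def build_vocab(texts, min_count=1):
    """Map each token seen at least min_count times to an integer ID."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    # Most frequent tokens get the smallest IDs; ties are broken alphabetically.
    for tok, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of IDs, mapping unseen tokens to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in text.lower().split()]


# Usage: build the vocabulary on training text, then encode new text with it.
vocab = build_vocab(["the cat sat", "the dog"])
ids = encode("The bird sat", vocab)
```

Real data usually needs a better tokenizer than str.split, but the unknown-token handling and reserved padding ID are the parts that tend to trip people up when shapes stop lining up.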
Documentation can be inconsistent.
There are a number of good tutorials available for TensorFlow, and the code itself is very well commented (thank you, authors). But machine learning/deep learning is a deep and wide domain, and there is a lag between new functionality and docs/tutorials explaining how to build stuff. A few of our favorite tutorials are:
- Nathan’s GitHub repo of simple tutorials. It’s a quick way to see machine learning primitives at work. Start here if you’re familiar with numpy or Theano.
- Udacity course by Google’s Vincent Vanhoucke. Start here if you’re new to deep learning.
- The official MNIST tutorial. Go here after the Udacity course if you’re new to deep learning. MNIST is the “Drosophila of machine learning” and a good benchmark and sanity check.
- TensorFlow API documentation. Our go-to reference for stuff in TensorFlow. Control-F to find stuff!
Unfortunately, especially for RNNs, there are still conceptual gaps in the documentation and tutorials, such as the gap between the simple or trivial examples and the full-on state-of-the-art examples. This can be a real barrier for developers who are trying to learn the concepts at the same time as they are learning the framework. For example, the Udacity tutorials and the RNN tutorial using Penn TreeBank data to build a language model are very illustrative, thanks to their simplicity. They are good illustrations to learn a concept, but too basic for real-world modeling tasks.
The only other authoritative TensorFlow RNN tutorial that we’re aware of is a full-on seq2seq model using multi-cell RNNs (GRU or LSTM) with attention, bucketing, and sampled softmax. Woah! Just as you shouldn’t learn to ski by starting on the training hill then going straight to the top of the mountain to ride a double black diamond with trees and moguls (dangerous, and terrifying!?)…you probably shouldn’t go from the simplest implementations to the most complicated. Better to add complexity progressively, according to the problem you’re trying to solve.
High-quality tutorials that progressively ratchet up the complexity from simple RNN language models to something like plain seq2seq RNN encoder-decoder architecture that learns to reverse words, to a fancier neural translation seq2seq LSTM with attention, to something with multi-cell RNNs, bucketing and all the tricks would be extremely helpful to the nascent community of TensorFlow users. I suspect this lack of progressive examples might explain why the community has already reproduced many popular models in TensorFlow, but we haven’t seen many novel architectures or clever remixes yet.
Where documentation is lacking, look to the tests! Often the tests are more illuminating than the documentation anyway. Thanks to Google releasing the project as open source, you can search the GitHub repo for a relevant test to see how the authors do it.
We totally understand that the TensorFlow team is focusing on functionality and features first, and following thereafter with documentation…we’d probably do the same! Good docs are an investment, and the best docs I’ve seen are the result of someone who isn’t the author writing that documentation, because then you’re guaranteed that at least one fresh mind has understood the thing. It would be really cool if the TensorFlow community wrote documentation with as much urgency as they ask for new features!
We’re still waiting on the trace monitoring tool, EEG.
Heterogeneous resource utilization adds complexity.
This is a classic engineering tradeoff between control and simplicity: if you want fine-grained control over how your operations execute (e.g., which GPU node), then you need to maintain these constraints. In some cases, fine-grained control is necessary to maximize performance. For example, you can use multiple threads to fetch and pre-process a batch of data before feeding the GPU, so the GPU doesn’t wait on these operations. For more detail on using asynchronous runners on CPUs to feed GPUs, or to benchmark your own queues, see Luke’s excellent post, TensorFlow Data Input (Part 2): Extensions.
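The idea of preparing batches in the background so the consumer never waits can be sketched with nothing but the standard library. This is not TensorFlow's queue API: it uses a single producer thread and a bounded queue.Queue as a stand-in, and preprocess, producer, and prefetched_batches are hypothetical names.

```python
import queue
import threading


def preprocess(item):
    """Hypothetical stand-in for real per-example work (decoding, augmentation, etc.)."""
    return item * 2


def producer(items, batch_size, out_queue):
    """Build batches in a background thread and push them onto a bounded queue."""
    batch = []
    for item in items:
        batch.append(preprocess(item))
        if len(batch) == batch_size:
            out_queue.put(batch)   # blocks when the queue is full, bounding memory use
            batch = []
    if batch:
        out_queue.put(batch)       # final partial batch
    out_queue.put(None)            # sentinel: no more data


def prefetched_batches(items, batch_size, capacity=4):
    """Yield batches that were prepared ahead of time by the producer thread."""
    q = queue.Queue(maxsize=capacity)
    worker = threading.Thread(target=producer, args=(items, batch_size, q), daemon=True)
    worker.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch
    worker.join()


# Usage: the loop body is where the (GPU) training step would go.
consumed = [batch for batch in prefetched_batches(range(5), batch_size=2)]
```

The capacity argument is the knob worth benchmarking: too small and the consumer stalls, too large and you hold more pre-processed data in memory than you need.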
TensorFlow can hog a GPU.
Similarly, on startup, TensorFlow tries to allocate all available GPU memory for itself. This is a double-edged sword, depending on your context. If you are actively developing a model and have GPUs available to you in a local machine, you might want to allocate portions of the GPU to different things. However, if you are deploying a model to a cloud environment, you want to know that your model can execute on the hardware available to it, without unpredictable interactions with other code that may access the same hardware.
You can use something like the following snippet to put an upper limit on the GPU memory available to a given process, but if you have multiple GPUs on a machine, we’re not aware of a way to control allocation per GPU.
Set the option:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.5)
and pass it to your session as a config:
sess = tf.Session(config = tf.ConfigProto(gpu_options = gpu_options))
By default, Theano and TensorFlow can conflict.
We have a lot of code that depends on Theano, from loading data to various utility functions. We also read a lot of research code that was implemented in Theano. However, if you import Theano and TensorFlow in the same scope, they will compete to allocate GPU memory and bad things happen. To execute totally different environments on different GPUs (e.g., two GPUs running two separate models), you can restrict CUDA to see only certain devices, at the shell/environment level. Then when you launch your python code, it will only see (and allocate) the GPUs that CUDA can see. If you use bash, this will do the trick:
export CUDA_VISIBLE_DEVICES=0,1 # only the first two GPUs are usable
Note: the CUDA device numbers above might not be the same as the device IDs you see using
Alternatively, if you want Theano to execute only on CPU, which is probably what you want for those data and utility functions anyway, you can do it inline in Python. Here’s a short Python snippet to do just that. Put this at the top of your imports:
import os
os.environ['THEANO_FLAGS'] = "floatX=float32,device=cpu,fastmath=True,ldflags=-lopenblas"
Of course, you can inline the environment flags for CUDA too, but for my model development workflow, it is easier to remember “one GPU per shell”.
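For completeness, here is a minimal sketch of what inlining the CUDA flag might look like. The helper name restrict_visible_gpus is hypothetical; the one thing that matters is that the environment variable is set before any CUDA-using library is imported, because the device list is read when CUDA initializes.

```python
import os


def restrict_visible_gpus(device_ids):
    """Set CUDA_VISIBLE_DEVICES for this process and return the value that was set."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Usage: call this first, then import the framework that should see only GPU 0.
restrict_visible_gpus([0])
# import tensorflow as tf  # deliberately left commented out in this sketch
```

Passing an empty list sets the variable to an empty string, which hides all GPUs from the process.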
It takes a fair amount of effort to implement end-to-end workflows in any framework, and TensorFlow is no exception. Some things (queues, certain graph operations, resource allocation/context management, graph visualization) from TensorFlow are all relatively new to the deep learning scene and like many, we’re still learning the best ways to exploit these features. Other things have been available in other frameworks for some time. Even though the overall concept is similar, implementation details can differ. We appreciate all the effort Google developers have put into implementing good abstractions (e.g., streaming data from queues).
The best part of open tools is when someone from the community implements a really clever hack or novel way of solving a problem. Even though most folks are still climbing the learning curve with TensorFlow, I think the odds of that happening have gone up! Looking forward to the next epoch!