We visited the second annual RE•WORK Deep Learning conference in Boston earlier this month. In this debrief, I’m going to share my totally biased take on what was noteworthy from the conference. My observations will be in the context of trends in deep learning tech. As a bonus, since we were one of the few tech teams to attend both this year and last year, I’ll also share some brief comparisons of topics, audience, and applications.

Jump to:
Noteworthy
Progress on challenging problems
Best tech presentations
Compelling use cases
Other cool stuff
Trends, from last year to this year

Noteworthy

Startups and big corps

As data and the value of specific use cases are discovered, machine learning startups are focusing on industry verticals. I talked with the founder of a fashion startup that matches photos of people wearing brand-name clothing and accessories back to the brands' product pages; he's very interested in finding "images in images". A hedge fund was recruiting "the best and brightest" to apply machine learning methods to the task of analyzing market signals in satellite images and more. Our neighbors at the Harvard/Connectome project are automating the neuronal segmentation problem using neural nets and lots of electron microscopy images, in their quest to map out neural connections in the brain. As you might expect, startups were exploring all kinds of tech solutions, from convolutional neural networks to recurrent sequence models, hybrid architectures, GPU hardware, domain-specific data, and engineering concerns.

Big corporations were also represented, with a strong and noticeable focus on autonomous vehicles and IoT/connected sensors. These conversations focused more on adding features to established products, infrastructure required to process these rich data streams, and ways to prove value for new tech and features. For whatever reason, there was a distinct lack of small- and medium-sized businesses.

Progress on challenging problems

Cocktail party problem

The “cocktail party problem” is a signal processing problem where multiple speech signals are mixed in a single channel, and the challenge is to separate the individual components (i.e., speakers) from the mix. John Hershey from Mitsubishi Electric Research Labs talked through their solution using embedding vectors, then played samples that sounded really good! Speech is just one kind of noisy sequence; it could be fun to explore other signal separation problems with a similar method. Nice talk, John!
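
Hershey's approach learns embedding vectors for time-frequency bins and clusters them to form separation masks. As a toy illustration of the masking idea only (not his embedding method), here's a numpy sketch that pulls two tones out of a single-channel mixture with binary masks in the frequency domain:

```python
import numpy as np

def separate_two_tones(mix):
    """Toy source separation: split a two-tone mixture using binary
    masks in the frequency domain. Real systems estimate the masks
    with a learned model; here we just take the two dominant peaks."""
    spectrum = np.fft.rfft(mix)
    mag = np.abs(spectrum)
    peaks = np.argsort(mag)[-2:]  # the two strongest frequency bins
    outputs = []
    for p in peaks:
        mask = np.zeros_like(spectrum)
        mask[max(p - 2, 0):p + 3] = 1.0  # keep a narrow band around the peak
        outputs.append(np.fft.irfft(mask * spectrum, n=len(mix)))
    return outputs

# Mix a 440 Hz and a 1000 Hz tone in a single channel (1 s at 8 kHz).
sr = 8000
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 1000 * t)
s1, s2 = separate_two_tones(mix)
```

Real speech sources overlap heavily in frequency, which is exactly why a learned embedding is needed to decide which bins belong to which speaker; the masking machinery downstream, though, looks much like this.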

Facial emotion

Daniel McDuff showed a demo where facial emotions were detected from video, using a webcam aimed at the speaker’s face. It’s been known for some time that convolutional neural networks are well suited to solve the task of facial analysis using labeled observations, but there’s a big difference between testing an academic benchmark and deploying an industrial grade engineered solution to a problem that matters. Thanks to Affectiva for showing a good product!

Yoshua Bengio: the keynote

Obviously, it was a real treat to hear from one of the people who re-ignited interest in neural networks. To my eyes, the presentation was pretty much as expected, surveying the state of the art in neural networks. The Q&A at the end was particularly good; thanks to Prof. Bengio for leaving plenty of time for questions from the audience. Here is my favorite of his talks on the Internet, where he lays out the problems and theory underpinning much of deep learning (Link via Montreal Deep Learning Summer School 2015.)

Best Tech Presentations

Honglak Lee: disentangled representations

Lots of great computer vision stuff from Prof. Lee. In particular, Lee et al.’s work on weakly supervised disentangled representations was very interesting, and I look forward to seeing more advances here, especially (hopefully?) in the context of generative models!

Andrew McCallum: structured knowledge graphs + neural networks

Before deep learning was a thing, Prof. McCallum was instrumental in developing conditional random fields. He talked about a universal schema using structured knowledge bases, a neat take on helping models exploit “what is known” about the world to make better predictions. He also talked about traversing graph structures as a sequence, and feeding that to sequence models like LSTM/recurrent neural networks—a known tactic that probably doesn’t get enough attention given the amount of knowledge locked in knowledge graphs.
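
The graph-as-sequence tactic is easy to sketch: take random walks over the graph's triples and emit alternating entity/relation tokens, which an LSTM can then consume like any other sequence. A minimal version (the toy graph and helper below are my own illustration, not Prof. McCallum's pipeline):

```python
import random

def graph_walks(edges, n_walks=3, walk_len=5, seed=0):
    """Turn a knowledge graph (head, relation, tail triples) into
    training sequences for a sequence model by random-walking and
    emitting alternating entity/relation tokens."""
    rng = random.Random(seed)
    adj = {}
    for head, rel, tail in edges:
        adj.setdefault(head, []).append((rel, tail))
    walks = []
    for _ in range(n_walks):
        node = rng.choice(sorted(adj))  # start at a random entity
        walk = [node]
        for _ in range(walk_len):
            if node not in adj:         # dead end: no outgoing edges
                break
            rel, node = rng.choice(adj[node])
            walk += [rel, node]
        walks.append(walk)
    return walks

edges = [("Paris", "capital_of", "France"),
         ("France", "member_of", "EU"),
         ("Berlin", "capital_of", "Germany"),
         ("Germany", "member_of", "EU")]
```

Each walk reads like a sentence over the schema ("Paris capital_of France member_of EU"), which is what lets an off-the-shelf sequence model exploit the structure.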

Andrew Tulloch: deploying deep learning at Facebook

In a really solid technical talk, Andrew drew attention to some key technical issues that arise when deploying robust solutions using neural networks, from biases in data distributions to minimizing the memory overhead of convolutional neural networks at inference time. Because Facebook deploys models at a greater scale than most organizations, this was perhaps a glimpse into the future pain many machine learning engineers will encounter in the next few years. Fun facts: Facebook translates 414M documents, scores a billion news articles, and ranks trillions of ads…per day.

More generally, in science and engineering, there’s often a spectrum of reasons why a solution might be interesting, from “common-sense idea that just works” to “elegant idea”. Personally, I think novelty is overrated, and good engineering is too often overshadowed by academic algorithms and context-dependent benchmarks. It was refreshing to hear Andrew highlight a few practical and reliable paths to improving deployed models.

For example, ImageNet is a well-known dataset for supervised image classification, but the distribution of guitars in ImageNet (Figure 1) can be quite different than the distribution of guitars on social media (Figure 2).

Figure 1: Images randomly selected from the first page of results for ImageNet’s “guitar” synset. The images generally focus on a single kind of thing and have a clean background. If we ask a group of people “what is happening in this image?”, we expect everyone to give a similar answer, such as “it is a guitar” or “a person/baby playing a guitar”.


Figure 2: Images labeled as “guitar” randomly selected from social media (i.e., user-uploaded images). Compared to ImageNet, these images contain more complex backgrounds and multiple instances of objects and people. If we ask a group of people “what is happening in this image?” we might expect more variety in the answers: “it is an album cover”, “a guy trying out a guitar in the guitar store”, “a rock band”, “a living room with guitars in the corner”.

Differences in data distribution can present a real problem for someone who trains a model on ImageNet and expects to make predictions on real-world data (e.g., detecting guitars in user-submitted images). An easy fix is to merge and train on the combined set, but then you’ll incur the cost of labeling categories for a bunch of user images. For comparing algorithms on equal footing, benchmarks are important, but when it comes to building solutions for the real world, make sure your tech is built on data distributions that are representative of your use case.
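
One cheap, practical check in the spirit of Andrew's point: before trusting benchmark accuracy, compare the model's confidence distribution on benchmark-like data against production data. A population-stability-style sketch (the beta-distributed "confidences" below are simulated stand-ins, not real model outputs):

```python
import numpy as np

def shift_score(reference, production, bins=10):
    """Population-stability index: histogram two samples of model
    confidence scores on a shared grid and compare the distributions.
    Large values suggest production data no longer looks like the
    data the model was validated on."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(production, bins=edges)[0] / len(production)
    p, q = p + 1e-6, q + 1e-6  # avoid log(0) in empty bins
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
clean = rng.beta(8, 2, 5000)  # confident scores on benchmark-like images
messy = rng.beta(3, 3, 5000)  # hedged scores on social-media-like images
```

If `shift_score(clean, messy)` comes back large while a held-out split of the benchmark scores near zero, that's a strong hint you need representative training data, not a better algorithm.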

Jianxiong Xiao: robot vision using real 3D models

When we look at benchmarks for 2D visual classification, it can be tempting to think “problem solved” when models meet or exceed human accuracy. But the truth is we live in a 3D world, and understanding interactions between things in 3D is something any autonomous agent must do. Prof. Xiao illustrated very clearly how 3D robotic vision and planning are still major challenges, and despite hype about artificial intelligence, the state-of-the-art here is still (hilariously) lacking when a robot struggles to turn a doorknob.

Prof. Xiao’s team is using fully 3D convolutional neural networks, which look at multiple resolutions of input data to learn about the 3D environment. Using a volumetric 3D representation of the world is perhaps an obvious idea, but adding an extra dimension to the representation causes a huge increase in the number of parameters in a model, which can make training and generalization difficult. Thus, tactics like weight sharing across 3D spatial locations and multi-scale region proposals (see Figure 3) become even more critical. The Xiao lab seems to be cranking out good work, too, with big improvements since last year. Thanks for the good, clear talks, and keep it up!

Figure 3: Predicting 3D objects across multiple length scales. Image from: http://robots.princeton.edu/talks/2016_MIT/RobotPerception.pdf.
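
The weight-sharing point is worth making concrete: a single 3×3×3 kernel slid over a volume costs 27 parameters no matter how large the volume is; it's the volumetric input itself, not the shared weights, that drives the blow-up. A minimal (and deliberately slow) numpy sketch of a valid 3D convolution:

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution with a single shared kernel. Weight sharing
    keeps the parameter count at kernel.size (e.g. 3**3 = 27) regardless
    of input volume size."""
    k = kernel.shape[0]
    d, h, w = (s - k + 1 for s in volume.shape)
    out = np.zeros((d, h, w))
    for z in range(d):
        for y in range(h):
            for x in range(w):
                # Same kernel applied at every 3D location.
                out[z, y, x] = np.sum(volume[z:z+k, y:y+k, x:x+k] * kernel)
    return out
```

A multi-scale variant would simply run the same shared kernels over the volume at several resolutions and combine the resulting feature maps, which is roughly the shape of the region-proposal scheme in Figure 3.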

Compelling Use Cases

GumGum

These folks sell image-on-image advertising campaigns, like the ad you see at the top of this article.

Brands want to control how and when their ads are shown, both to optimize the performance of the ads (since the audience segment that views an image might be more likely to engage with the advertised product) and to protect the integrity of the brand (e.g., Disney might want to avoid images of people wearing only undergarments, while Calvin Klein might actively target them). But to give brands control over image context, you need a model that understands something about the context and semantics of the images used to define user context. GumGum uses convolutional neural networks to do this.

For example, a client wanted to market their product to a specific segment of women, and they wanted to find these viewers by targeting images with “bold lips”, like this image of Angelina Jolie wearing MAC’s Russian Red lipstick:

Figure 4: Angelina Jolie has “bold lips”. Image from: InStyle.

To the advertiser, the concept of “bold lips” is obvious, and it probably is to you too! As humans it is easy for us to focus on the “lips” part of a photo and evaluate the color there. But then GumGum’s pipeline kept returning images of a certain famous person:

Figure 5: Harry Styles has “bold lips” too. Image credit: Daily Mail.

Objectively speaking, is the model wrong to discover photos of Harry Styles with bold lips? Check out the GIF to compare! To my eyes, if we focus on just the lips in the images, then it is hard to “see” much difference aside from color (Harry doesn’t seem to be wearing lipstick). Using just the image data, there are many similarities!

Subjectively speaking, GumGum’s client didn’t want to spend their advertising budget to place ads in the context of Harry Styles images; they wanted to target images of women with bold lips. Thus, the model has also helped reveal hidden assumptions about data distributions. In this case, the client wanted to exclude the unexpected results, but we could imagine a situation where the unexpected results are very valuable. Neat use case for convnets!

Conservation Metrics: environmental surveys using distributed sensors + machine learning

My favorite use case of the day was from Conservation Metrics. Using distributed sensors (such as far-field microphones) and machine learning, they gather wildlife and environmental data to inform conservation efforts. Since 1970, we have lost half the wildlife on the planet, and there is no single cause; that makes it hard to determine the impact of any one factor in terms of species populations or the cost-effectiveness of interventions.

For example, a rare species of bird called Bryan’s Shearwater was previously believed to exist only on the Midway Islands, in the middle of the Pacific Ocean. Three individual birds were found there, but biologists were not able to locate any nests or breeding grounds. Birds are extremely mobile, so without knowing the location of breeding grounds, it is difficult to protect them. Conservation Metrics placed acoustic monitors around the islands and, using software to filter the recordings, picked out the first known calls of this bird. Expert biologists used this information to tune a machine learning model to detect the call of Bryan’s Shearwater (versus other birds). With the data and models, they were able to triangulate the nests and establish a conservation area around those sites to protect them. In a world where innovative tech diffuses slowly into new domains, it was really refreshing to see such a sophisticated end-to-end system deployed and working, with experts in the loop!
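
The triangulation step can be illustrated with a toy least-squares localization, assuming detections can be turned into rough range estimates to known sensor positions (this is a generic sketch, not Conservation Metrics' actual pipeline):

```python
import numpy as np

def locate(sensors, distances):
    """Least-squares source localization from range estimates to known
    sensor positions. Subtracting the first range equation from the
    others linearizes the problem:
        2*(p_i - p_0) . x = |p_i|^2 - |p_0|^2 - (d_i^2 - d_0^2)"""
    sensors = np.asarray(sensors, dtype=float)
    d = np.asarray(distances, dtype=float)
    p0, r0 = sensors[0], d[0]
    A = 2 * (sensors[1:] - p0)
    b = (np.sum(sensors[1:] ** 2, axis=1) - np.sum(p0 ** 2)) - (d[1:] ** 2 - r0 ** 2)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical monitor positions (km) and ranges to a calling bird.
sensors = [(0, 0), (4, 0), (0, 3), (5, 5)]
```

With three or more non-collinear monitors in 2D, the overdetermined system pins down the source; noisy real-world ranges just make the least-squares fit approximate rather than exact.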

Other cool stuff

Joseph Durham: optimizing robotic operations at Amazon

The biggest fleet of robots in the world might be the one running Amazon’s order fulfillment operations: thousands of (human) workers plus thousands of robots moving tens of thousands of pods which hold millions of products in warehouses around the world, like in this video. The problem of efficiently packing arbitrary items into a box is still a challenge for robots, so humans do that part. Interestingly, that means the robots’ task is to fetch items for human workers…not unlike a queue for multiprocessing in software. Organizing automation around the slow step is classic engineering, and it was great to see how it helps me get items shipped to my door in three days.
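
That producer/consumer structure maps directly onto a bounded queue: fast robot "producers" feed a buffer, and a single slow "packer" consumer sets overall throughput (the names and timings below are invented for illustration):

```python
import queue
import threading
import time

def run_fulfillment(n_items=20, n_robots=4):
    """Toy model of organizing automation around the slow step: several
    fast 'robot' producers feed a bounded buffer; one slow 'packer'
    consumer drains it, so throughput is governed by the packer."""
    buffer = queue.Queue(maxsize=8)
    packed = []

    def robot(items):
        for item in items:
            time.sleep(0.001)   # fetching a pod is fast
            buffer.put(item)    # blocks if the packer falls behind

    def packer():
        for _ in range(n_items):
            item = buffer.get()
            time.sleep(0.005)   # packing (the human step) is slow
            packed.append(item)

    chunks = [list(range(i, n_items, n_robots)) for i in range(n_robots)]
    threads = [threading.Thread(target=robot, args=(c,)) for c in chunks]
    threads.append(threading.Thread(target=packer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return packed
```

The bounded `maxsize` is the interesting design choice: it stops the fast side from racing ahead, which is exactly why adding more robots past a point buys nothing until the packing step speeds up.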

Spyros Matsoukas: speech detection with Amazon’s Echo/Alexa

Speech recognition is a classic problem which can be solved with hardware and/or software. Amazon’s Echo uses an array of far-field microphones, a hardware solution to make the downstream processing easier. But nine microphones per Echo, for each of the many Echo users, can send a lot of data! One of the things people don’t generally talk about re: “big data” is that most of the data you’re storing are crap and/or redundant. Amazon understands this and has engineered their speech processing pipelines to exploit redundancy, far-field acoustics, compression, and efficient models. Cool stuff.

Parsa Ghaffari: Aylien’s APIs for entity recognition

The founder of Aylien, another startup using machine learning to provide APIs, gave a good talk about how they think about problems like entity recognition using a variety of tech, including recurrent neural networks/LSTMs. Although they structure their API interfaces somewhat differently than indico, it is good for the world to have some diversity of choice here. Nice talk, Parsa!

Trends, from last year to this year

Race for talent is ramping up

Every non-academic speaker closed out their session with “by the way, we’re hiring”, but nobody sold it properly. To find top talent in a competitive market, you need to build relationships. If you won’t invest the effort it takes to build real relationships, or if folks don’t want to engage with you…surprise, you’re not getting top talent! As a bonus, our entire ecosystem wins when we invest in each other, regardless of hiring outcomes. Just like building friendships, it’s a strategy that never fails. There’s a lot of work to do, and huge impact to be made with machine learning tech. Let’s do it right and build relationships for the long term.

Hype overload? Not really

Given the media’s tendency to brand and exaggerate deep learning tech, I was half expecting to hear a bunch of talks promoting the speaker’s startup or brand. There were definitely a couple of those, but in general, speakers focused on tech and applications; there was a lot of good content and earnest conversations with other attendees.

More focus on domain-specific applications of deep learning

Last year, many speakers were clearly building on public datasets (ImageNet, COCO, etc) and discovering pros/cons of various algorithms. Since then, we’ve seen the state-of-the-art in image classification continue to improve (ResNets), many developments in recurrent networks, demonstrations that reinforcement learning works (AlphaGo), increased interest in generative models (DCGAN), and investments in hardware from nVidia and Google. Huge activity across the board! But specifically in the context of conference topics, the docket this year was about focused applications rather than academic benchmarks on general tasks, and I think this is a natural trend towards specialization that will continue in the years to come.

Metamind was acquired

Metamind had a strong presence at the last event but, following its recent acquisition by Salesforce, was absent from this one.

Beginnings of use cases in finance, cybersecurity, fashion

Use cases for deep learning models continue to emerge in new verticals. Last year we saw talks about facial and emotion detection using images; this year we had talks about detecting “bold lips” for ad targeting and hedge funds mining market signals from satellite imagery.

Custom ASICs and GPUs

Nervana Systems had a strong presence this year, and I really enjoyed talking with them about their custom deep learning ASIC (application-specific integrated circuit) to be released in 2017, and their deep learning software platform, Neon. In related news, Google recently announced that they have custom ASICs already in production, and nVidia continues to ramp up compute power of high-end consumer grade GPUs, so competition in the hardware space is getting interesting.

That’s it, hope you enjoyed the read! Looking forward to another year of innovation!
