Introduction to Machine Learning
In the summer of 1955, while planning a now-famous workshop at Dartmouth College, John McCarthy coined the term “artificial intelligence” to describe a new field of computer science. Rather than writing programs that tell a computer how to carry out a specific task, McCarthy proposed, he and his colleagues would instead pursue algorithms that could teach themselves how to do so. The goal was to create computers that could observe the world and then make decisions based on those observations — to demonstrate, that is, an innate intelligence.

The question was how to achieve that goal. Early efforts focused primarily on what’s known as symbolic AI, which tried to teach computers how to reason abstractly. But today the dominant approach by far is machine learning, which relies on statistics instead. Although the approach dates back to the 1950s — one of the attendees at Dartmouth, Arthur Samuel, was the first to describe his work as “machine learning” — it wasn’t until the past few decades that computers had enough storage and processing power for the approach to work well. The rise of cloud computing and customized chips has powered breakthrough after breakthrough, with research centers like OpenAI and DeepMind announcing stunning new advances seemingly every week.

The extraordinary success of machine learning has made it the method of choice for AI researchers and practitioners. Indeed, machine learning is now so popular that it has effectively become synonymous with artificial intelligence itself. As a result, it’s not possible to tease out the implications of AI without understanding how machine learning works — as well as how it doesn’t.


How Does Machine Learning Work?

The core insight of machine learning is that much of what we recognize as intelligence hinges on probability rather than reason or logic. 

If you think about it long enough, this makes sense. When we look at a picture of someone, our brains unconsciously estimate how likely it is that we have seen their face before. When we drive to the store, we estimate which route is most likely to get us there the fastest. When we play a board game, we estimate which move is most likely to lead to victory. Recognizing someone, planning a trip, plotting a strategy — each of these tasks demonstrates intelligence. But rather than hinging primarily on our ability to reason abstractly or think grand thoughts, they depend first and foremost on our ability to accurately assess how likely something is. We just don’t always realize that that’s what we’re doing.

However, back in the 1950s, McCarthy and his colleagues did realize it. And they realized something else too: computers should be very good at computing probabilities. The transistor was still less than a decade old at the time, and most computers still relied on bulky vacuum tubes. But it was clear even then that, with enough data, digital computers would be ideal for estimating a given probability. Unfortunately for the first AI researchers, their timing was a bit off. But their intuition was spot on — and much of what we now know as AI is built on it. When Facebook recognizes your face in a photo, or Amazon Echo understands your question, they’re relying on an insight that is over sixty years old.

The machine learning algorithm that Facebook, Google, and others all use is something called a deep neural network. Building on the prior work of Warren McCulloch and Walter Pitts, Frank Rosenblatt coded the first working neural network for machine learning in the late 1950s. Although today’s neural networks are a bit more complex, the main idea is still the same: the best way to estimate a given probability is to break the problem down into discrete, bite-sized chunks of information, or what McCulloch and Pitts termed a “neuron.” Their hunch was that if you linked a bunch of neurons together in the right way, loosely akin to how neurons are linked in the brain, then you should be able to build models that can learn a variety of tasks.
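
To make the idea concrete, here is a minimal sketch of a single artificial neuron in Python. It is not the exact formulation McCulloch and Pitts used, and the weights and threshold are made up for illustration, but the basic recipe is the same: weigh the inputs, add them up, and fire if the total crosses a threshold.

    # A minimal artificial "neuron": weighted inputs plus a threshold.
    # The weights and threshold below are illustrative, not learned.
    def neuron(inputs, weights, threshold):
        """Fire (return 1) if the weighted sum of the inputs crosses the threshold."""
        weighted_sum = sum(x * w for x, w in zip(inputs, weights))
        return 1 if weighted_sum >= threshold else 0

    # Example: a neuron that fires only when both of its inputs are active.
    print(neuron([1, 1], weights=[0.6, 0.6], threshold=1.0))  # -> 1
    print(neuron([1, 0], weights=[0.6, 0.6], threshold=1.0))  # -> 0

Link enough of these together, with each layer feeding the next, and you have the skeleton of a neural network.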

[image of convolutional neural network layers here]

To get a feel for how neural networks work, imagine you wanted to build an algorithm to detect whether an image contained a human face. A basic deep neural network would have several layers of thousands of neurons each. In the first layer, each neuron might learn to look for one basic shape, like a curve or a line. In the second layer, each neuron would look at the first layer, and learn to see whether the lines and curves it detects ever make up more advanced shapes, like a corner or a circle. In the third layer, neurons would look for even more advanced patterns, like a dark circle inside a white circle, as happens in the human eye. In the final layer, each neuron would learn to look for still more advanced shapes, such as two eyes and a nose. Based on what the neurons in the final layer say, the algorithm will then estimate how likely it is that an image contains a face.
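
In code, that stack of layers can be written down quite compactly. The sketch below uses the PyTorch library, which nothing above prescribes, and purely illustrative layer sizes; no real face detector is this small, but the shape of the computation is the same.

    import torch
    import torch.nn as nn

    # A toy stack of layers in the spirit of the description above: earlier
    # layers pick up simple patterns, later layers combine them into more
    # complex ones. All sizes here are illustrative.
    face_detector = nn.Sequential(
        nn.Flatten(),              # turn a 64x64 grayscale image into a vector
        nn.Linear(64 * 64, 512),   # layer 1: simple shapes (lines, curves)
        nn.ReLU(),
        nn.Linear(512, 256),       # layer 2: combinations (corners, circles)
        nn.ReLU(),
        nn.Linear(256, 128),       # layer 3: parts (eyes, noses)
        nn.ReLU(),
        nn.Linear(128, 1),         # final layer: evidence for "face"
        nn.Sigmoid(),              # squash the result into a probability
    )

    image = torch.rand(1, 1, 64, 64)   # a random stand-in for an image
    print(face_detector(image))        # estimated probability of a face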

The magic of deep learning is that the algorithm learns to do all this on its own. The only thing a researcher does is feed the algorithm a bunch of images and specify a few key parameters, like how many layers to use and how many neurons should be in each layer. The algorithm does the rest. It starts out by making a kind of educated guess about what types of information each neuron should look for, and then checks to see whether the guess actually worked. If not, the algorithm tries again with a different guess. If so, it uses what it’s learned to make an even more educated one. As the algorithm does this over and over again, eventually it “learns” what information to look for, and in what order.
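
That guess-check-adjust loop also fits in a few lines. In the sketch below (again PyTorch, with made-up images and labels standing in for real training data), the educated guess is the forward pass, the check is the loss, and the adjustment comes when the optimizer updates the weights.

    import torch
    import torch.nn as nn

    # A minimal sketch of the guess-check-adjust loop. The model, images,
    # and labels are all stand-ins, not a real face dataset.
    model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 64), nn.ReLU(),
                          nn.Linear(64, 1), nn.Sigmoid())

    images = torch.rand(32, 1, 64, 64)              # 32 fake grayscale "images"
    labels = torch.randint(0, 2, (32, 1)).float()   # fake face / no-face labels

    loss_fn = nn.BCELoss()                                    # how wrong was the guess?
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        guesses = model(images)          # make an educated guess
        loss = loss_fn(guesses, labels)  # check whether the guess actually worked
        optimizer.zero_grad()
        loss.backward()                  # trace the error back through the layers
        optimizer.step()                 # adjust the weights and guess again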

What’s remarkable about deep learning is just how flexible it is. Although there are other prominent machine learning algorithms too — albeit with clunkier names, like gradient boosting machines — none is as effective across as many domains. With enough data, deep neural networks will almost always do the best job of estimating how likely something is. As a result, they’re often the best at mimicking intelligence too.

Yet as with machine learning more generally, deep neural networks are not without limitations. To build their models, machine learning algorithms rely entirely on training data, which means both that they will reproduce the biases found in that data, and that they will struggle with cases not represented in it. (On the latter point, they tend to memorize more than they generalize.) Further, machine learning algorithms can also be gamed. If an algorithm is reverse engineered, it can be deliberately tricked into thinking that, say, a stop sign is a green light. Some of these limitations may be resolved with better data and algorithms, but others may be endemic to statistical modeling.
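
To see how such a trick works in principle, the sketch below nudges a random stand-in image in whatever direction most increases a toy classifier’s error, a technique known as the fast gradient sign method. It is not a real stop-sign attack, but real attacks follow the same logic.

    import torch
    import torch.nn as nn

    # A sketch of how a model can be "gamed": with access to its gradients,
    # you can nudge an input so the model's answer changes. The classifier
    # and image here are toys; the technique is the fast gradient sign method.
    model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 2))  # two classes
    image = torch.rand(1, 3, 32, 32, requires_grad=True)            # stand-in image
    true_label = torch.tensor([0])                                  # e.g. "stop sign"

    loss = nn.CrossEntropyLoss()(model(image), true_label)
    loss.backward()                 # how should the image change to fool the model?

    epsilon = 0.05                                     # a small, nearly invisible nudge
    adversarial = image + epsilon * image.grad.sign()  # push the image uphill on the loss
    print(model(adversarial).argmax())                 # the prediction may now flip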

To glimpse how the strengths and weaknesses of AI will play out in the real world, below is a brief description of the current state of the art across a variety of intelligent tasks.


Machine Learning Applications

Speech Recognition 

Ever since digital computers were invented, linguists and computer scientists have sought to use them to recognize speech and text. Known as natural language processing, or NLP, the field once focused on hardwiring syntax and grammar into code. However, over the past several decades machine learning has largely surpassed rule-based systems, thanks to everything from support vector machines to hidden Markov models to, most recently, deep learning. Apple’s Siri, Amazon’s Alexa, and Google’s Duplex all rely heavily on deep learning to recognize speech or text, and represent the cutting edge of the field.

[audio file for Google’s Duplex here?]

The specific deep learning algorithms at play have varied somewhat. Recurrent neural networks powered many of the initial deep learning breakthroughs, while hierarchical attention networks are responsible for more recent ones. What they all have in common, though, is that the higher levels of a deep learning network effectively learn grammar and syntax on their own. In fact, when several leading researchers recently set a deep learning algorithm loose on Amazon reviews, they were surprised to learn that the algorithm had not only taught itself grammar and syntax, but had also built a sentiment classifier.

Yet for all the success of deep learning at speech recognition, key limitations remain. The most important is that because deep neural networks only ever build probabilistic models, they don’t understand language in the way humans do; they can recognize that the letter sequences k-i-n-g and q-u-e-e-n are statistically related, but they have no innate understanding of what either word means, much less the broader concepts of royalty and gender. As a result, there is likely to be a ceiling on how intelligent speech recognition systems based on deep learning and other probabilistic models can ever be. If we ever build an AI like the one in the movie Her, it will almost certainly take a breakthrough well beyond what a deep neural network can deliver.
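
The king-and-queen point is easy to see in miniature. In the sketch below, each word is just a list of numbers, and “related” means nothing more than pointing in a similar direction; the vectors are invented for illustration, whereas real systems learn them from enormous amounts of text.

    import numpy as np

    # Word "embeddings": each word is a vector, and statistical relatedness is
    # just geometric closeness. These vectors are invented for illustration.
    vectors = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.7, 0.2]),
        "apple": np.array([0.1, 0.2, 0.9]),
    }

    def cosine_similarity(a, b):
        """Measure how closely two word vectors point in the same direction."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: related
    print(cosine_similarity(vectors["king"], vectors["apple"]))  # low: unrelated

Nothing in those numbers says anything about royalty or gender; the model only knows which words tend to appear in similar contexts.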


Image Recognition

When Rosenblatt first implemented his neural network in 1958, he set it loose on images of dogs and cats. AI researchers have been focused on tackling image recognition ever since. By necessity, much of that time was spent devising algorithms that could detect pre-specified shapes in an image, like edges and polyhedrons, using the limited processing power of early computers. Thanks to modern hardware, however, the field of computer vision is now dominated by deep learning instead. When a Tesla drives safely in autopilot mode, or when Google’s new augmented-reality microscope detects cancer in real time, it’s because of a deep learning algorithm.

Convolutional neural networks, or CNNs, are the variant of deep learning most responsible for recent advances in computer vision. Developed by Yann LeCun and others, CNNs don’t try to understand an entire image all at once, but instead scan it in localized regions, much the way a visual cortex does. LeCun’s early CNNs were used to recognize handwritten numbers, but today the most advanced CNNs, such as capsule networks, can recognize complex three-dimensional objects from multiple angles, even those not represented in training data. Meanwhile, generative adversarial networks, the algorithm behind “deep fake” videos, typically use CNNs not to recognize specific objects in an image, but instead to generate them.
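
The scanning idea is visible even in a very small convolutional network. In the PyTorch sketch below, each filter slides a 3x3 window across the image rather than looking at every pixel at once; the layer sizes and the ten output classes are illustrative, not taken from any particular system.

    import torch
    import torch.nn as nn

    # A toy convolutional network: each filter scans small local patches of
    # the image, and deeper layers see progressively larger patterns.
    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 filters scan 3x3 patches
        nn.ReLU(),
        nn.MaxPool2d(2),                              # summarize each neighborhood
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper filters, larger patterns
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),                    # scores for 10 illustrative classes
    )

    image = torch.rand(1, 3, 32, 32)   # a random stand-in for a 32x32 color image
    print(cnn(image).shape)            # -> torch.Size([1, 10])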