Here we introduce optical character recognition as a common benchmark task in modern machine learning, and show how to implement a simple model. Being able to experiment with machine learning models is the first step towards capability!
Scientific learning process –> machine learning process
In the previous post, we introduced machine learning as a principled way to implement the scientific thinking process using computers and data. In this post, we walk through preparing a dataset for training and evaluation, and also implement a simple model (nearest neighbor) to solve the optical character recognition task using MNIST data.
The machine learning process is a lot like the scientific learning process…
…if we replace “mental model” with “machine learning model”,
factor out the the process of observing errors (“loss”),
and use optimizers to update parameters towards a better model.
What’s the use of benchmarks?
For hundreds of years, scientists have appreciated how discoveries build upon the discoveries that came before, famously expressed by Isaac Newton:
“If I have seen further, it is by standing on the shoulders of giants.”
This is doubly true in the practice of machine learning. We build upon previous ideas and discoveries, and now with open source tools like TensorFlow, it has become easy to copy, share, and build upon an author’s actual implementations. This can be wonderful for quickly testing things, but high-level users risk losing touch with the details of implementation. There are two ways to protect yourself:
- Introspection. Learn the fundamentals of operation, and understand how your implementations work from the inside-out, by inspecting the code itself. The goal of introspection is to understand the purpose for each part of the system. This is the aim of these articles.
- External evaluation. Adopt a standard benchmark test, and evaluate your implementations against it. Anomalous results can reveal where an implementation has gone awry. Plus, when a community adopts a common standard, everyone benefits from having a reference implementation and knowing how it performs on a standard benchmark test.
Thus, benchmark tests help ensure you aren’t fooling yourself. They help you build intuition about what works and what doesn’t. They help you recognize some of the ways your system fails, so you can improve it. They help others put your innovations into context with what everyone else has done. And they are good practice even for seasoned practitioners :).
Optical character recognition is the task of reading an image of a written character and classifying it as the proper character.
The specific task of classifying images of handwritten numbers into one of ten digits [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] is particularly useful as a benchmark. But there are practical applications too: automated reading of U.S. postal codes from hand-written labels; scoring the responses from surveys, censuses, and clinical trials, reading address numbers; etc. Optical character recognition was the subject of much research in the 1990’s, but nowadays neural networks can classify handwritten digits more accurately than human readers.
A machine learning task is often coupled to a dataset, and datasets can be messy. Today you can simply download a clean version, but someone had to make the effort of assembling data, labeling each image with the known digit, and making it available to you for evaluation.
The U.S. National Institute of Standards and Technology has the mission of promoting innovation by maintaining standards to evaluate various technologies. To facilitate the evaluation of optical character recognition technology, in 1994, NIST released two reference datasets: Special Database 1 and Special Database 3. Each contained many binary (black-and-white) images of handwritten digits, along with labels of the true class. SD-1 was intended to be used for training, and SD-3 for evaluation.
However, as technology improved, Yann LeCun and colleagues noticed systematic weaknesses in the NIST SD-1 + SD-3 datasets. They released a modified version called “MNIST”. In addition to mixing SD-1 and SD-3 into better train/test sets, they also fixed a number of other “messy data” issues:
- Normalizing size to fit in a 20×20 pixel crop box while preserving aspect ratio
- Centering the 20×20 box inside a 28×28 field of view
- Antialiasing the black-and-white data into grayscale
- Guard against “data leakage” by ensuring that the handwriting of a given individual is not included in both training and test datasets. By exploiting data leakage, a system could theoretically learn to classify that writer’s handwriting style instead of classifying the digit itself, thereby subverting the task into a different one.
The result is a collection of images of 28×28 pixels, from approximately 250 different writers. Each image contains a view of a single handwritten digit, centered in the frame. Read more, see benchmark values, and download MNIST data: http://yann.lecun.com/exdb/mnist/.
Some example images could be more difficult to classify than others. So if you want to compare OCR systems, a fair evaluation would be performed on a consistent set of training and test data. There are many strategies for constructing train/test splits on a dataset, but for MNIST, the split has already been performed: 60,000 examples for training and 10,000 for testing. So assuming no shenanigans, you can compare the results of your system to anything else, using the standard train/test data.
Researchers have been publishing MNIST results for years, which is a big reason why MNIST is such a useful benchmark. This helps ensure we aren’t fooling ourselves, and establishes a common point of reference for metrics. Like any test, MNIST has issues, but considering all the factors, it is a very useful benchmark task.
There’s an official TensorFlow tutorial for reading MNIST data, because it is such a common thing to do. But when you are solving novel problems, you won’t have official examples to follow. So let’s do it from scratch (and you can always check your results against an official example). We’ll be stepping through a similar logic as in this TensorFlow contributor module.
Step 0. Obtain the data.
This step is (artificially) easy, thanks to LeCun et al. Go to the MNIST Database, and download the four files to your machine. When you are solving novel problems, this will almost certainly take much longer, and involve several iterations.
Step 1. Unzip and inspect the data.
In what format are the MNIST data? Using linux or OSX, we can inspect both the metadata and the contents of the data. What kind of file does the filesystem think this is?
file train-images.idx3-ubyte –> “data”
file train-labels.idx1-ubyte –> “data”
The filesystem has no idea. Let’s inspect the contents of a file:
head -n 3 train-images.idx3-ubyte –> jumbled mess
head -n 3 train-labels.idx1-ubyte –> jumbled mess
OK, when the shell prints a jumbled mess and
file returns “data” as the filetype, we’re probably dealing with binary data. For binary data, to decode it, we need to know how the author wrote bytes into the file. When all else fails, read the instructions!
Step 2. Parse the data.
After returning to the MNIST Database, and scrolling to the bottom, we see that MNIST data are “stored in the MSB first (high endian) format used by most non-Intel processors”. But I’m using an x86 Intel processor, and you probably are too. Do we need to write a custom binary parser just to read the MNIST data? Feel free to write from scratch if you enjoy that kind of thing, but this is a solved problem. The python package
scikit-data (http://jaberg.github.io/skdata/) is a library of datasets for machine learning, including MNIST, for this purpose.
If you don’t already have
scikit-data installed, install it using pip :
pip install skdata --user
Now we load the MNIST data into python lists, one for images and another for labels.
from skdata.mnist.views import OfficialVectorClassification view = OfficialVectorClassification() train_idxs = view.fit_idxs[:] dev_idxs = view.val_idxs[:] holdout_idxs = view.tst_idxs[:] images =  labels =  for idx in train_idxs: images.append(view.all_vectors[idx]) # image labels.append(view.all_labels[idx]) # label
We now have a list of images, let’s verify that images and labels are correct.
print(labels) >>> 5
Good, this is a feasible label.
When we print the 28×28 array, it looks like a “5”, which is exactly what we expect. Think about how quickly and effortlessly your human visual system processed those 784 pixels when they were presented as a visual array! (Note: if you are working in a python or ipython shell, your print width might not happen to show exactly 28 elements per row, which is what you need to view something like above. Fix that using
np.set_printoptions(linewidth=148). You might have to adjust the value 148 to whatever value shows 28×28 on your particular display.)
Convert these 8-bit integer values into color (or gray) values using a colormap, we can print the grayscale image representation. Since the images above are currently loaded as one long array of 784 pixels, to display an image, we also need to reshape those vectors into 28 rows x 28 columns. Finally, by convention, a pixel value of 0 is shown as black, but handwriting is usually black ink on white background, so we’ll invert the colormap to show a more natural image (using
cmap= gray_r instead of
cmap = gray).
from matplotlib import pyplot as plt plt.imshow(images.reshape((28,28)), cmap = "gray_r") plt.show()
Quick intuitive feedback is wonderful for rapid experimentation! Look at the image…is it a “5”? Yes! We have parsed the data. It is good practice to do sanity checks like this, as you develop your data input code.
Step 3. Decide on train / dev / evaluation splits.
Now we need to put the data into a good format for feeding a model. As a general guideline, you want to have enough development (dev) and hold-out data to do fair evaluations.
- Training data: the examples you feed your system to learn better parameters.
- Dev data: the examples you evaluate periodically during the training process, as a sanity check. Also gives you a view into how much your system may be overfitting the training data. Your models will not learn from these evaluations, but you probably will. Some people call this “test” set, others call this “validation” set. Avoid the ambiguity by calling it “dev” or “development” set.
- Hold-out data: what you set aside at the beginning and evaluate only rarely, and never “in the loop” while training a system. Often called “test” or “validation” set as well, but “hold-out” is unambiguous.
Using skdata, we apply the following splits: 0:50k for training set, 50k:60k for dev, and 60k:70k for evaluation.
Add a few more lines of code, to parse train/dev data (changes in red):
from skdata.mnist.views import OfficialVectorClassification view = OfficialVectorClassification() train_idxs = view.fit_idxs[:] dev_idxs = view.val_idxs[:] holdout_idxs = view.tst_idxs[:] train_images =  train_labels =  dev_images =  dev_labels =  for idx in train_idxs: train_images.append(view.all_vectors[idx]) # image train_labels.append(view.all_labels[idx]) # label for idx in dev_idxs: dev_images.append(view.all_vectors[idx]) dev_labels.append(view.all_labels[idx])
We won’t be working with the evaluation dataset here, but you should see how the code above makes it easy to add. You could also shuffle indices within a set.
A simple model: nearest neighbor
Now it is time to think about how we will represent data in a model, so that we can put data into the proper format for feeding the model. For the task of optical character recognition using MNIST data, this is pretty straightforward.
Images are arrays of pixel values. MNIST images are 28 x 28 = 784 pixels where each pixel is an unsigned 8-bit integer in the range (0,255). Color images have multiple channels (e.g., red, green, blue) but MNIST images are grayscale, so we need only consider a single channel. Because we want to operate efficiently and reliably, we’ll cast each image to a numpy array.
Labels are scalar values, indicating which of 10 possible classes the current example has been classified.
Let’s start simple, and be very explicit about how our scientific thinking process translates into machine learning process. Given a bunch of labeled MNIST images, how would you make predictions for any given image? (Go ahead and write down some pseudocode for your strategy.)
What is the simplest useful model? A model called “nearest neighbor” doesn’t even require training, because it has no parameters to learn. Given a query image, a nearest neighbor model finds the most similar image in the database, and guesses the same label for the query image.
define NearestNeighbor function: for each image in training set: compute distance between query and image add distance value to a list find the image with lowest distance (highest similarity) to the query get the label of that image return label
Implementing the pseudocode using python + numpy is straightforward once we define a distance (or similarity) function. Use anything you want here. We’ll keep it simple by using a sum-squared error function, which takes the difference at each pixel location and sums across all pixels. The image with the smallest total pixel differences will be the most similar.
import numpy as np from skdata.mnist.views import OfficialVectorClassification view = OfficialVectorClassification() train_idxs = view.fit_idxs[:] dev_idxs = view.val_idxs[:] holdout_idxs = view.tst_idxs[:] train_images =  train_labels =  dev_images =  dev_labels =  for idx in train_idxs: train_images.append(view.all_vectors[idx]) # image train_labels.append(view.all_labels[idx]) # label for idx in dev_idxs: dev_images.append(view.all_vectors[idx]) dev_labels.append(view.all_labels[idx]) def sse_distance(img1, img2): """ Computes gross measure (sum-squared error) of similarity across pixels. Assumes MNIST input data (uint8 values) """ img1 = np.array(img1, dtype = np.int32) # big squared values exceed int8 range img2 = np.array(img2, dtype = np.int32) diff = img2 - img1 squared_error = diff ** 2 return squared_error.sum() def nearest_neighbor(query_image): """ Given a query image, returns the label of most similar image in the training set, using some similarity function. """ distances = np.zeros(len(train_images), dtype = np.int32) for i, img in enumerate(train_images): distances[i] = sse_distance(query_image, img) most_similar_idx = np.argmin(distances) return train_labels[most_similar_idx]
Time to test it on an image from the dev set!
q = dev_images # pick one print(q) print(nearest_neighbor(q))
This gives “9” (the correct answer) on an image from the dev set.
Evaluation + metrics
We have a working model, so it is time to evaluate whatever metrics are relevant for measuring the performance of our model on this task. We’ll use accuracy, which is the fraction of predictions that were correct. Simple and interpretable.
pred = np.zeros(len(dev_images)) for i, query_image in enumerate(dev_images): pred[i] = nearest_neighbor(query_image) acc = np.zeros(len(dev_images)) for i, label in enumerate(dev_labels): if int(pred[i]) == int(dev_labels[i]): acc[i] = 1 accuracy = acc.sum() / len(acc) print("Accuracy: %.3f%%, evaluated on %d examples" % (accuracy * 100., len(acc)))
When evaluating on 1000 examples from the dev set, this nearest neighbors model gets 95.3% accuracy. That seems pretty good, right, for such a simple model? Perhaps. Let’s put it into context.
Always evaluate things in the proper context
For a 10-class classification task, assuming the classes are equally distributed, a random guess would be correct approximately 10% of the time. But that’s a pretty low bar! We could probably exploit something about the raw pixel distributions (e.g., count the number of white vs. black pixels) and guess much better than that.
The pixels in these images are far from random, so we should expect even the most naive methods to do much better than 10%. The MNIST data have been preprocessed quite a lot (centering, re-scaling, etc). This probably helps the raw pixel distance function look better than it should, in terms of accuracy on the preprocessed data.
How does 95.3% accuracy (4.7% error) compare to published methods? Go to http://yann.lecun.com/exdb/mnist/ and scroll down to the table, where you can see nearest neighbors methods from 1998 achieved approximately the same 5% error rate. Great! When other implementations (written by totally different people) give similar metrics, we can have some confidence that our method is performing as expected.
Using better distance functions and preprocessing can help cut the error rate of a nearest neighbors model to less than 2% — even better for more sophisticated preprocessing/data augmentation like deformable edges. Woohoo! Now you know how to build a model that achieves interesting results on MNIST as of 1998. 😉
If we extend the nearest neighbors model to consider some arbitrary number (k) of closest examples, then we end up with a model called “K-Nearest Neighbors”. This is a common machine learning model, used by itself or in combination with other methods.
It’s fun to compare results. But the most important thing is you now have a reproducible end-to-end system for doing experiments with data! Further improvements will be incremental from here on.
Running the code above, you will probably notice how the runtime is quite slow. This is for two reasons: (1) This code is written for clarity and hackability, not for efficiency. Vectorizing the code across multiple processors would give a nice speedup, but make it harder to understand. (2) The nearest neighbors lookup is wasting a bunch of computations for each query, and there is no “compression of knowledge” whatsoever. Just a bunch of distance lookups.
Because nearest neighbor models must compute a distance for every example in the dataset, the expense of making predictions scales with the number of examples in your dataset. This is OK for dataset sizes here, but becomes a real problem with >millions of examples. Two common solutions are: approximated distance query (known as “approximate nearest neighbors”), and parameterized models.
Understand the errors -> better hypothesis -> better model
We need a better model, and that starts with a better hypothesis. How does our nearest neighbor model fail? It fails when the pixels of the query image are closer to the wrong label than to the right one. This isn’t really surprising because some numbers can be hard to distinguish (e.g., 7 vs. 1) from the pixels alone. For example, a single anomalous 7 example (e.g., with a very short angled top stroke) could be the nearest neighbor for several “1” images. Maybe if our model had some understanding of which pixels were important for a given class, it would make better decisions? In the next article, we’ll learn how to add parameters to our model, and implement a training function to learn the best parameters for a given task and data.