Moving machine learning from practice to production
With growing interest in neural networks and deep learning, individuals and
companies are claiming ever-increasing adoption of artificial intelligence
in their daily workflows and product offerings.
Coupled with the breakneck speed of AI research, this new wave of popularity shows a
lot of promise for solving some of the harder problems out there.
That said, I feel that this field suffers from a gulf between appreciating these developments
and subsequently deploying them to solve "real-world" tasks.
A number of frameworks, tutorials and guides have popped up to democratize
machine learning, but the steps that they prescribe often don't align with the
fuzzier problems that need to be solved.
This post is a collection of questions, along with some (possibly even incorrect)
answers, that are worth thinking about when applying machine learning in
production.
Garbage in, garbage out
Do I have a reliable source of data? Where do I obtain my dataset?
Most tutorials that you encounter while starting out use well-defined datasets.
Whether it be MNIST, the
Wikipedia corpus or any of the great options from
the UCI Machine Learning Repository, these
datasets are often not representative of the problem that you wish to solve.
For your specific use case, an appropriate dataset might not even exist and
building a dataset could take much longer than you expect.
For example, at Semantics3, we tackle a number of ecommerce-specific problems
ranging from product categorization to product matching to search
relevance. For each of these problems, we had to look within and spend
considerable effort to generate high-fidelity product datasets.
In many cases, even if you possess the required data, significant (and expensive)
manual labor might be required to categorize, annotate and label your data for training.
Transforming data to input
What pre-processing steps are required? How do I normalize my data before using with my algorithms?
This is another step, often independent of the actual models, that is glossed over
in most tutorials. Such omissions appear even more glaring when exploring deep neural
networks, where transforming the data into usable "input" is crucial.
While there are standard techniques for images, like cropping, scaling,
zero-centering and whitening, the final decision on how much normalization each
task requires is still up to the individual.
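To make that concrete, here is a minimal NumPy sketch of scaling and per-channel zero-centering; the array shapes and the decision to stop at zero-centering (rather than whitening) are assumptions to revisit for your own task:

```python
import numpy as np

def preprocess_images(images, channel_mean=None):
    """Scale pixel values to [0, 1] and zero-center each channel.

    `images` is assumed to be an array of shape
    (num_images, height, width, channels); the mean is computed from
    the training set and reused at prediction time.
    """
    images = images.astype(np.float32) / 255.0        # scale to [0, 1]
    if channel_mean is None:
        channel_mean = images.mean(axis=(0, 1, 2))    # per-channel mean
    return images - channel_mean, channel_mean

# At training time, compute the mean once and keep it:
# train_x, channel_mean = preprocess_images(raw_train_images)
# At prediction time, reuse the same mean:
# test_x, _ = preprocess_images(raw_test_images, channel_mean=channel_mean)
```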
The field gets even messier when working with text. Is capitalization
important? Should I use a tokenizer? What about word embeddings? How big should
my vocabulary and dimensionality be? Should I use pre-trained vectors or start
from scratch or layer them?
There is no right answer applicable across all situations, but keeping abreast of
available options is often half the battle. A recent post
from the creator of spaCy details an interesting strategy to standardize
deep learning for text.
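To show how quickly these choices pile up, here is a bare-bones sketch of building a vocabulary and converting text to integer ids; the whitespace tokenization, lowercasing and 20,000-word vocabulary are illustrative defaults, not recommendations:

```python
from collections import Counter

def build_vocab(texts, vocab_size=20000, lowercase=True):
    """Build a word -> index mapping from raw texts.

    Index 0 is reserved for out-of-vocabulary words.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        counts.update(tokens)
    most_common = counts.most_common(vocab_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def texts_to_ids(texts, vocab, lowercase=True):
    """Convert texts into lists of integer ids, e.g. for an embedding layer."""
    out = []
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        out.append([vocab.get(tok, 0) for tok in tokens])
    return out
```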
Now, let's begin?
Which language/framework do I use? Python, R, Java, C++? Caffe, Torch, Theano, TensorFlow, DL4J?
This might be the question with the most opinionated answers. I am
including this section here only for completeness and would gladly point you to
the various other resources
available for making this decision.
While each person might have different criteria for evaluation, mine has
simply been ease of customization, prototyping and testing. In that aspect,
I prefer to start with scikit-learn where
possible and use Keras for my deep learning projects.
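As an example of what that prototyping looks like, here is a minimal scikit-learn baseline for a text-classification task; the toy product titles and categories are made-up stand-ins for your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for whatever your task provides, e.g. product titles
# and their categories.
texts = ["red running shoes", "4k ultra hd television", "wireless mouse"]
labels = ["apparel", "electronics", "electronics"]

# A simple bag-of-words baseline is often enough to sanity-check the
# problem before investing in anything deeper.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["bluetooth wireless keyboard"]))
```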
Further questions also arise: which technique should I use? Should I use deep
or shallow models? What about CNNs/RNNs/LSTMs? Again, there are a number of
resources to help make these decisions, and this is perhaps the most
discussed aspect when people talk about "using" machine learning.
Training models
How do I train my models? Should I buy GPUs, custom hardware, or EC2 (spot?) instances? Can I parallelize them for speed?
With ever-rising model complexity, and increasing demands on processing
power, this is an unavoidable question when moving to production.
A billion-parameter network might promise great performance with its
terabyte-sized dataset, but most people cannot afford to wait for weeks while
training is still in progress.
Even with simpler models, the infrastructure and tooling required for the build-up,
training, collation and tear-down of tasks across instances can be quite
daunting.
Spending some time on planning your infrastructure, standardizing setup and
defining workflows early on can save valuable time with each additional model
that you build.
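One small example of such a workflow decision: checkpointing weights every epoch so that an interrupted job (say, a pre-empted spot instance) can resume rather than restart. A rough Keras sketch, with a placeholder model and random data standing in for the real thing:

```python
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

# Toy data standing in for your real training set.
x_train = np.random.rand(1000, 20)
y_train = (x_train.sum(axis=1) > 10).astype(int)

model = Sequential([Dense(32, activation="relu", input_dim=20),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Saving weights after every epoch means a pre-empted spot instance
# (or a crashed job) can pick up where it left off.
checkpoint_path = "model_weights.h5"
if os.path.exists(checkpoint_path):
    model.load_weights(checkpoint_path)

model.fit(x_train, y_train, epochs=10, batch_size=32,
          callbacks=[ModelCheckpoint(checkpoint_path, save_weights_only=True)])
```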
No system is an island
Do I need to make batched or real-time predictions? Embedded models or interfaces? RPC or REST?
Your 99%-validation-accuracy model is not of much use unless it interfaces with
the rest of your production system. The decision here
is at least partially driven by your use-case.
A model handling a simple task might perform satisfactorily with its weights packaged
directly into your application, while more complicated models might require
communication with centralized heavy-lifting servers.
In our case, most of our production systems perform tasks offline in batches,
while a minority serve real-time predictions via JSON-RPC over HTTP.
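For the real-time path, a prediction service can be as small as the sketch below; the Flask app, route name and stub model are assumptions for illustration, not a description of our actual services:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In a real service, the trained model would be loaded at startup;
# a stub keeps this sketch self-contained.
class StubModel(object):
    def predict(self, items):
        return ["electronics" for _ in items]

model = StubModel()

@app.route("/rpc", methods=["POST"])
def rpc():
    """Handle a minimal JSON-RPC 2.0 'predict' call."""
    payload = request.get_json()
    if payload.get("method") != "predict":
        return jsonify({"jsonrpc": "2.0", "id": payload.get("id"),
                        "error": {"code": -32601, "message": "Method not found"}})
    result = model.predict(payload.get("params", []))
    return jsonify({"jsonrpc": "2.0", "id": payload.get("id"), "result": result})

if __name__ == "__main__":
    app.run(port=8000)
```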
Knowing the answers to these questions might also restrict the types of
architectures that you should consider when building your models. Building a
complex model, only to later learn that it cannot be deployed within your
mobile app is a disaster that can be easily avoided.
Monitoring performance
How do I keep track of my predictions? Do I log my results to a database? What about online learning?
After building, training and deploying your models to production, the task is
still not complete unless you have monitoring systems in place. A crucial component
to ensuring the success of your models is being able to measure and
quantify their performance. A number of questions are worth answering in this
area. How does my model affect the overall system performance? Which numbers do I
measure? Does the model correctly handle all possible inputs and scenarios?
Having used Postgres in the past,
I favor using it for monitoring my models. Periodically saving production
statistics (data samples, predicted results, outlier specifics) has proven
invaluable in performing analytics (and error postmortems) over deployments.
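A sketch of that kind of periodic logging is below; the connection string, table name and columns are assumptions, to be adapted to your own schema:

```python
import json
import psycopg2

# Illustrative connection details; adapt to your own setup.
conn = psycopg2.connect("dbname=ml_monitoring user=ml")

def log_prediction(model_name, input_sample, prediction, score):
    """Persist one prediction so it can be analysed (or post-mortemed) later."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """INSERT INTO predictions
                       (model_name, input_sample, prediction, score, created_at)
                   VALUES (%s, %s, %s, %s, now())""",
                (model_name, json.dumps(input_sample), prediction, score))

# Example: log_prediction("categorizer-v3", {"title": "red shoes"}, "apparel", 0.97)
```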
Another important aspect to consider is the online-learning requirement of
your model. Should your model learn new features on the fly? When hoverboards
become a reality,
should the product categorizer place them under Vehicles or Toys, or leave them
Uncategorized? Again, these are important questions worth debating when
building your system.
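If online learning does make sense for your system, scikit-learn's partial_fit interface is one lightweight way to support it; this sketch assumes a fixed, up-front list of classes and a feature pipeline defined elsewhere:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# partial_fit needs every possible class declared up front, which is
# itself a design decision: a brand-new category still needs either a
# placeholder class or a full retrain.
classes = np.array(["vehicles", "toys", "uncategorized"])
clf = SGDClassifier()

def learn_from_batch(features, labels):
    """Update the model incrementally as new labelled data arrives."""
    clf.partial_fit(features, labels, classes=classes)

# learn_from_batch(todays_feature_matrix, todays_labels) can then be
# called from a scheduled job without retraining from scratch.
```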
Wrapping it up
There is more to it than just the secret sauce
This post poses more questions than it answers,
but that was sort of the point really. With many advances
in new techniques and cells and layers and network architectures, it is
easier than ever to miss the forest for the trees.
Greater discussion about end-to-end deployments is required among
practitioners to take this field forward and truly democratize
machine learning for the masses.