Moving machine learning from practice to production
With growing interest in neural networks and deep learning, individuals and
companies are claiming ever-increasing adoption of artificial intelligence
in their daily workflows and product offerings.
Coupled with the breakneck speed of AI research, this new wave of popularity shows a
lot of promise for solving some of the harder problems out there.
That said, I feel that this field suffers from a gulf between appreciating these developments
and subsequently deploying them to solve "real-world" tasks.
A number of frameworks, tutorials and guides have popped up to democratize
machine learning, but the steps that they prescribe often don't align with the
fuzzier problems that need to be solved.
This post is a collection of questions, along with some (possibly even incorrect)
answers, that are worth thinking about when applying machine learning in
production.
Garbage in, garbage out
Do I have a reliable source of data? Where do I obtain my dataset?
Most tutorials that you encounter while starting out use well-defined datasets.
Whether it be MNIST, the
Wikipedia corpus or any of the great options from
the UCI Machine Learning Repository, these
datasets are often not representative of the problem that you wish to solve.
For your specific use case, an appropriate dataset might not even exist and
building a dataset could take much longer than you expect.
For example, at Semantics3, we tackle a number of ecommerce-specific problems
ranging from product categorization to product matching to search
relevance. For each of these problems, we had to look within and spend
considerable effort to generate high-fidelity product datasets.
In many cases, even if you possess the required data, significant (and expensive)
manual labor might be required to categorize, annotate and label your data for training.
Transforming data to input
What pre-processing steps are required? How do I normalize my data before using with my algorithms?
This is another step, often independent of the actual models, that is glossed over
in most tutorials. Such omissions appear even more glaring when exploring deep neural
networks, where transforming the data into usable "input" is crucial.
While there are standard techniques for images, like cropping, scaling,
zero-centering and whitening, the final decision on how much normalization each
task requires is still up to the individual.
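To make that concrete, here is a minimal NumPy sketch of scaling and per-channel zero-centering; the array shapes and the decision to stop at zero-centering (rather than whitening) are assumptions to revisit for your own task:

```python
import numpy as np

def preprocess_images(images, channel_mean=None):
    """Scale pixel values to [0, 1] and zero-center each channel.

    `images` is assumed to be an array of shape
    (num_images, height, width, channels); the mean is computed from
    the training set and reused at prediction time.
    """
    images = images.astype(np.float32) / 255.0        # scale to [0, 1]
    if channel_mean is None:
        channel_mean = images.mean(axis=(0, 1, 2))    # per-channel mean
    return images - channel_mean, channel_mean

# At training time, compute the mean once and keep it:
# train_x, channel_mean = preprocess_images(raw_train_images)
# At prediction time, reuse the same mean:
# test_x, _ = preprocess_images(raw_test_images, channel_mean=channel_mean)
```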
The field gets even messier when working with text. Is capitalization
important? Should I use a tokenizer? What about word embeddings? How big should
my vocabulary and dimensionality be? Should I use pre-trained vectors or start
from scratch or layer them?
There is no right answer applicable across all situations, but keeping abreast of
available options is often half the battle. A recent post
from the creator of spaCy details an interesting strategy to standardize
deep learning for text.
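To show how quickly these choices pile up, here is a bare-bones sketch of building a vocabulary and converting text to integer ids; the whitespace tokenization, lowercasing and 20,000-word vocabulary are illustrative defaults, not recommendations:

```python
from collections import Counter

def build_vocab(texts, vocab_size=20000, lowercase=True):
    """Build a word -> index mapping from raw texts.

    Index 0 is reserved for out-of-vocabulary words.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        counts.update(tokens)
    most_common = counts.most_common(vocab_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def texts_to_ids(texts, vocab, lowercase=True):
    """Convert texts into lists of integer ids, e.g. for an embedding layer."""
    out = []
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        out.append([vocab.get(tok, 0) for tok in tokens])
    return out
```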
Now, let's begin?
Which language/framework do I use? Python, R, Java, C++? Caffe, Torch, Theano, TensorFlow, DL4J?
This might be the question with the most opinionated answers. I am
including this section here only for completeness and would gladly point you to
the various other resources
available for making this decision.
While each person might have different criteria for evaluation, mine has
simply been ease of customization, prototyping and testing. In that aspect,
I prefer to start with scikit-learn where
possible and use Keras for my deep learning projects.
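As an example of what that prototyping looks like, here is a minimal scikit-learn baseline for a text-classification task; the toy product titles and categories are made-up stand-ins for your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for whatever your task provides, e.g. product titles
# and their categories.
texts = ["red running shoes", "4k ultra hd television", "wireless mouse"]
labels = ["apparel", "electronics", "electronics"]

# A simple bag-of-words baseline is often enough to sanity-check the
# problem before investing in anything deeper.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["bluetooth wireless keyboard"]))
```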
Further questions also arise: which technique should I use? Should I use deep
or shallow models? What about CNNs/RNNs/LSTMs? Again, there are a number of
resources to help make these decisions, and this is perhaps the most
discussed aspect when people talk about "using" machine learning.
Training models
How do I train my models? Should I buy GPUs, custom hardware, or EC2 (spot?) instances? Can I parallelize them for speed?
With ever-rising model complexity, and increasing demands on processing
power, this is an unavoidable question when moving to production.
A billion-parameter network might promise great performance with its
terabyte-sized dataset, but most people cannot afford to wait for weeks while
training is still in progress.
Even with simpler models, the infrastructure and tooling required for the build-up,
training, collation and tear-down of tasks across instances can be quite
daunting.
Spending some time on planning your infrastructure, standardizing setup and
defining workflows early on can save valuable time with each additional model
that you build.
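One small example of such a workflow decision: checkpointing weights every epoch so that an interrupted job (say, a pre-empted spot instance) can resume rather than restart. A rough Keras sketch, with a placeholder model and random data standing in for the real thing:

```python
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

# Toy data standing in for your real training set.
x_train = np.random.rand(1000, 20)
y_train = (x_train.sum(axis=1) > 10).astype(int)

model = Sequential([Dense(32, activation="relu", input_dim=20),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Saving weights after every epoch means a pre-empted spot instance
# (or a crashed job) can pick up where it left off.
checkpoint_path = "model_weights.h5"
if os.path.exists(checkpoint_path):
    model.load_weights(checkpoint_path)

model.fit(x_train, y_train, epochs=10, batch_size=32,
          callbacks=[ModelCheckpoint(checkpoint_path, save_weights_only=True)])
```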
No system is an island
Do I need to make batched or real-time predictions? Embedded models or interfaces? RPC or REST?
Your 99%-validation-accuracy model is not of much use unless it interfaces with
the rest of your production system. The decision here
is at least partially driven by your use-case.
A model handling a simple task might perform satisfactorily with its weights packaged
directly into your application, while more complicated models might require
communication with centralized heavy-lifting servers.
In our case, most of our production systems perform tasks offline in batches,
while a minority serve real-time predictions via JSON-RPC over HTTP.
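For the real-time path, a prediction service can be as small as the sketch below; the Flask app, route name and stub model are assumptions for illustration, not a description of our actual services:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In a real service, the trained model would be loaded at startup;
# a stub keeps this sketch self-contained.
class StubModel(object):
    def predict(self, items):
        return ["electronics" for _ in items]

model = StubModel()

@app.route("/rpc", methods=["POST"])
def rpc():
    """Handle a minimal JSON-RPC 2.0 'predict' call."""
    payload = request.get_json()
    if payload.get("method") != "predict":
        return jsonify({"jsonrpc": "2.0", "id": payload.get("id"),
                        "error": {"code": -32601, "message": "Method not found"}})
    result = model.predict(payload.get("params", []))
    return jsonify({"jsonrpc": "2.0", "id": payload.get("id"), "result": result})

if __name__ == "__main__":
    app.run(port=8000)
```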
Knowing the answers to these questions might also restrict the types of
architectures that you should consider when building your models. Building a
complex model, only to later learn that it cannot be deployed within your
mobile app is a disaster that can be easily avoided.
Monitoring performance
How do I keep track of my predictions? Do I log my results to a database? What about online learning?
After building, training and deploying your models to production, the task is
still not complete unless you have monitoring systems in place. A crucial component
to ensuring the success of your models is being able to measure and
quantify their performance. A number of questions are worth answering in this
area. How does my model affect the overall system performance? Which numbers do I
measure? Does the model correctly handle all possible inputs and scenarios?
Having used Postgres in the past,
I favor using it for monitoring my models. Periodically saving production
statistics (data samples, predicted results, outlier specifics) has proven
invaluable in performing analytics (and error postmortems) over deployments.
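A sketch of that kind of periodic logging is below; the connection string, table name and columns are assumptions, to be adapted to your own schema:

```python
import json
import psycopg2

# Illustrative connection details; adapt to your own setup.
conn = psycopg2.connect("dbname=ml_monitoring user=ml")

def log_prediction(model_name, input_sample, prediction, score):
    """Persist one prediction so it can be analysed (or post-mortemed) later."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """INSERT INTO predictions
                       (model_name, input_sample, prediction, score, created_at)
                   VALUES (%s, %s, %s, %s, now())""",
                (model_name, json.dumps(input_sample), prediction, score))

# Example: log_prediction("categorizer-v3", {"title": "red shoes"}, "apparel", 0.97)
```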
Another important aspect to consider is the online-learning requirement of
your model. Should your model learn new features on the fly? When hoverboards
become a reality,
should the product categorizer place them under Vehicles or Toys, or leave them
Uncategorized? Again, these are important questions worth debating when
building your system.
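If online learning does make sense for your system, scikit-learn's partial_fit interface is one lightweight way to support it; this sketch assumes a fixed, up-front list of classes and a feature pipeline defined elsewhere:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# partial_fit needs every possible class declared up front, which is
# itself a design decision: a brand-new category still needs either a
# placeholder class or a full retrain.
classes = np.array(["vehicles", "toys", "uncategorized"])
clf = SGDClassifier()

def learn_from_batch(features, labels):
    """Update the model incrementally as new labelled data arrives."""
    clf.partial_fit(features, labels, classes=classes)

# learn_from_batch(todays_feature_matrix, todays_labels) can then be
# called from a scheduled job without retraining from scratch.
```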
Wrapping it up
There is more to it than just the secret sauce
This post poses more questions than it answers,
but that was sort of the point really. With many advances
in new techniques and cells and layers and network architectures, it is
easier than ever to miss the forest for the trees.
Greater discussion about end-to-end deployments is required among
practitioners to take this field forward and truly democratize
machine learning for the masses.