Why You Shouldn’t Be a Data Science Generalist

I work at a data science mentorship startup,
and I’ve found there’s a single piece of advice that I catch myself
giving over and over again to aspiring mentees. And it’s really not what
I would have expected it to be.

Rather than suggesting a new library or tool, or some resume hack, I find myself recommending that they first think about what kind of data scientist they want to be.

The reason this is crucial is that data science isn’t a single,
well-defined field, and companies don’t hire generic, jack-of-all-trades
“data scientists”, but rather individuals with very specialized skill
sets.

To see why, just imagine that you’re a company trying to hire a data
scientist. You almost certainly have a fairly well-defined problem in
mind that you need help with, and that problem is going to require some
fairly specific technical know-how and subject matter expertise. For
example, some companies apply simple models to large datasets, some
apply complex models to small ones, some need to train their models on
the fly, and some don’t use (conventional) models at all.

Each of these calls for a completely different skill set, so it’s
especially odd that the advice that aspiring data scientists receive
tends to be so generic: “learn how to use Python, build some
classification/regression/clustering projects, and start applying for
jobs.”

Those of us who work in the industry bear a lot of the blame for
this. We tend to lump an excessive number of things into the “data
science” bucket in casual conversations, blog posts and presentations.
Building a robust data pipeline for production? That’s a “data science
problem.” Inventing a new kind of neural network? That’s a “data science
problem.”

That’s not good, because it tends to cause aspiring data scientists
to lose focus on specific problem classes, and instead become jacks of
all trades — something that can make it harder to get noticed or break
through, in a market that’s already saturated with generalists.

But it’s hard to avoid becoming a generalist if you don’t know which
common problem classes you could specialize in in the first place. That’s
why I put together a list of the five problem classes that are often
lumped together under the “data science” heading:

1. Data engineer

Job description: You’ll be managing data pipelines for
companies that deal with large volumes of data. That means making sure
that data is efficiently collected, retrieved from its source when
needed, cleaned and preprocessed.

Why it’s important: If you’ve only ever worked with
relatively small (<5 GB) datasets stored in .csv or .txt files, it
might be hard to understand why there are people whose full-time
job is to build and maintain data pipelines. Here are a couple of
reasons: 1) A 50 GB dataset won’t fit in your computer’s RAM, so you
generally need other ways to feed it into your model, and 2) that much
data can take a ridiculous amount of time to process, and often has to
be stored redundantly. Managing that storage takes specialized technical
know-how.

Requirements: The technologies you’ll be working
with include Apache Spark, Hadoop and/or Hive, as well as Kafka. You’ll
most likely need to have a solid foundation in SQL.

The questions you’ll be dealing with sound like:

→ “How do I build a pipeline that can handle 10 000 requests per minute?”

→ “How can I clean this dataset without loading it all in RAM?”
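To make that second question concrete, here is a minimal sketch of one common answer: stream the file one row at a time instead of loading it whole. It uses only Python’s standard csv module. The column names, the sample data and the cleaning rule (strip whitespace, drop rows with an empty field) are assumptions made up for illustration, not part of any particular pipeline.

```python
import csv
import io


def clean_rows(rows):
    """Yield cleaned rows one at a time, so only one row is held in memory."""
    for row in rows:
        # Strip surrounding whitespace; treat missing fields as empty strings.
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        # Hypothetical cleaning rule: drop any row that has an empty field.
        if all(cleaned.values()):
            yield cleaned


def clean_csv(source, sink):
    """Stream rows from source to sink and return the number of rows kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in clean_rows(reader):
        writer.writerow(row)
        kept += 1
    return kept


# Tiny in-memory stand-in for a file that would be too large for RAM.
raw = io.StringIO("user_id,country\n 1 , CA \n2,\n3,US\n")
out = io.StringIO()
kept = clean_csv(raw, out)
print(kept)
print(out.getvalue())
```

With a real dataset you would pass open file handles instead of the in-memory strings. Memory use stays roughly constant regardless of file size, because each row is read, cleaned and written before the next one is loaded.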

2. Data analyst

Job description: Your job will be to translate data
into actionable business insights. You’ll often be the go-between for
technical teams and business strategy, sales or marketing teams. Data
visualization is going to be a big part of your day-to-day.

Why it’s important: Highly technical people often
have a hard time understanding why data analysts are so important, but
they really are. Someone needs to convert a trained and tested model and
mounds of user data into a digestible format so that business
strategies can be designed around them. Data analysts help to make sure
that data science teams don’t waste their time solving problems that
don’t deliver business value.

Requirements: The technologies you’ll be working with include Python, SQL, Tableau and Excel. You’ll also need to be a good communicator.

The questions you’ll be dealing with sound like:

→ “What’s driving our user growth numbers?”

→ “How can we explain to management that the recent increase in user fees is turning people away?”

3. Data scientist

Job description: Your job will be to clean and explore
datasets, and make predictions that deliver business value. Your
day-to-day will involve training and optimizing models, and often
deploying them to production.

Why it’s important: When you have a pile of data
that’s too big for a human to parse, and too valuable to be ignored, you
need some way of pulling digestible insights from it. That’s the basic
job of a data scientist: to convert datasets into digestible
conclusions.

Requirements: The technologies you’ll be working
with include Python, scikit-learn, Pandas, SQL, and possibly Flask,
Spark and/or TensorFlow/PyTorch. Some data science positions are purely
technical, but the majority will require you to have some business
sense, so that you don’t end up solving problems that no one has.

The questions you’ll be dealing with sound like:

→ “How many different user types do we really have?”

→ “Can we build a model to predict which products will sell to which users?”

4. Machine learning engineer

Job description: Your job will be to build, optimize
and deploy machine learning models to production. You’ll generally be
treating machine learning models as APIs or components, which you’ll be
plugging into a full-stack app or hardware of some kind, but you may
also be called upon to design models yourself.

Requirements: The technologies you’ll be working
with include Python, JavaScript, scikit-learn, TensorFlow/PyTorch
(and/or enterprise deep learning frameworks), and SQL or MongoDB
(typically used for app DBs).

The questions you’ll be dealing with sound like:

→ “How do I integrate this Keras model into our JavaScript app?”

→ “How can I reduce the prediction time and prediction cost of our recommender system?”

5. Machine learning researcher

Job description: Your job will be to find new ways to
solve challenging problems in data science and deep learning. You won’t
be working with out-of-the-box solutions, but rather will be making your
own.

Requirements: The technologies you’ll be working
with include Python, TensorFlow/PyTorch (and/or enterprise deep learning
frameworks), and SQL.

The questions you’ll be dealing with sound like:

→ “How do I improve the accuracy of our model to something closer to the state of the art?”

→ “Would a custom optimizer help decrease training time?”

The five job descriptions I’ve laid out here definitely don’t stand
alone in all cases. At an early-stage startup, for instance, a data
scientist might have to be a data engineer and/or a data analyst, too.
But most jobs will fall more neatly into one of these categories than
the others — and the larger the company, the more these categories will
tend to apply.

Overall, the thing to remember is that in order to get hired, you’ll
usually be better off building a more focused skill set: don’t learn
TensorFlow if you want to become a data analyst, and don’t prioritize
learning PySpark if you want to become a machine learning researcher.

Think instead about the kind of value you want to help companies
build, and get good at delivering that value. That, more than anything
else, is the best way to get in the door.

MSBI TIPS - Collection of dailly notes

Tuesday, 25 December 2018