What are some software and skills that every data scientist should know?

Here are some skill sets I’ve found helpful, listed in no particular order:

1. Statistics (Theory and Applied Analytics)

This is the core of what you can offer as a data scientist. A company can always get a database developer or a software engineer. They usually can’t get a software engineer who knows their stats. Therefore, if you only know the basics of standard error, bootstrapping, confidence intervals, or Bayesian statistics, you’d better learn more.

I’d suggest picking up “All of Statistics” by Wasserman, and “An Introduction to Statistical Learning” by James/Witten/Hastie/Tibshirani. The latter is available free. For the less mathematically inclined, “Think Stats” and “Think Bayes” are decent, and both are free.
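To make the standard error, bootstrapping, and confidence interval ideas a bit more concrete, here’s a minimal sketch of a percentile bootstrap for the mean. The function name `bootstrap_ci`, its parameters, and the sample values are all made up for illustration; it uses only Python’s standard library and is one simple variant, not the only way to do it.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Hypothetical helper for illustration: resample the data with
    replacement, recompute the statistic each time, and read the
    interval off the sorted estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up measurements, purely for demonstration.
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 9.1, 10.9, 12.6, 11.0, 10.2]

mean = statistics.mean(sample)
# Classic standard error of the mean: sample std dev / sqrt(n).
std_err = statistics.stdev(sample) / len(sample) ** 0.5
low, high = bootstrap_ci(sample)

print(f"mean={mean:.2f}, standard error={std_err:.2f}")
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

With a sample this small the interval is rough, which is exactly the sort of thing the books above will teach you to reason about.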

2. Analytics Software (Descriptive / Inferential / Predictive Analytics)

The tools I mention are separated into “regular” and “big” data solutions. The “Big Data” solutions are currently mostly commercial products, or in-house systems built by “big data” engineers and data scientists.

For “regular” data, I’ve found that R and Python are the most useful; however, MATLAB, SPSS, and SAS are also OK. I’d suggest checking out RStudio and the Python scientific stack (see the Anaconda distribution).

There are many “Big Data” analytics platforms out there, such as Datameer and Tableau. Each is a little different and usually costs you money.

I’d start with learning R or Python since whichever company you work for may have their own internal “Big Data” system they built or some 3rd party tools they pay for.

3. Visualization Software (Descriptive / Exploratory Analytics and Data Presentation)

Note that visualization is often part of your analytics software (see #2). Again, the “big data” visualization tools will usually be 3rd party.

R is the big one, but MATLAB, SPSS, and SAS are also OK. I’d also suggest checking out Plotly and the Python scientific stack (see the Anaconda distribution). As a matter of fact, check out Plotly for sure–it’s really cool.

You can also look into some tools like D3 if you are more ambitious and want to build visualizations from scratch (well not totally from scratch, it’s just not as easy as using a tool like R).

4. Machine Learning (Exploratory / Predictive Analytics)

Regressions, clustering, dimensionality reduction, neural networks, decision trees, etc. These are what data scientists use to not only get insight from data, but also to create products or predictive systems used by their company. The core of this is using algorithms to gain insight from data, or to teach a machine to make decisions or predictions.
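As a tiny illustration of the “regressions” item, here’s a sketch of ordinary least squares with a single predictor, written with plain Python so the arithmetic is visible. The function name `fit_line` and the data points are invented for the example; in practice you’d reach for R or the Python scientific stack rather than hand-rolling this.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 6.0  # predict for an unseen x

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {prediction:.2f}")
```

The “insight” part is the slope; the “predictive system” part is plugging new inputs into the fitted line.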

Again, I’d suggest looking into the “All of Statistics” book listed above. I’d also suggest checking out “The Elements of Statistical Learning” by Hastie/Tibshirani/Friedman. It’s decent and free. Aside from that, the book “Pattern Classification” by Duda, Hart, and Stork is pretty much a standard. However, please note everyone’s going to have different recommendations for the best machine learning books.

Machine learning gets much deeper than this. There is a scientific methodology as well to finding and training a good learning algorithm, and new results are discovered all the time. It’s one of the hottest academic fields at the moment so results are pouring in.

Overall I’d say machine learning is one of those things you just have to keep chipping away at–continue to read new texts and papers.

5. Data Munging (Data Quality / Sampling Theory)

A lot of your time will be spent doing this. The joke is that it takes up 90% of your time, but I’ve found it’s closer to 40-60% depending on the problem.

Data munging usually requires knowing how to work with databases, how to scrape data from websites, how file and web formats work, how to format strings, and the best way to transform data. You need to know how to consume data, and how to process and create it.

Useful tools include SQL as well as command line tools like awk, sed, grep, curl, etc. I’d also say Python and R are useful here since you can make command line tools with them that might scrape a website or convert some file format into something you can use. You also need to know web conventions and formats like REST, JSON, etc.

The key is your answer to this question: “Can you get data from a variety of sources, join it to other data, clean it up, and then spit it out in a form that is readily usable for data analysis?” If you aren’t sure, learn more about it.
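Here’s a minimal sketch of that get, join, clean, and output loop using only Python’s standard library. The inputs (a CSV export and a JSON payload), the column names, and the helper names `clean_and_join` and `to_csv` are all hypothetical; real sources will be messier.

```python
import csv
import io
import json

# Hypothetical inputs: a CSV export with stray spaces and a missing
# value, and a JSON payload such as a web API might return.
ORDERS_CSV = """user_id,amount
1, 19.99
2,
3,5.00
1,7.50
"""
USERS_JSON = '[{"id": 1, "country": "us"}, {"id": 3, "country": " DE "}]'


def clean_and_join(orders_csv, users_json):
    """Join orders to users on user id, dropping rows with no amount."""
    # Normalize country codes while building the lookup table.
    countries = {
        user["id"]: user["country"].strip().upper()
        for user in json.loads(users_json)
    }
    rows = []
    for record in csv.DictReader(io.StringIO(orders_csv)):
        amount = record["amount"].strip()
        if not amount:
            continue  # skip orders with a missing amount
        user_id = int(record["user_id"])
        rows.append({
            "user_id": user_id,
            "amount": float(amount),
            "country": countries.get(user_id, "UNKNOWN"),
        })
    return rows


def to_csv(rows):
    """Write the cleaned rows back out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["user_id", "amount", "country"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = clean_and_join(ORDERS_CSV, USERS_JSON)
print(to_csv(rows))
```

Dropping the row with the missing amount is a choice made for this sketch; depending on the analysis you might impute or flag it instead.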

It’s also good to know some things about how your data is collected. This is where some statistics can come in handy. For example, if you assign people to A/B test groups using a coin flip but allow people to come back a second time, then your observations aren’t “independent”. This affects how you might do your analysis later, or you may want to split out repeat visitors into their own separate data set.
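Splitting out repeat visitors can be sketched in a few lines. The record layout (`user_id`, `group`) and the function name `split_repeat_visitors` are assumptions for illustration, and this assumes the visits are already in time order.

```python
def split_repeat_visitors(visits):
    """Separate each user's first visit from any later visits.

    Hypothetical helper: `visits` is a time-ordered list of dicts
    that each carry a "user_id" key.
    """
    seen = set()
    first_visits, repeat_visits = [], []
    for visit in visits:
        user_id = visit["user_id"]
        if user_id in seen:
            repeat_visits.append(visit)
        else:
            first_visits.append(visit)
            seen.add(user_id)
    return first_visits, repeat_visits


# Made-up A/B test log: user 7 comes back and lands in the other group.
visits = [
    {"user_id": 7, "group": "A"},
    {"user_id": 8, "group": "B"},
    {"user_id": 7, "group": "B"},
    {"user_id": 9, "group": "A"},
]

first_visits, repeat_visits = split_repeat_visitors(visits)
print(len(first_visits), "first visits;", len(repeat_visits), "repeat visits")
```

Analyzing only `first_visits` gives you one observation per person, which is closer to what most standard tests assume.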

6. Computer Science and Applied Mathematics (Engineering / Theory)

This supports your knowledge in every other thing you do related to working with computers and software.

Typical things to know at an undergrad level include algorithms, data structures, discrete math, and databases. Domain-specific things are operating systems, compilers, networking, etc.

You don’t need deep knowledge of the domain-specific topics outside your own domain, or really any knowledge at all. However, knowing how people solved a problem in, say, a compiler might help you solve a problem with how to parse and convert data.

I’d also say there’s some need for numerical analysis here. You need to understand why and when algorithms converge to solutions. For example, why does gradient descent work? Will my algorithm have a problem with round-off or truncation error?
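Both questions can be poked at with a few lines of Python. This is a sketch under simple assumptions: the objective f(x) = (x - 3)^2, the step sizes, and the function name `gradient_descent` are chosen for illustration, not taken from any particular library.

```python
import math


def gradient_descent(grad, x0, learning_rate, steps):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def grad(x):
    """Gradient of f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)


# Each step multiplies the error (x - 3) by (1 - 2 * learning_rate).
# With 0.1 that factor is 0.8, so the error shrinks toward zero.
converged = gradient_descent(grad, x0=0.0, learning_rate=0.1, steps=100)
# With 1.1 the factor is -1.2, so the error grows every step.
diverged = gradient_descent(grad, x0=0.0, learning_rate=1.1, steps=100)

print("small step:", converged)
print("large step:", diverged)

# Round-off: 0.1 has no exact binary representation, so naive
# summation accumulates error that math.fsum compensates for.
naive_total = sum([0.1] * 10)
careful_total = math.fsum([0.1] * 10)
print("naive sum equals 1.0?", naive_total == 1.0)
print("fsum equals 1.0?", careful_total == 1.0)
```

The same algorithm converges or blows up depending on one parameter, and the same sum is or isn’t exactly 1.0 depending on how you add it up. That’s the kind of thing numerical analysis explains.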

7. Data Engineering (Engineering)

This ties everything from “Data Munging” to “Analytics” together. You have to get data stored, accessible, processed, and moved from place to place.

This is where you figure out how to solve the engineering problems associated with the size of your data set. There are challenges related to moving it around and making it available for analysis.

You may have “Big Data” or “Regular Data”. One isn’t necessarily better than the other. As a matter of fact, if you know statistics, “big data” can be easier to make inferences or predictions from. However, from an engineering standpoint, “big data” is harder to process and move around. There are trade-offs.

These sorts of problems are what data scientists are paid to work around, and it’s still new. So far the go-to solutions seem to be either using a 3rd party proprietary platform or building something in house using open source technology.

Open source technology includes Hadoop, Mahout, MongoDB, MySQL, etc. To use any of those you usually need to learn some associated language. For example, Java with Hadoop/Mahout, or JSON-style query documents with MongoDB.

“Open source solutions” might also mean you have to write your own processing algorithms in a language of your choice, and REST APIs in PHP so that data can be consumed elsewhere.

3rd party solutions mostly require you to learn how to use the software. However, you usually benefit from knowing SQL, how columnar databases work, how NoSQL databases work, as well as general programming languages. Some examples here might be Datameer, Redshift, etc.
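Since SQL keeps coming up, here’s a minimal sketch of the join-and-aggregate pattern you’ll write constantly. It uses Python’s built-in `sqlite3` module with an in-memory database purely as a stand-in; the table names, columns, and values are invented, and the platforms named above each have their own dialects and quirks.

```python
import sqlite3

# In-memory SQLite database used only as a convenient stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    """
)

# Join orders to users, then total the order amounts per country.
totals = conn.execute(
    """
    SELECT u.country, SUM(o.amount) AS total
    FROM orders AS o
    JOIN users AS u ON u.id = o.user_id
    GROUP BY u.country
    ORDER BY u.country
    """
).fetchall()
conn.close()

for country, total in totals:
    print(country, total)
```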

8. Development Tools

This largely depends on your domain, but it includes things like IDEs, revision control software, unit testing libraries, and unit testing coverage libraries.
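If you haven’t used a unit testing library before, here’s roughly what one looks like, sketched with Python’s built-in `unittest`. The function under test, `normalize_country`, is a made-up example; your domain will have its own libraries and conventions.

```python
import unittest


def normalize_country(code):
    """Hypothetical function under test: tidy up a country code."""
    return code.strip().upper()


class NormalizeCountryTest(unittest.TestCase):
    def test_strips_and_uppercases(self):
        self.assertEqual(normalize_country(" de "), "DE")

    def test_already_clean_input_is_unchanged(self):
        self.assertEqual(normalize_country("US"), "US")


# Load and run the tests explicitly so this also works when pasted
# into a script or notebook rather than run via `python -m unittest`.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeCountryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```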

No matter the domain, however, GitHub is something you really should learn. It’s used almost everywhere.

9. Knowledge of software engineering techniques and writing quality code

If you can’t write quality code, you’ll piss off your fellow engineers and cause more work for yourself. Most shops subscribe to the Agile software development method, even if it rebrands some of the things older development shops already did. You may as well learn about it.

Code quality is subjective and depends on the language; Python, for example, favors some practices over others. It’s hard to make a definitive recommendation without knowing the language you work in.

A good place to start is “Understanding the GitHub Flow” since it ties into #8. However, every dev shop is a bit different and has its own philosophy, so learning to communicate with the dev team and adapting to how they do things is a must.
