How can I become a data scientist?

This question deserves considerable thought. At present, almost all university programs in data science are postgraduate (Master’s) degrees or certificate courses, and they presume prerequisites such as the fundamentals of computer science, network engineering, programming, and mathematics.

‘Data Science’ is generally considered to be a combination of the following disciplines:

1. Computer Science

2. Statistics

3. Domain Expertise

Hence the core subjects I recommend relate to these disciplines. I’ve gathered them from the corresponding courses at MIT, Stanford, and other open sources. Some are courses I’m taking myself to fill gaps in my understanding of these domains. All of them can be found online for free as MOOCs, open university courses, textbooks, or videos.

Computer Science

1. Introduction to Computer Science and Programming (with Python)

2. Computer System Engineering: This course covers topics on the engineering of computer software and hardware systems, techniques for controlling complexity; strong modularity using client-server design, virtual memory, and threads; networks; atomicity and coordination of parallel activities; recovery and reliability; privacy, security, and encryption; and impact of computer systems on society.

3. Computation Structures: An introduction to the engineering of digital systems. Starting with MOS transistors, the course develops a series of building blocks — logic gates, combinational and sequential circuits, finite-state machines, computers and finally complete systems in terms of both hardware and software.

4. Introduction to Algorithms: It covers the common algorithms, algorithmic paradigms, and data structures used to solve computational problems.

5. Artificial Intelligence: This course introduces students to the basic knowledge representation, problem solving, and learning methods of artificial intelligence.

6. Object-Oriented Programming with C/C++/Java

Mathematics and Statistics

1. Applied Mathematics: An introduction to discrete mathematics oriented toward computer science and engineering.

2. Introduction to Probability & Statistics (with R Programming): This course provides an elementary introduction to probability and statistics with applications. Topics include: Random variables, probability distributions, Bayesian inference, hypothesis testing, confidence intervals, and linear regression.

3. Linear Algebra (with R programming or another math tool): This course covers matrix theory and linear algebra.

4. Statistical / Machine Learning (with R programming): An introduction to the core algorithms of data science: linear and nonlinear regression; classification techniques such as logistic regression, Naïve Bayes, SVMs, and decision trees (plain decision trees, random forests, boosting); unsupervised learning methods such as clustering; and an introduction to neural networks.

5. Advanced Machine Learning (with Python programming): A specialization for students with a keener interest in Artificial Intelligence with a focus on neural networks for image/text processing.
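To give a flavour of how a tool supplements the theory in these courses, here is a minimal sketch of simple linear regression (one of the topics listed under Probability & Statistics) fitted by ordinary least squares. It is written in Python rather than R purely for illustration, uses only the standard library, and the function name `fit_line` and the toy data are my own made-up assumptions, not taken from any of the courses above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical toy data (made up for illustration): hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score ~ {intercept:.2f} + {slope:.2f} * hours")
```

Writing the formula out by hand once, before reaching for a library routine, is the kind of exercise that ties the tool back to the concept.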

Domain Expertise

Ideally these should be based on work interest or domain, so that every student chooses a domain to specialize in (e.g. web development, mobile application development, marketing analytics, supply chain, finance, or manufacturing).

Data Science Specific Courses

The core subjects herein should be:

1. Data Collection and Cleansing: This should include scraping data from the web using open source tools such as Python or R, connecting to databases, and so on. It should also cover data cleaning and ETL concepts such as de-duplication, consolidation, and missing data estimation techniques for creating the analytical data set.

2. Data Visualization and Reporting: Creating BI dashboards with tools like SAS/SAP or R/Python, and demonstrating insights through visualizations and data story presentations.

3. Data Science Applications 1 / 2: Complete end-to-end data science projects with a business focus. This subject should probably be taken twice in the final years. Most importantly, it should include connecting to real databases and deploying models in production, not merely ad hoc analysis on static data sets.

4. Advanced Data Computing: Here students should create projects with large-scale data using either open source or proprietary tools such as Hadoop/Spark, HANA, or other MPP databases.
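To make the cleansing concepts in the first subject above a little more concrete, here is a minimal sketch of de-duplication and missing data estimation in plain Python. The records, field names, and helper functions (`deduplicate`, `impute_mean`) are all hypothetical, and mean imputation is only the simplest of many estimation techniques; a real project would more likely use a data frame library and a method suited to the data.

```python
from statistics import mean

# Hypothetical raw records, standing in for rows pulled from a database or scraped page.
raw_rows = [
    {"customer_id": 1, "city": "Pune", "monthly_spend": 120.0},
    {"customer_id": 2, "city": "Delhi", "monthly_spend": None},   # missing value
    {"customer_id": 1, "city": "Pune", "monthly_spend": 120.0},   # duplicate record
    {"customer_id": 3, "city": "Mumbai", "monthly_spend": 80.0},
]


def deduplicate(rows, key):
    """Keep only the first row seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def impute_mean(rows, field):
    """Return copies of the rows with missing `field` values set to the observed mean."""
    observed = [row[field] for row in rows if row[field] is not None]
    if not observed:
        raise ValueError(f"no observed values for {field!r}")
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


# De-duplicate first so the repeated record does not bias the mean.
clean_rows = impute_mean(deduplicate(raw_rows, "customer_id"), "monthly_spend")
for row in clean_rows:
    print(row)
```

The order of the two steps matters: imputing before de-duplicating would count the repeated customer twice when computing the fill value.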

I would also include the following:

1. Fundamentals of Network Engineering: A graduate should understand computer networks well enough to work with, manage, and if needed improve the network and data architecture at an organization. Subjects would include network engineering, databases, and data warehousing.

2. Research Methodology: Students should be able to design projects systematically, from hypothesis generation through to business recommendations, using BOTH quantitative and qualitative methodologies.

3. Unstructured Data Analysis: Students should know the basics of text mining, natural language processing, social media mining, web mining, and similar applications. These could also be offered as electives.

It is interesting to note that great programs and universities do not have a tool focus. Ideally, any tool (R, SAS, SAP, Python, or others) should be taught as a supplement to theoretical concepts: for example, R programming with statistics and probability, Python with neural networks and other machine learning tasks, SAS VA or SAP Lumira with data visualization and reporting concepts, SQL with database concepts, and so on. This is one area in which many new data science programs are lacking, and as a result they produce graduates who are merely application developers or tool users rather than people able to solve real-world problems.
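As one small example of pairing a tool with a concept, here is a sketch of SQL taught alongside database ideas (schema, constraints, aggregation), run from Python with the standard library's sqlite3 module. The `orders` table, its columns, and the figures are invented for illustration; an in-memory database stands in for the real databases a course project would connect to.

```python
import sqlite3

# An in-memory SQLite database, used here as a stand-in for a real server.
conn = sqlite3.connect(":memory:")

# Concept: a schema with a primary key and NOT NULL constraints.
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " region TEXT NOT NULL,"
    " amount REAL NOT NULL)"
)

# Concept: parameterized inserts (placeholders rather than string formatting).
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# Concept: grouping and aggregation.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, total in totals:
    print(region, total)
```

The SQL itself is only a few lines; what the student should take away is why the table is keyed, why the inserts are parameterized, and what GROUP BY does to the rows.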

Hope this helps!


