Where to Start: Absolute Beginners

Never opened a coding program before?

Have no fear! There have never been more beginner-friendly resources.

Here’s a quick guide on how to get started:

First, confirm that you have a good grasp of statistics and probability, linear algebra, and project management. These skills are not essential to programming itself, but without them it is very difficult to build, debug, or interpret most data science models.

If you do need to learn (or refresh) some statistics, I recommend a course that combines theoretical knowledge with intro programming, such as the “EdX: Statistics and R” or “EdX: Probability and Statistics in Data Science using Python”. For other, more conceptual, courses on statistics and probability, check out the “Recommended Skills” section of this guide.

Remember, for EdX you can always “audit” a course for free, so don’t get intimidated by the price tags for certificates!
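To give a rough sense of what "statistics plus intro programming" looks like in practice, here is a minimal sketch in Python. It is not taken from either course; the temperature readings are made-up numbers for illustration, and it uses only Python's built-in `statistics` module.

```python
import statistics

# Hypothetical sample: daily temperatures in degrees Celsius (made-up numbers).
temperatures = [18.5, 20.1, 19.8, 22.4, 21.0, 19.2, 20.6]

# Three summary statistics that most intro courses cover early on.
mean_temp = statistics.mean(temperatures)      # arithmetic average
median_temp = statistics.median(temperatures)  # middle value once sorted
stdev_temp = statistics.stdev(temperatures)    # sample standard deviation

print(f"Mean:   {mean_temp:.2f}")
print(f"Median: {median_temp:.2f}")
print(f"Stdev:  {stdev_temp:.2f}")
```

Knowing what each of these numbers means, and when each one is misleading, is the statistics half; getting the computer to calculate them is the programming half.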

The other “Recommended Skills” are also helpful, though you can learn these simultaneously with programming. I particularly recommend learning more about “Research Methodology”, “Transparent and Open Source Research Methods”, and “Causal Identification Techniques”, especially if you are applying data science for research or impact assessment purposes.

Choose your Path: Python or R?

While there are dozens of programming languages that can be applied to data science, the two most popular are Python and R. Most data scientists have a “primary language” that they usually work in, though the good ones are familiar with multiple languages and can switch between them depending on project needs. For beginners, it’s best to stick with one language until you are comfortable with it, so that you don’t confuse the syntax and packages of the two.

Python is arguably the most common choice, as it can be used for a wide range of programming activities both within and outside of data science, though I’ve found it has a slightly steeper learning curve and its external packages are less beginner-friendly. If you want to learn data science as part of a larger interest in computer science and may want to learn web development, app building, and other skills later on, or you anticipate working with other programmers more familiar with these activities, choose Python.

R is more widely used among data specialists, including physical and social scientists, statisticians, and other “non-programmer” data scientists. It cannot interface with as many external applications as Python, but I’ve found it easier to learn as an amateur, and it has simple external packages backed by a large, dedicated community of users. If you want to learn data science to advance another skill set, such as research or business, and don’t plan on expanding into more “computer science” oriented programming, choose R.

Why use GitHub?

GitHub is by far the most popular platform for hosting code and managing programming projects, used by programmers of all stripes. It offers extensive version control and communication for teams or individuals, plus publication, presentation, and web hosting services. Even if you don’t anticipate learning more about programming or hosting a team project, knowing how to navigate and use GitHub is invaluable, as most other data scientists use it for their work. “Can you use GitHub?” is typically a make-or-break interview question, since teams and entire companies can host their digital library there.

Even if you don’t intend to work with teams or desire to publish open-source code, knowing Git, the version control system underpinning GitHub, can help you write more understandable, replicable code, which is quickly becoming a prerequisite for anyone to publish or share their results.