IBM estimates that we generate 2.5 quintillion bytes of data every day. In relative terms, this means 90% of the data in the world has been created in the last two years. Gartner projects that by 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage and about 4.4 million jobs will be created around big data. Although these estimates should not be interpreted in an absolute sense, they are a strong indication of the ubiquity of big data and of the need for analytical skills and resources: as the data piles up, managing and analyzing these data resources effectively become critical success factors in creating competitive advantage and strategic leverage. To address these challenges, companies are hiring data scientists. However, the industry holds strong misconceptions and disagreements about what constitutes a good data scientist. In this article, we discuss the key characteristics of a good data scientist. The discussion is based on the author's consulting and research experience, gained by collaborating with many companies worldwide on big data and analytics.
A data scientist should be a good programmer!
By definition, data scientists work with data. This involves many activities, such as sampling and preprocessing data, model estimation, and post-processing (e.g. sensitivity analysis, model deployment, backtesting, model validation). Although many user-friendly software tools are on the market nowadays to automate these activities, every analytical exercise requires tailored steps to tackle the specificities of a particular business problem. Performing these steps successfully requires programming. Hence, a good data scientist should possess sound programming skills in, for example, R, Python, or SAS. The programming language itself is not that important, as long as he/she is familiar with the basic concepts of programming and knows how to use them to automate repetitive tasks or perform specific routines.
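As an illustration of the kind of repetitive task mentioned above, the following is a minimal Python sketch of two common steps: drawing a hold-out sample and filling in missing values. The function names `split_holdout` and `impute_median`, the 30% hold-out fraction, and the choice of median imputation are assumptions made for this example only; they are not prescribed by the article, and a real project would tailor them to the business problem.

```python
import random
import statistics


def split_holdout(rows, holdout_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into training and hold-out sets.

    The fraction and seed are illustrative defaults, not recommendations.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


# Hypothetical usage on made-up data.
train, holdout = split_holdout(list(range(10)))
cleaned = impute_median([1, None, 3])
```

Wrapping such steps in small functions is one way to make them repeatable across analytical exercises, whichever language is used.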
A data scientist should have solid quantitative skills!
Obviously, a data scientist should have a thorough background in statistics, machine learning and/or data mining. The distinction between these disciplines is becoming increasingly blurred and is actually not that relevant. They all provide a set of quantitative techniques to analyze data and find business-relevant patterns within a particular context (e.g. risk management, fraud detection, marketing analytics). The data scientist should know which technique can be applied when, and how. He/she should not focus too much on the underlying mathematical (e.g. optimization) details but rather have a good understanding of what analytical problem a technique solves and how its results should be interpreted. It is also important to spend enough time validating the analytical results obtained, so as to avoid situations often referred to as data massage and/or data torture, whereby data is (intentionally) misrepresented and/or too much attention is given to spurious correlations.
When selecting the optimal quantitative technique, the data scientist should take into account the specificities of the business problem. Typical requirements for analytical models are: actionability (to what extent does the analytical model solve the business problem?), performance (what is the statistical performance of the analytical model?), interpretability (can the analytical model be easily explained to decision makers?), operational efficiency (how much effort is needed to set up, evaluate and monitor the analytical model?), regulatory compliance (is the model in line with regulation?) and economic cost (what is the cost of setting up, running and maintaining the model?). Based upon a combination of these requirements, the data scientist should be capable of selecting the best analytical technique to solve the business problem.
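The article does not say how the six requirements should be combined. One simple possibility, shown here purely as a sketch, is a weighted score per candidate technique. The function names, the 1 to 5 scoring scale, and every weight and score below are hypothetical values invented for illustration, not measurements or recommendations.

```python
# The six requirements listed in the article, used as dictionary keys.
REQUIREMENTS = [
    "actionability",
    "performance",
    "interpretability",
    "operational_efficiency",
    "regulatory_compliance",
    "economic_cost",
]


def weighted_score(scores, weights):
    """Weighted average of a technique's scores over all six requirements."""
    total_weight = sum(weights[r] for r in REQUIREMENTS)
    return sum(scores[r] * weights[r] for r in REQUIREMENTS) / total_weight


def rank_techniques(candidates, weights):
    """Return technique names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Hypothetical example: a setting where interpretability and compliance
# are weighted more heavily than raw statistical performance.
weights = {
    "actionability": 2, "performance": 1, "interpretability": 3,
    "operational_efficiency": 1, "regulatory_compliance": 3, "economic_cost": 1,
}
candidates = {
    # Scores on an assumed 1 (poor) to 5 (good) scale; made up for the example.
    "technique_a": {
        "actionability": 4, "performance": 3, "interpretability": 5,
        "operational_efficiency": 4, "regulatory_compliance": 5, "economic_cost": 4,
    },
    "technique_b": {
        "actionability": 4, "performance": 5, "interpretability": 2,
        "operational_efficiency": 3, "regulatory_compliance": 2, "economic_cost": 3,
    },
}
ranking = rank_techniques(candidates, weights)
```

In practice the weights would come from the business context, and a hard requirement such as regulatory compliance might act as a filter rather than as one term in an average.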