According to a recent study of LinkedIn profiles by RJMetrics, the number of data scientists has doubled over the last four years. This reflects the increasing demand for sophisticated data analysis skills, combining computer programming with statistics.
The study also found that over 11,000 data scientists are currently employed by companies worldwide, with Microsoft employing the most at 227.
Dubbed the "sexiest job of the 21st century" by the Harvard Business Review, a data scientist is essentially a mashup of traditional careers from data analysis, economics, statistics, computer science and others - and it definitely goes beyond collecting and analysing data.
To better understand this 'sexy' job, I spoke to Boston-based Vivek Gupta, Microsoft's senior data scientist, and Vivek Ravindran, Director, Data Insights Lead, Microsoft Asia Pacific, to get an insight into what a data scientist actually does and why the role is increasingly crucial in an organisation.
With an extensive background in computer science and machine learning visualisation, Gupta is a data scientist under Microsoft's Product division. He works with customers to understand their business problems and uses machine learning and predictive models on Azure Machine Learning to help solve them. Most recently, Gupta was part of Nokia's Smart Devices division, where he worked on applying data analytics solutions to problems in understanding user behavior around location information, photography and application usage.
"My role at Microsoft is twofold. Firstly, I work with customers and help them analyse their data, formulate their problem statement, and determine the value in that to them as an organisation; on top of the typical data science thing - which is spending 80 percent of the time looking at the data and figuring out what they (the customer) can do with it," said Gupta.
"The second aspect of my job is feature engineering. Feature engineering is fundamental to the application of machine learning, and it is about taking the existing data to potentially create new variables," he added.
He went on to cite the example of time series data from a machine. Data is generated every minute, but the minute-by-minute values may not be very valuable on their own. What matters for identifying patterns and trends is the average value over the last hour, and this enables the creation of a new variable, called the 'lag variable', which represents that last hour of data. Evaluating the data in this way determines how to create such new variables, which in turn makes it easier to work out which machine learning algorithm is the right one to use. Essentially, feature engineering is about using domain knowledge of the data to create features that make machine learning algorithms work.
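To make the hourly-average idea concrete, here is a minimal Python sketch of how such a feature might be derived from minute-by-minute readings. It is an illustration only, not Gupta's or Microsoft's actual method: the function name `hourly_average_feature`, the 60-reading window, and the choice to include the current reading and to average whatever is available during the first hour are all assumptions made for this example.

```python
from collections import deque
from statistics import fmean


def hourly_average_feature(readings, window=60):
    """Return one trailing-average value per reading.

    Each output value is the mean of the most recent `window` readings,
    including the current one. Until `window` readings have arrived,
    the mean is taken over the readings seen so far (an assumption
    made for this sketch).
    """
    recent = deque(maxlen=window)  # oldest reading drops out automatically
    feature = []
    for value in readings:
        recent.append(value)
        feature.append(fmean(recent))
    return feature


# Small demonstration with a 3-reading window so the arithmetic is easy to follow.
minute_values = [10.0, 12.0, 14.0, 16.0]
print(hourly_average_feature(minute_values, window=3))
# prints [10.0, 11.0, 12.0, 14.0]
```

The new column smooths out minute-level noise, which is the kind of derived variable the example describes. In practice the window length would be chosen using knowledge of the machine and the problem being solved.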