For data cleaning, data scientists usually use the two most common programming languages - R and Python. R is more of a statistical language, noted Gupta, while Python is a more general-purpose programming language that is very good at data manipulation. A good data scientist needs to be able to turn all sorts of data from various sources into something that can be used for machine learning, he added.
"If you look at the Internet of Things, there are sensors that have failed and this causes them to put out bad data. A skilled data scientist needs to be able to look at the data and tell if anything doesn't make sense - and then immediately remove it [to prevent the buildup of bad data]," said Gupta.
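As a rough illustration of the kind of check Gupta describes, the sketch below drops sensor readings that are missing or physically implausible. The temperature range, the function name `drop_implausible`, and the sample readings are all assumptions made up for this example; a real project would take its limits from the sensor's specification.

```python
import math

# Hypothetical plausible range for a temperature sensor, in degrees Celsius.
PLAUSIBLE_MIN_C = -40.0
PLAUSIBLE_MAX_C = 85.0


def drop_implausible(readings, low=PLAUSIBLE_MIN_C, high=PLAUSIBLE_MAX_C):
    """Return only the readings that are real numbers inside [low, high]."""
    cleaned = []
    for value in readings:
        # A failed sensor may report nothing at all.
        if value is None:
            continue
        # NaN compares false against everything, so skip it explicitly.
        if isinstance(value, float) and math.isnan(value):
            continue
        if low <= value <= high:
            cleaned.append(value)
    return cleaned


# Made-up readings: 999.0 and -273.0 stand in for output from a failed sensor.
raw = [21.4, 21.6, None, 999.0, 21.5, float("nan"), -273.0, 21.7]
cleaned = drop_implausible(raw)
print(cleaned)  # [21.4, 21.6, 21.5, 21.7]
```

A fixed range is the simplest possible rule. It catches readings that make no sense on their face, but not a sensor that is stuck on a plausible-looking value.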
On the other hand, data transformation refers to the process of changing the format of data so that it can be used by different applications. This may mean converting the data from the format in which it is stored into the format needed by the application that will use it.
Gupta added that these two processes - data cleansing and data transformation - are touted as the biggest and most important jobs of a data scientist, and must be done before any sort of deep analysis can be conducted on the data.
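A minimal sketch of such a transformation, assuming the data is stored as CSV text and the consuming application expects JSON. The column names `sensor_id` and `temperature_c` and the function `csv_to_json` are hypothetical, chosen only for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text with sensor_id and temperature_c columns to a JSON string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            "sensor_id": row["sensor_id"],
            # CSV stores everything as text, so convert the number explicitly.
            "temperature_c": float(row["temperature_c"]),
        })
    return json.dumps(records)


# Made-up stored data in the source format.
stored = "sensor_id,temperature_c\nA1,21.4\nB2,19.8\n"
print(csv_to_json(stored))
```

The content of the data is unchanged; only its representation differs, which is what separates transformation from cleaning.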
When embarking on a data science project, a data scientist first needs to gain an understanding of the problem that needs to be solved, as well as a high-level overview of the data landscape relevant to the problem's domain. Without a problem in mind, there is little sense in learning about the data. Likewise, there is no reason to treat a problem as a data science problem if one knows nothing about the data landscape. Being able to see how data is relevant to, or capable of solving, problems is one of the key skills of a data scientist, said Gupta.
Data scientists need to recognise that all data carries some form of meaning, and it is critical for them to understand that meaning. They have to look beyond the numbers, understand what they represent and try to gain valid insights from them, he added.
Once they understand the data and the problem at hand, they can match an algorithm to the problem and develop a meaningful solution.
Why data scientists are the 'next big thing'
For many tech companies, having a data scientist is a major competitive advantage. For instance, Microsoft's data scientists contribute to the math that makes sure that products like its Cortana digital personal assistant get smarter over time.