You track or acquire the factors that matter
Even if you have gobs of data and plenty of data scientists, you may not have data for all the relevant variables. In database terms, you may have plenty of rows but be missing a few columns. Statistically, you may have unexplained variance.
Measurements for some independent variables such as weather observations are easily obtained and merged into the dataset, even after the fact. Other factors may be difficult, impractical or expensive to measure or acquire, even if you know what they are.
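Merging an easily obtained factor after the fact can be sketched as a simple join on a shared key. The example below is a minimal illustration, not a prescribed method: the field names (`date`, `units`, `temp_c`, `rain_mm`) and the sample values are hypothetical, and it assumes each row carries a date that matches the weather observations.

```python
from datetime import date

def merge_weather(rows, weather_by_date):
    """Left-join weather columns onto each row using its 'date' key.

    Rows with no matching observation keep None for the weather fields,
    so the gap stays visible instead of being silently dropped.
    """
    merged = []
    for row in rows:
        obs = weather_by_date.get(row["date"], {})
        merged.append({
            **row,
            "temp_c": obs.get("temp_c"),
            "rain_mm": obs.get("rain_mm"),
        })
    return merged

# Hypothetical existing dataset: one row per day.
sales = [
    {"date": date(2017, 3, 1), "units": 120},
    {"date": date(2017, 3, 2), "units": 95},
    {"date": date(2017, 3, 3), "units": 130},
]

# Hypothetical weather observations acquired later, keyed by date.
weather = {
    date(2017, 3, 1): {"temp_c": 11.0, "rain_mm": 0.0},
    date(2017, 3, 2): {"temp_c": 7.5, "rain_mm": 12.4},
}

merged = merge_weather(sales, weather)
for row in merged:
    print(row)
```

The third row has no matching observation, so its weather fields come back as None; that is the "missing column" problem in miniature.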
Let’s use a chemical example. When you’re plating lead onto copper, you can measure the temperature and concentration of the fluoroboric acid plating bath and record the voltage across the anodes, but you won’t get good adherence unless the bath contains enough peptides, and not too many. If you didn’t weigh the peptides you put into the bath, you won’t know how much of this critical catalyst is present, and you’ll be unable to explain the variations in plate quality using the other variables.
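The effect of an untracked factor can be illustrated with a small simulation. This is a sketch on synthetic data, not a model of any real plating bath: the variable names, the coefficients, and the linear relationship are all assumptions chosen only to show how a missing column turns up as unexplained variance.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(y, predicted):
    """Fraction of the variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic data (assumed): quality depends on a tracked variable
# (temperature) and on an additive amount that nobody recorded.
rng = random.Random(0)
n = 200
temperature = [rng.uniform(20, 30) for _ in range(n)]
additive = [rng.uniform(0, 1) for _ in range(n)]
quality = [0.5 * t + 10 * a + rng.gauss(0, 0.2)
           for t, a in zip(temperature, additive)]

# Model built from the tracked variable only.
slope, intercept = fit_line(temperature, quality)
predicted = [slope * t + intercept for t in temperature]
r2_without_additive = r_squared(quality, predicted)

# What is left over lines up with the factor that was not recorded.
residuals = [q - p for q, p in zip(quality, predicted)]
slope_r, intercept_r = fit_line(additive, residuals)
r2_residuals_vs_additive = r_squared(
    residuals, [slope_r * a + intercept_r for a in additive])

print(f"R^2 using temperature only: {r2_without_additive:.2f}")
print(f"R^2 of residuals vs. the missing factor: {r2_residuals_vs_additive:.2f}")
```

In a real dataset you would not have the missing column to check the residuals against, which is exactly why the variation stays unexplained.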
You have ways to clean and transform the data
Data is almost always noisy. Measurements may be missing one or more values, individual values may be out of range by themselves or inconsistent with other values in the same measurement, electronic measurements may be inaccurate because of electrical noise, people answering questions may not understand them or may make up answers, and so on.
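The kinds of problems listed above (missing values, out-of-range values, and values that are inconsistent with each other) can each be checked with a simple rule. The sketch below is illustrative only: the field names `temp_min_c` and `temp_max_c`, the valid range, and the sample readings are hypothetical, and real filters are usually far more domain-specific.

```python
def find_problems(record, fields=("temp_min_c", "temp_max_c"),
                  valid_range=(-60.0, 60.0)):
    """Return a list of data-quality problems found in one record."""
    low, high = valid_range
    problems = []
    for field in fields:
        value = record.get(field)
        if value is None:
            problems.append(f"{field} is missing")
        elif not low <= value <= high:
            problems.append(f"{field} is out of range")
    # Cross-field consistency check, only meaningful if both values are usable.
    if not problems and record["temp_min_c"] > record["temp_max_c"]:
        problems.append("temp_min_c exceeds temp_max_c")
    return problems

# Hypothetical readings: one good, one missing, one out of range,
# one internally inconsistent.
readings = [
    {"id": 1, "temp_min_c": 4.0, "temp_max_c": 12.5},
    {"id": 2, "temp_min_c": None, "temp_max_c": 9.0},
    {"id": 3, "temp_min_c": 3.0, "temp_max_c": 412.0},
    {"id": 4, "temp_min_c": 15.0, "temp_max_c": 8.0},
]

clean, flagged = [], []
for reading in readings:
    problems = find_problems(reading)
    if problems:
        # Keep the dirty record and the reason, rather than discarding it.
        flagged.append((reading, problems))
    else:
        clean.append(reading)

print("clean ids:", [r["id"] for r in clean])
for reading, problems in flagged:
    print("flagged id", reading["id"], "->", problems)
```

Flagged records are kept alongside the reasons they failed, so the rules can be revised later without having lost the raw data.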
The data filtering step in any analysis process often takes the most effort to set up; in my experience, it accounts for 80% to 90% of total analysis time. Some shops clean up the data in their ETL (extract, transform, and load) process so that analysts never see bad data points, but others keep all data in the data warehouse or data lake and use an ELT (extract, load, and transform) process, with the transform step at the end. That means that even clearly dirty data is saved, on the theory that the filters and transformations will need to be refined over time.
Even accurate, filtered data may need to be transformed further before you can analyze it well. Like statistical methods, machine learning models work best when there are similar numbers of rows for each possible state, which may mean reducing the number of rows for the most common states by random sampling. Again as with statistical methods, ML models work best when the ranges of all variables have been normalized.
For example, an analysis of Trump and Clinton campaign contributions done in Cortana ML shows how to prepare a dataset for machine learning by creating labels, processing data, engineering additional features and cleaning the data; the analysis is discussed in a Microsoft blog post. This analysis does several transformations in SQL and R to identify the various committees and campaign funds as being associated with Clinton or Trump, to identify donors as probably male or female based on their first names, to correct misspellings, and to fix the class imbalance (the data set was 94% Clinton transactions, mostly small donations). I showed how to take the output of this sample and feed it into a two-class logistic regression model in my “Get started” tutorial for Azure ML Studio.
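The two general transformations described above, balancing the classes by random sampling and normalizing variable ranges, can be sketched in a few lines. This is not the code from the analysis cited above, which used SQL and R; it is a minimal illustration on synthetic rows, and the function names, the `label` and `amount` fields, and the choice of min-max scaling are all assumptions.

```python
import random
from collections import Counter, defaultdict

def downsample(rows, label_key, seed=0):
    """Randomly sample every class down to the size of the smallest class."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - low) / (high - low) for v in values]

# Synthetic, heavily imbalanced dataset: 94 rows of one class, 6 of the other.
rows = (
    [{"label": "majority", "amount": float(i)} for i in range(1, 95)]
    + [{"label": "minority", "amount": 1000.0 + i} for i in range(6)]
)

balanced = downsample(rows, "label")
counts = Counter(row["label"] for row in balanced)
scaled = min_max_normalize([row["amount"] for row in balanced])

print("rows per class after sampling:", dict(counts))
print("scaled range:", min(scaled), "to", max(scaled))
```

Downsampling throws away rows from the larger class, so it suits cases where data is plentiful; with scarce data, other rebalancing approaches may be a better fit.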
You've already done statistical analyses on the data