Rather than set up your own infrastructure, you can use machine learning services in the cloud to build data models. With cloud services you do not need to install a range of tools. Moreover, these services build in more of the expertise needed to get successful results.
Amazon Machine Learning offers several machine learning models you can use with data stored in S3, Redshift or R3, but you can’t export the models, and the training set size is rather limited. Microsoft’s Azure ML Studio has a wider range of algorithms, including deep learning, plus R and Python packages, and a graphical user interface for working with them. It also offers the option to use Azure Batch to periodically load extremely large training sets, and you can use your trained models as APIs to call from your own programs and services. There are also machine learning features such as image recognition built into cloud databases like SQL Azure Data Lake, so that you can do your machine learning where your data is.
Many machine learning techniques use supervised learning, in which a function is derived from labelled training data. Developers choose and label a set of training data, set aside a proportion of that data for testing, and score the results from the machine learning system to help it improve. The training process can be complex, and results are often probabilities, with a system being, for example, 30 percent confident that it has recognized a dog in an image, 80 percent confident it’s found a cat, and maybe even 2 percent certain it’s found a bicycle. The feedback developers give the system is likely a score between one and zero indicating how close the answer is to correct.
It’s important not to train the system too precisely to the training data; that’s called overfitting and it means the system won’t be able to generalize to cope with new inputs. If the data changes significantly over time, developers will need to retrain the system due to what some researchers refer to as “ML rot.”
Machine learning algorithms — and when to use them
If you already know what the labels for all the items in your data set are, assigning labels to new examples is a classification problem. If you’re trying to predict a result like the selling price of a house based on its size, that’s a regression problem because house price is a continuous rather than discrete category. (Predicting whether a house will sell for more or less than the asking price is a classification problem because those are two distinct categories.)
Sign up for CIO Asia eNewsletters.