For those interested in the details, back propagation uses the gradient of the error (or cost) function with respect to the weights and biases of the model to discover the direction in which to adjust them to reduce the error. Two things control how the corrections are applied: the optimization algorithm and the learning rate, which usually needs to be small to guarantee convergence and to avoid causing dead ReLU neurons.
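To make that concrete, here is a single gradient-descent update for a one-neuron linear model with squared error. The data, initial weights, and learning rate are illustrative assumptions, not values from any particular framework:

```python
import numpy as np

# Training data for a tiny linear model y_hat = w*x + b; the true
# relationship is y = 2x, so the ideal parameters are w = 2, b = 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w, b = 0.5, 0.0   # initial guesses
lr = 0.05         # small learning rate, to step without overshooting

# Forward pass and error
error = (w * x + b) - y

# Gradients of the mean squared error with respect to w and b
grad_w = 2.0 * np.mean(error * x)
grad_b = 2.0 * np.mean(error)

# Step *against* the gradient, scaled by the learning rate
w -= lr * grad_w
b -= lr * grad_b
```

This one step moves w from 0.5 to 1.2, toward the true slope of 2; a much larger learning rate would overshoot the minimum and could fail to converge at all.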
Optimizers for neural networks typically use some form of gradient descent algorithm to drive the back propagation, often with a mechanism to help avoid becoming stuck in local minima, such as optimizing randomly selected minibatches (Stochastic Gradient Descent) and applying momentum corrections to the gradient. Some optimization algorithms also adapt the learning rates of the model parameters by looking at the gradient history (AdaGrad, RMSProp, and Adam).
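A minimal sketch of minibatch SGD with momentum for a single weight, assuming synthetic data and illustrative hyperparameters (this mirrors the idea, not any specific library's optimizer API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 3x plus a little noise.
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

w = 0.0          # the single model parameter
velocity = 0.0   # momentum's running accumulation of past gradients
lr, momentum, batch_size = 0.1, 0.9, 32

for epoch in range(20):
    # The "stochastic" part: shuffle into randomly selected minibatches.
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        # Mean-squared-error gradient computed on this minibatch only
        grad = 2.0 * np.mean((w * x[idx] - y[idx]) * x[idx])
        # Momentum smooths the noisy minibatch gradients and helps the
        # update coast through small local irregularities.
        velocity = momentum * velocity - lr * grad
        w += velocity
```

Adaptive methods such as AdaGrad, RMSProp, and Adam go one step further and rescale the step size per parameter using the gradient history, rather than using a single fixed learning rate.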
As with all machine learning, you need to check the predictions of the neural network against a separate test data set. Without doing that, you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.
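The memorization risk is easy to demonstrate with a hypothetical "model" that is just a lookup table of its training examples; the data and the 80/20 split below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 examples, 3 features; the label depends on feature 0.
x = rng.normal(size=(100, 3))
y = (x[:, 0] > 0).astype(int)

# Hold out 20% of the data as a test set (a common convention,
# not a requirement).
order = rng.permutation(len(x))
split = int(0.8 * len(x))
train_idx, test_idx = order[:split], order[split:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

# A "model" that only memorizes: a lookup table from inputs to labels.
memorized = {tuple(row): label for row, label in zip(x_train, y_train)}

# Perfect score on the training data...
train_acc = np.mean([memorized[tuple(r)] == t for r, t in zip(x_train, y_train)])
# ...but every test input is unseen, so the table has nothing to say about it.
unseen = sum(tuple(r) not in memorized for r in x_test)
```

The lookup table scores 100% on its training inputs yet cannot produce a prediction for a single test example, which is exactly the failure mode a held-out test set exposes.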
Now that you know something about machine learning and neural networks, it's only a small step to understanding the nature of deep learning algorithms.
The dominant deep learning algorithms are deep neural networks (DNNs), which are neural networks constructed from many layers (hence the term "deep") of alternating linear and nonlinear processing units, and are trained on massive amounts of data using large-scale training algorithms. A deep neural network might have 10 to 20 hidden layers, whereas a typical neural network may have only a few.
The more layers in the network, the more characteristics it can recognize. Unfortunately, the more layers in the network, the longer it will take to calculate, and the harder it will be to train.
Another kind of deep learning algorithm is the random decision forest (RDF). Again, it is constructed from many layers, but instead of neurons, an RDF is built from decision trees and outputs a statistical average (the mode for classification, the mean for regression) of the predictions of the individual trees. The randomized aspects of RDFs are the use of bootstrap aggregation (bagging) to train individual trees on random samples of the data, and the use of random subsets of the features.
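Here is a heavily simplified sketch of those two randomized ingredients, using one-feature threshold "stumps" in place of full decision trees; the data, stump model, forest size, and feature-subset size are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic classification data: the label depends on features 0 and 1;
# features 2-4 are pure noise.
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def fit_stump(X, y, feature):
    """Find the threshold (and polarity) on one feature that best splits y."""
    best_acc, best_t, best_sign = 0.0, 0.0, 1
    for t in X[:, feature]:
        acc = np.mean((X[:, feature] > t).astype(int) == y)
        for sign, a in ((1, acc), (-1, 1.0 - acc)):
            if a > best_acc:
                best_acc, best_t, best_sign = a, t, sign
    return (feature, best_t, best_sign)

def predict_stump(stump, X):
    feature, t, sign = stump
    pred = (X[:, feature] > t).astype(int)
    return pred if sign == 1 else 1 - pred

forest = []
for _ in range(25):
    # Bagging: each "tree" trains on a bootstrap sample of the data.
    idx = rng.integers(0, len(X), size=len(X))
    Xb, yb = X[idx], y[idx]
    # Feature randomization: each tree only considers a random feature subset.
    feats = rng.choice(X.shape[1], size=2, replace=False)
    stumps = [fit_stump(Xb, yb, f) for f in feats]
    # Keep the stump that scores best on this tree's bootstrap sample.
    forest.append(max(stumps, key=lambda s: np.mean(predict_stump(s, Xb) == yb)))

# The forest's output is the mode (majority vote) of the individual trees.
votes = np.array([predict_stump(s, X) for s in forest])
ensemble = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = float(np.mean(ensemble == y))
```

No single stump can capture the label well, and some trees see only noise features, yet the majority vote of the ensemble comfortably beats chance: the bagging and feature randomization decorrelate the trees so that their errors partly cancel.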
Understanding why deep learning algorithms work is nontrivial. I won't say that nobody knows why they work, since there have been papers on the subject, but I will say there doesn't seem to be widespread consensus about why they work or how best to construct them.
The Google Brain people creating the deep neural network for the new Google Translate didn't know ahead of time what algorithms would work. They had to iterate and run many weeklong experiments to make their network better, but sometimes hit dead ends and had to backtrack. (According to the New York Times article cited earlier, "One day a model, for no apparent reason, started taking all the numbers it came across in a sentence and discarding them." Oops.)