"It avoids the expense of state of the art neural net technologies. You end up with something that is much better than a standard bag of words representation for doing classification on because you get this vector compression effect for free. And this is very nice to do on a large taxonomy. You can scale up your taxonomy very easily and you don't have all these pre-training costs for each individual topic."
He added that the bag of words model in natural language processing performs quite poorly when there's a small training sample.
"Whereas with Word2Vec you can see there's substantial lift in terms of area under the curve, [a measure for classification accuracy]."
Label imbalance is another issue that Word2Vec can help address, Hansen said.
"You have a model that is 95 per cent accurate on a balanced test set, and say you have a 100 examples of 'class struggle' documents and 100 random documents and you can classify 95 per cent of them correctly.
"If you then use that model in the wild where 'class struggle' related documents are one in 10,000, your accuracy will drop significantly to less than 1 per cent. It's a fundamental problem with imbalance, it hampers any success you can get with text classification."
He compared Word2Vec (W2V) with bag of words (BOW) on the F1 score, the harmonic mean of precision and recall, which summarises a model's accuracy in a single number. With 300 features, BOW received an F1 score of about 93 per cent, which dropped significantly to about 88 per cent when subsetting down to 10 features.
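The F1 score combines precision and recall as their harmonic mean. A minimal sketch follows; the counts are invented for illustration and are not taken from Hansen's experiments.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 90 documents correctly flagged, 5 wrongly flagged, 10 missed.
score = f1_score(true_positives=90, false_positives=5, false_negatives=10)
print(f"{score:.1%}")
```

Because the harmonic mean is pulled towards the lower of the two values, a model cannot score well on F1 by trading away either precision or recall.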
"We do further feature selection we sharpen those distributions up, which is something necessary to deal with the imbalance problem. However, in doing so we significantly reduced our F1 score. So that's unfortunate and if you're going to work with a bag of words representation it's an intrinsic struggle to deal with," Hansen said.
When W2V is applied to the same features, the F1 score is higher overall and doesn't drop significantly when subsetting the features. With 300 features, it received an F1 score of about 96 per cent, and with 10 features about 94 per cent.
"The results are competitive with a highly optimised bag of words representation. And for the situation where you have a small number of training examples completely dominate, it beat it by a mile."