Text classification, a branch of natural language processing, is a useful way to extract insights from the vast amount of unstructured digital text on the Web. But doing it effectively can be costly and time-consuming.
Expert data scientists Michael Tamir from Galvanize, a tech education company, and Daniel Hansen from Personagraph, an audience intelligence company, discussed at the recent Predictive Analytics World event in San Francisco how Google's Word2Vec addresses this problem.
Word2Vec, which Google open sourced in 2013, is a tool that bundles two algorithms (continuous bag-of-words and skip-gram) and uses neural networks to learn a vector representation of each word that is useful for predicting the other words around it in a sentence.
"What we're doing is ... running Word2Vec on a very large corpus of text, which includes 100 billion words or something on that scale," Hansen said.
"So you have a topic [vector], which will have some length and direction. What we've seen in practice is that if you look at other words that have vectors in this neighbourhood of your topic vector, they'll be highly correlated. So the vector of music will be very close to the vector of song or tune or melody, both in length and direction."
Tamir said that working with a space of vectors and not just a bag of individual terms means you can make use of all sorts of mathematical structures, whereas you cannot necessarily do this with a set of words.
"You're mapping from terms into a vector space. And that vector space has typically around 300-400 dimensions.
"I can subtract vectors, I can rotate the vectors, I can look at how far one vector is from another. So by embedding these words into a vector space, we can capture a lot of structure," he said.
Other approaches to supervised text classification can be expensive, Tamir said. Using an auto-encoder for feature compression during pre-training, for example, has a cost that scales with the number of nodes in the taxonomy. It can get expensive once the taxonomy tree used to classify each document reaches 5,000 to 10,000 trained nodes, he said.
"And you are going to need a lot of training data for each one of those nodes," he added. "That means finding a way to pay for the data, getting the data on your own, paying an intern to do it. It can very expensive if you want to do this directly through a traditional supervised [neural] net."
Hansen said Word2Vec has made pre-training "very easy" and is a relatively low-investment entry point into deep learning for text classification.