That brings us to the third phase for Zillow: a migration to AWS (Amazon Web Services), with the long-term goal of moving the entire operation there. At the same time, the company is making the leap from proprietary to open source technologies.
Embracing the cloud and open source
Conventional wisdom says that, although the cloud may be ideal for short-term jobs, the ongoing cost of cloud services becomes too burdensome over the long haul. Not so in the case of Zillow, says Thind:
We did a fairly deep analysis on our cost at the kilobyte level, the amount of data we project we're going to have. We're storing a lot of our data in S3, which is relatively cheap, and of course using Glacier to make it even cheaper.
Thind says Zillow is also opting to use the Amazon Kinesis streaming data platform, because he found the cost of using Kinesis "fairly reasonable," considering it's a managed service. The other candidate was Kafka, the open source messaging system, but the convenience of Kinesis tipped the balance.
In other cases, open source technologies deployed in the cloud by Zillow have won out. Zillow started with Microsoft SQL Server as its primary data store, but on AWS it's in the process of moving to Redis, an open source, key-value-pair database. Thind says the company considered migrating to the AWS-native database service DynamoDB instead, but it was "just not cost-effective."
Zillow has also become a major Spark shop. According to Thind:
Zestimate is composed of lots of different types of machine learning models, and we can run a lot of these models in parallel. There are 3,000 counties and we want to be able to run these models in parallel across multiple nodes. With Spark we're able to achieve that -- run a lot of things in parallel and produce Zestimates much faster and more frequently.
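The pattern Thind describes, fitting many independent models at once, can be illustrated with a minimal sketch. It uses plain Python rather than Spark, with a thread pool standing in for cluster nodes. The county names, sales figures, and function names below are hypothetical and do not come from Zillow, and the single-variable regression is only a stand-in for whatever models the company actually runs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: county -> list of (square_feet, sale_price).
SALES = {
    "county_a": [(1000, 200_000), (1500, 300_000), (2000, 400_000)],
    "county_b": [(1000, 150_000), (2000, 250_000), (3000, 350_000)],
}


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_county(item):
    """Fit one county's model. Each county is independent of the others."""
    county, sales = item
    xs = [sqft for sqft, _ in sales]
    ys = [price for _, price in sales]
    return county, fit_line(xs, ys)


def fit_all(sales_by_county, max_workers=4):
    """Fan the per-county fits out across workers and collect the results.

    Threads keep the sketch simple; CPU-bound work at real scale would need
    separate processes or a cluster framework instead.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fit_county, sales_by_county.items()))


def estimate(models, county, sqft):
    """Apply a county's fitted line to a home of the given size."""
    slope, intercept = models[county]
    return slope * sqft + intercept
```

Calling `fit_all(SALES)` returns one fitted model per county, and `estimate(models, "county_a", 1200)` prices a 1,200-square-foot home with that county's model. Because no county depends on another, adding workers shortens the overall run, which is the property that lets the work spread across multiple nodes.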
For machine learning, Zillow employs various decision tree, random forest, and regression algorithms and is currently prototyping with deep learning. Soon, says Thind, Zillow will use Spark as a deep learning platform as well. "Primarily, we're leveraging a lot of the machine learning stuff out of R through Spark," he says. The company is also using Spark Streaming to feed data to the models, with Amazon Kinesis as the underlying platform.
Data in, building out
One effect of Zillow's leading position is that it has become a canonical data source. "It's our focus that Zillow be the largest, most trusted, vibrant marketplace for real estate information. That mission is best served by making our data the most authoritative and the most ubiquitous out there," says Humphries.
Clean real estate data is not easy to come by. According to Humphries, one of Zillow's core innovations from the start was to ingest data from various standard sources and integrate it into "a reconciled representation of real estate facts." This meant dealing with a host of different data formats -- plus, not all the data was digital, which meant much had to be keyed in. Altogether, he characterizes it as "very messy, noisy data."