If you're buying a home or looking for an apartment, Zillow.com will most likely come to mind first, a branding triumph for a website that launched only 10 years ago. Today the Zillow Group is a public company with $645 million in revenue that also operates websites for mortgage and real estate professionals -- and completed the acquisition of its nearest competitor, Trulia, last year.
From the start, Zillow offered the "Zestimate," its value-forecasting feature for homes in locations across the United States. Currently, Zillow claims to have Zestimates for more than 100 million homes, with 100-plus attributes tracked for each property. The technology powering Zestimates and other features has advanced steadily over the years, with open source and cloud computing playing increasingly important roles.
Last week I interviewed Stan Humphries, chief analytics officer at Zillow, along with Jasjeet Thind, senior director of data science and engineering. With diverse data sources, a research group staffed by a dozen economists, and predictive modeling enhanced by a large helping of machine learning, Zillow has made major investments in big data analytics as well as the talent to ensure visitors get what they want. Together, Humphries and Thind preside over a staff of between 80 and 90 data scientists and engineers.
An analytics platform grows up
Humphries says Zillow's technology has evolved in three phases, the common thread being the R language, which the company's data scientists have used for predictive modeling from the beginning. At first, R was used for prototyping to "figure out what we wanted to do and what the analytic solution looked like." Data scientists would write up specifications that described the algorithm, which programmers would then implement in Java and C++ code.
That system worked reasonably well, says Humphries, but with serious shortcomings:
"You could bring in people from the data science side who were great with machine learning and methodologies -- and separate that really expansive, creative thinking on the solution side from the actual implementation side. That was the attractiveness of that model. The downside to that model is that it's very slow ... You sometimes end up in a suboptimal situation when you've let a data scientist think up a solution before letting an engineer think about how it's actually being implemented."
By the same token, he says, troubleshooting required an awkward round trip. A problem that arose in a production system running C++ or Java had to be diagnosed by a data scientist who was accustomed to working in R.
The second phase of Zillow's technology development was mainly about developing parallelization frameworks, so more of the production implementation could be done in R and less recoding in C++ and Java would be required. Using R in production required an investment in more powerful hardware, or "scaling vertically by getting bigger and bigger machines," as Humphries puts it. Additionally, for certain batch jobs such as recomputing the value of homes over decades, Zillow turned to the Amazon cloud.