©2011, O'Reilly Media, Inc.
Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once an EC2 cluster has been reserved, the data distributed across Hadoop nodes, and a clever MapReduce machine learning algorithm run to completion, the system spits out a predictive model. The model, we hope, allows an organization to conduct its business better.
This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions, and the resultant model’s quirks must be interpreted through the lens of domain expertise. To paraphrase the statistician George Box: all big data models are wrong, but some are useful.
A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias/variance trade-off. Put more simply, applied business knowledge is what lets us trust that a model trained on historical data will perform decently in situations we have never seen.
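The in-sample trap can be made concrete with a minimal Ruby sketch (not from the talk; the toy data and both "models" are invented for illustration). A lookup table that memorizes its training set scores perfectly in-sample yet is clueless on unseen inputs, while even a crude mean predictor generalizes better:

```ruby
# Toy data: pairs of [x, y] where y is roughly 2x plus noise.
train = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.0]]
test  = [[2.5, 5.0], [3.5, 7.1]]

# High-variance "model": memorize the training points verbatim.
memorized = train.to_h
lookup = ->(x) { memorized.fetch(x, 0.0) }  # returns 0.0 off the training set

# High-bias "model": always predict the training mean.
mean_y = train.sum { |_, y| y } / train.size
mean_model = ->(_x) { mean_y }

# Mean squared error of a model over a data set.
mse = ->(model, data) {
  data.sum { |x, y| (model.call(x) - y)**2 } / data.size
}

puts "lookup in-sample MSE:      #{mse.call(lookup, train)}"      # 0.0 — "perfect"
puts "lookup out-of-sample MSE:  #{mse.call(lookup, test)}"       # far worse
puts "mean   out-of-sample MSE:  #{mse.call(mean_model, test)}"
```

The memorizer wins every in-sample comparison and loses badly out of sample; knowing which situations the business will actually face is what tells us where on that spectrum a model should sit.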
Models can also reveal predictors in our data we never expected. The business can learn from an automatic ranking of predictor importance, built with tools such as statistical entropy and multicollinearity diagnostics. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could trigger an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?
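One entropy-based way to rank predictors is information gain: how much knowing a variable reduces the entropy of the outcome. A hedged Ruby sketch, riffing on the box-office example above (the features, data, and labels are entirely hypothetical):

```ruby
# Shannon entropy of a list of class labels, in bits.
def entropy(labels)
  n = labels.size.to_f
  labels.tally.values.sum { |count| p = count / n; -p * Math.log2(p) }
end

# Information gain from splitting rows ([feature_hash, label] pairs)
# on one categorical feature: outcome entropy before minus after the split.
def info_gain(rows, feature)
  groups = rows.group_by { |features, _| features[feature] }
  split_entropy = groups.values.sum do |group|
    (group.size.to_f / rows.size) * entropy(group.map(&:last))
  end
  entropy(rows.map(&:last)) - split_entropy
end

# Invented training set: does a movie gross well at the box office?
rows = [
  [{ genre: :action, sequel: true  }, :hit],
  [{ genre: :action, sequel: true  }, :hit],
  [{ genre: :drama,  sequel: true  }, :hit],
  [{ genre: :action, sequel: false }, :flop],
  [{ genre: :drama,  sequel: false }, :flop],
  [{ genre: :drama,  sequel: false }, :flop],
]

# Rank predictors by information gain, highest first.
ranking = %i[genre sequel].sort_by { |f| -info_gain(rows, f) }
puts ranking.inspect  # => [:sequel, :genre]
```

Here the :sequel flag perfectly separates hits from flops (gain of 1.0 bit) while :genre barely helps, and that surprise in the ranking is exactly the kind of exploratory feedback the abstract describes.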
My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.
Ben was a professional software developer for ten years, and has been hacking code for much longer. His past clients include investment banks such as JPMorgan Chase and Credit Suisse, the hedge fund Natura Capital, and EdF Trading, an energy trading house. He built a taxonomy browser for Encyclopaedia Britannica in 2004, and previously worked for ThoughtWorks as a convert to agile software engineering.
Ben teaches and speaks on machine learning, software engineering, financial analysis, and the culture of quants. While living in London, Ben was an early contributor to the grassroots cartography project OpenStreetMap. He continues to manage a portfolio of financial assets via a quantitative trading strategy built upon sentiment and predictive analytics. He has an MSc in Finance from London Business School and a BEng in Computer Science from Northwestern University.