How to Stop Worrying and Start Modeling Big Data with Better Algorithms and H2O

Srisatish Ambati (0xdata Inc), Cliff Click (0xdata)
Data Science Beekman Parlor - Sutton North
Average rating: 3.17 out of 5 (6 ratings)

Data modeling has long been constrained by scale; sampling still rules the day for ad hoc analytics. Scale brings much-needed change to the modeling world. In this talk we present the predictive power of using sophisticated algorithms on big datasets. With large data sizes comes the particularly hard problem of unbalanced data with multiple asymmetrically rare classes. Missing features pose unique problems for most classification and regression algorithms, and proper handling can lead to greater predictive power. In the race for better predictions, H2O makes practical techniques accessible to everyone through an easy-to-use software product.

H2O is an open source math and machine learning engine for big data that brings distribution and parallelism to powerful algorithms while keeping the widely used languages of R and JSON as an API. It integrates neatly into the popular data ecosystems of Hadoop, Amazon S3, NoSQL and SQL. We briefly discuss design choices in the implementation of Distributed Random Forest and Generalized Linear Modeling, and how they bring speed and scale to R, the vox populi of data science. We take a peek at the elegant Lego-like infrastructure that brings fine-grained parallelism to math over simple distributed arrays.
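To make the idea of fine-grained parallel math over a distributed array concrete, here is a minimal sketch in Python. It is not H2O code and does not reflect H2O's internals: the chunking scheme, the thread pool standing in for cluster nodes, and every function name below are assumptions chosen for illustration. Each chunk computes small partial sums (the map step), and the partials are combined pairwise (the reduce step) to give a mean and population variance.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a list into contiguous chunks (hypothetical stand-in for
    pieces of an array held on different nodes)."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(chunk):
    """Map step: per-chunk partials (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def reduce_partials(a, b):
    """Reduce step: combine two partial results element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def chunked_mean_variance(values, chunk_size=4, workers=4):
    """Mean and population variance computed from per-chunk partial sums."""
    if not values:
        raise ValueError("values must be non-empty")
    chunks = split_into_chunks(values, chunk_size)
    # A thread pool plays the role of the workers; a real cluster would
    # ship map_chunk to the node that owns each chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    n, total, total_sq = functools.reduce(reduce_partials, partials, (0, 0.0, 0.0))
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


if __name__ == "__main__":
    mean, variance = chunked_mean_variance(list(range(1, 11)), chunk_size=3)
    print(mean, variance)
```

Because only three numbers per chunk travel between workers, the same pattern extends to any statistic that can be written as a sum over rows.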

A short data-hacking demo presents the life cycle of data science: powerful data manipulation via R at scale, interactive summarization over large datasets, modeling using Elastic Net (GLM), grid search for the best parameters, and low-latency scoring.
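The modeling and grid search steps of that life cycle can be sketched on a single machine as follows. This is an illustrative Python sketch, not H2O's API or solver: it assumes a Gaussian linear model with no intercept, fits the elastic net by plain coordinate descent, and picks the penalty parameters by mean squared error on a validation set. The function names and the parameters `lam` (penalty strength) and `alpha` (L1/L2 mix) are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the L1 proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_elastic_net(X, y, lam, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)
    by cyclic coordinate descent. X is a list of rows; no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X*beta while beta is all zeros
    columns = [[row[j] for row in X] for j in range(p)]
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            denom = sum(c * c for c in col) / n + lam * (1 - alpha)
            if denom == 0.0:
                continue  # all-zero column with no ridge term: leave at 0
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_bj
    return beta


def mse(X, y, beta):
    """Mean squared prediction error of a linear model."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def grid_search(X_train, y_train, X_valid, y_valid, lambdas, alphas):
    """Try every (lam, alpha) pair; return (score, lam, alpha, beta)
    for the pair with the lowest validation error."""
    best = None
    for lam in lambdas:
        for alpha in alphas:
            beta = fit_elastic_net(X_train, y_train, lam, alpha)
            score = mse(X_valid, y_valid, beta)
            if best is None or score < best[0]:
                best = (score, lam, alpha, beta)
    return best


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    y = [2.0 * a - 1.0 * b for a, b in X]  # noiseless toy data
    score, lam, alpha, beta = grid_search(X, y, X, y, [0.0, 1.0], [0.5, 1.0])
    print(lam, alpha, beta, score)
```

Scoring a new row is then a single dot product with the selected coefficients, which is why low-latency scoring follows naturally from a fitted linear model.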

Srisatish Ambati

0xdata Inc

Sri is co-founder and CEO of 0xdata (@hexadata), the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Before 0xdata, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Prior to that, Sri co-founded Platfora and was the Director of Engineering at DataStax. Before that, Sri was Partner & Performance Engineer at the Java multi-core startup Azul Systems, tinkering with the entire ecosystem of enterprise apps at scale. Before that, Sri was on sabbatical pursuing theoretical neuroscience at Berkeley. Prior to that, Sri worked on a NoSQL trie-based index for semistructured data at the in-memory index startup RightOrder.

Sri is known for his knack for envisioning killer apps in fast-evolving spaces and assembling stellar teams to productize that vision. A regular speaker on the big data, NoSQL and Java circuit, Sri leaves a trail at @srisatish.

Cliff Click

0xdata

Cliff Click is the CTO and co-founder of 0xdata, a firm dedicated to creating a new way to think about web-scale data munging and real-time analytics. He wrote his first compiler when he was 15 (Pascal to TRS Z-80!), although Cliff's most famous compiler is the HotSpot Server Compiler (the Sea of Nodes IR). Cliff helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500 GB heaps to under 10 ms, and worked on all aspects of that JVM. Before that he worked on HotSpot at Sun Microsystems, and was at least partially responsible for bringing Java into the mainstream.

Cliff is invited to speak regularly at industry and academic conferences and has published many papers about HotSpot technology. He holds a PhD in Computer Science from Rice University and about 15 patents.

Comments

10/30/2013 8:32pm EDT

Would it be possible to post the slides here, like the other speakers have?
