Performing Data Science with HBase

Aaron Kimball (Zymergen, Inc.), Kiyan Ahmadizadeh (WibiData, Inc.)
Data Science Hadoop: Tools & Technology, Grand East (NY Hilton)

Performing investigative analysis on data stored in HBase is challenging. Most higher-level tools designed for analyzing data stored in Hadoop were built principally to operate on file-backed datasets stored in HDFS. Extensions of these tools that operate on data stored in HBase do not offer the degree of integration or support that users are accustomed to, and direct use of MapReduce tends to be too cumbersome for exploratory analysis.

Furthermore, the HBase data model, with large numbers of sparsely stored columns per row and a third timestamp dimension, tends to graft poorly onto the “traditional” row/column data model, or onto a model that expects to operate on (key, value) pairs or tuples, as seen in systems like Hive or Pig.
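
The mismatch can be illustrated with a minimal sketch in plain Python. It does not use any HBase client library; the nested dictionary, the `put` helper, and the sample row keys and column names are all hypothetical stand-ins chosen for illustration. The sketch models a table as row key -> column -> timestamp -> value, then shows two lossy or awkward ways of flattening it: one row/column record per key (which discards older versions and leaves absent columns undefined) and a list of tuples (which keeps every version but loses the per-row grouping).

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HBase table (not a real client API):
# row key -> column ("family:qualifier") -> timestamp -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row, column, timestamp, value):
    """Store one timestamped cell version, as an HBase put would."""
    table[row][column][timestamp] = value


# Sparse data: user2 has no "clicks" column at all,
# and user1 has two versions of the same column.
put("user1", "info:email", 100, "a@example.com")
put("user1", "info:email", 200, "a2@example.com")
put("user1", "clicks:item42", 150, "1")
put("user2", "info:email", 120, "b@example.com")


def latest_as_rows(tbl):
    """Flatten to one record per row, keeping only the newest version per column."""
    return {
        row: {col: versions[max(versions)] for col, versions in columns.items()}
        for row, columns in tbl.items()
    }


def as_tuples(tbl):
    """Flatten to (row, column, timestamp, value) tuples, keeping every version."""
    return sorted(
        (row, col, ts, val)
        for row, columns in tbl.items()
        for col, versions in columns.items()
        for ts, val in versions.items()
    )


rows = latest_as_rows(table)
tuples = as_tuples(table)
```

In this sketch, `rows` drops the older email address for `user1` and gives `user2` no `clicks:item42` entry at all, while `tuples` retains all four cell versions but no longer presents each user as a single record.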

Regardless, large amounts of data are stored in HBase, especially data about users intended for use in an online system such as an e-commerce site, gaming platform, or ad network. Data scientists must be able to perform investigative analysis on this information to better understand their business and improve these online processes. The read/write model of HBase also offers advantages over HDFS to the data scientist building complex analysis pipelines.

In this talk we will describe how user data is typically stored in HBase and review the types of analysis often relevant in exploratory or investigative contexts. We will then describe best practices for modeling HBase-resident user data efficiently, and survey tools and techniques for exploring this data in an agile fashion appropriate for data science teams. Finally, we will describe our own team’s experience working with specific tool chains assembled for this purpose.

Aaron Kimball

Zymergen, Inc.

Aaron is the Founder and CTO of WibiData, Inc., a software company that engineers solutions for the large-scale user-centric data challenges that face today’s enterprises. He is a committer on the Apache Hadoop project and has been working with Hadoop since 2007. Aaron previously worked at Cloudera, a company that provides an enterprise platform, support, and services built around Hadoop. Aaron founded the open source Apache Sqoop data import tool and Apache MRUnit Hadoop testing library projects. Aaron holds a B.S. in Computer Science from Cornell University and an M.S. in Computer Science from the University of Washington.

Kiyan Ahmadizadeh

WibiData, Inc.

Kiyan joined WibiData in 2011. He holds a BS in Computer Science from Penn State and an MS in Computer Science from Cornell. His graduate research focused on applying large-scale data mining and machine learning to the areas of optimization and multi-agent systems. Kiyan enjoys writing, baking, comic books, and video gaming when he gets the chance.
