How Crunch Makes Writing, Testing and Running of MapReduce Pipelines Easy, Efficient and Even Fun!

Josh Wills (Cloudera)

Tools like Pig, Hive, and Cascading ease the burden of writing MapReduce pipelines by defining Tuple-oriented data models and providing support for filtering, joining and aggregating those records. However, many data sets do not naturally fit into the Tuple model, such as images, time series, audio files and seismograms. To process data in these binary formats, developers often go back to writing MapReduce jobs using the low-level Java APIs.

In this session, Cloudera Data Scientist Josh Wills will share insights and how-to tricks about Crunch, a Java library that aims to make writing, testing and running MapReduce pipelines over any type of data easy, efficient and even fun. Crunch’s design is modeled after Google’s FlumeJava library and focuses on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution on the Hadoop cluster.
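To make the idea of a few primitives combined with small user-defined functions concrete, here is a minimal in-memory sketch in plain Java. It is not the Crunch API and runs no MapReduce jobs: the class `PipelineSketch` and the signatures of `parallelDo`, `groupByKey` and `combineValues` are hypothetical, with names chosen only to echo FlumeJava-style primitives. It shows a word count built by chaining three generic operations.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical illustration only: an in-memory stand-in for pipeline primitives.
public class PipelineSketch {

    // Apply a user-defined function to every record; each record may emit zero or more outputs.
    static <S, T> List<T> parallelDo(List<S> input, Function<S, List<T>> fn) {
        List<T> out = new ArrayList<>();
        for (S record : input) {
            out.addAll(fn.apply(record));
        }
        return out;
    }

    // Collect all values that share a key (a sorted map keeps the output deterministic).
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce each key's values to a single value with an associative combiner.
    static <K extends Comparable<K>, V> Map<K, V> combineValues(
            Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> out = new TreeMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            V acc = null;
            for (V value : entry.getValue()) {
                acc = (acc == null) ? value : combiner.apply(acc, value);
            }
            out.put(entry.getKey(), acc);
        }
        return out;
    }

    // A multi-stage "pipeline": tokenize lines, group by word, sum the counts.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = parallelDo(lines, line -> {
            List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
            return emitted;
        });
        return combineValues(groupByKey(pairs), Integer::sum);
    }

    public static void main(String[] args) {
        // prints {be=2, do=1, not=1, or=1, to=3}
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

In this sketch every stage runs eagerly in one process. A library of the kind described above would instead record the chained operations and plan them into MapReduce jobs before execution.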

Josh Wills

Cloudera

Josh Wills is the director of data science at Cloudera. Wills is one of the main contributors to Cloudera’s most recent open source project, Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun.

Prior to joining Cloudera, Wills was a software engineer at Google. He holds an M.S.E. in operations research from the University of Texas and a B.S. in mathematics from Duke University.
