Avro Data

Doug Cutting (Cloudera)
Practitioner
Location: Mission City M
Average rating: 3.50 (8 ratings)

As storage costs have dropped, organizations can now afford to save the vast majority of the data that passes through them. Systems like Hadoop’s MapReduce let such data be easily analyzed and mined to improve businesses. However, classic data formats like CSV, XML, and gzipped archives serve such uses poorly. Some have weak data models. Others support rich data structures but are inefficient. Most integrate poorly with MapReduce.

Apache Avro data files define an expressive, efficient standard for representing large data collections. Avro supports rich, recursive datatypes and includes facilities for datatype evolution. In Avro, new datatypes may be defined and processed on the fly, which is useful for dynamic scripting and query languages. Avro data is compact and fast to process. Avro data files are compressed and MapReduce-friendly.
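
As a rough illustration of the points above, the sketch below defines a recursive Avro-style schema at runtime as plain data (Avro schemas are written in JSON) and shows the idea behind datatype evolution: a newer reader schema adds a field with a default, and records written under the older schema still resolve. It uses only the Python standard library. The `resolve` helper is a hypothetical, simplified stand-in for illustration, not the Avro library's API or its full schema-resolution rules.

```python
import json

# Writer schema: a recursive record (a linked list of longs),
# built on the fly as an ordinary Python dict.
writer_schema = {
    "type": "record",
    "name": "LongList",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["null", "LongList"]},
    ],
}

# Reader schema: a later version that adds a field with a default value.
reader_schema = {
    "type": "record",
    "name": "LongList",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["null", "LongList"]},
        {"name": "label", "type": "string", "default": ""},
    ],
}


def resolve(record, reader):
    """Hypothetical toy resolver: keep the reader's fields, fill in
    defaults for fields the writer did not know about, and recurse
    into nested records of the same (recursive) type."""
    out = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in record:
            value = record[name]
            if isinstance(value, dict):
                value = resolve(value, reader)
            out[name] = value
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out


# A record written under the old schema, read under the new one.
old_record = {"value": 1, "next": {"value": 2, "next": None}}
new_record = resolve(old_record, reader_schema)

# The schema itself is plain JSON, so it can travel with the data.
schema_json = json.dumps(writer_schema)
```

Because the schema is ordinary data, a script or query engine can construct one at runtime without a code-generation step, which is the property the abstract highlights for dynamic languages.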

This talk will describe how Avro achieves these capabilities and how applications can start incorporating Avro data today.

Doug Cutting

Cloudera

Founder of the Apache Lucene, Nutch, Hadoop, and Avro projects.

Comments

Fred Dushin
02/03/2011 8:47am PST

Had you considered binary formats that have already been vetted in the industry? I like your use of IDL (even if it’s not OMG IDL); CORBA’s CDR encoding probably has many of the properties you need, though I’m less certain about its ability to support splitting. ASN.1 may have many of the properties you need, as well.

Anthony Cassandra
02/02/2011 10:48pm PST

A good talk laying out the context of why a data representation is important in the current “big data” ecosystem and the problems that existing formats have.
