Navigating the Data Pipeline

Tim Moreton (Acunu)
Real-time
Location: Sutton South

At the heart of every system that harnesses big data is a pipeline with three stages: collecting large volumes of raw data, extracting value from it through analytics or data transformations, then delivering that condensed set of results back out, potentially to millions of users.

This talk examines the challenges of building manageable, robust pipelines, and presents the pipeline as a simplifying paradigm to help participants looking to architect their own big data systems.

I’ll look at what you want from each of these stages — using Google Analytics as a canonical big data example, as well as case studies of systems deployed at LinkedIn. I’ll look at how collecting, analyzing and serving data pose conflicting demands on the storage and compute components of the underlying hardware. I’ll talk about what available tools do to address these challenges.

I’ll move on to consider two holy grails: real-time analytics, and dual data center support. The pipeline metaphor highlights a challenge in deriving real-time value from huge datasets: I’ll explore what happens when you compose multiple, segregated platforms into a single pipeline, and how you can dodge the issue with a ‘fast’ and ‘slow’ two-tier architecture. Then I’ll look at how you can factor dual data center support into the design, which is particularly important for highly available deployments on the cloud.

In summary, this talk will present a useful metaphor for architecting big data systems and use deployed examples to describe how to fit the available tools together for a range of settings.

Tim Moreton

Acunu

Tim is Founder and CEO at Acunu, where he works to close the gap between the demands of the data deluge and the opportunity afforded by modern hardware. He holds a PhD in Computer Science from Cambridge University, where he worked on distributed storage systems within the project that created the Xen virtualisation platform as part of an early cloud blueprint. Previously, Tim ran a consultancy delivering data analytics for air traffic control and was a senior member of the technical team at Tideway (now BMC), where he led the creation of solutions for managing data centres at Fortune 500 clients.
