Data Availability and Integrity in Apache Hadoop

Steve Loughran (Hortonworks)
Hadoop: Tools & Technology
Location: Room 1-6

Although Hadoop is designed to be resilient to the loss of hard disks or individual servers, the failure of core services can make the cluster temporarily unavailable, while other failures in a datacentre may lead to the permanent loss of data.

This talk looks at the risks, from the hardware to the entire software stack, using real data from customer sites to estimate their likelihood.

It introduces the best practices for availability, failure recovery and disaster recovery for Hadoop clusters.

Finally, it covers ongoing work for High Availability in the Hadoop platform, including filesystem snapshots and disaster recovery.

Subtopics

How does Hadoop fail?
  1. Hardware: RAM, HDD, Network
  2. Core services: the NameNode, JobTracker and other services
  3. “Disasters”: what can go really wrong

Is this a real threat?

Hard data from Yahoo! and Hortonworks customers, together with published research from Google and Microsoft, show which risks matter the most.

What can be done about this?
  * What are the best practices for keeping data intact and available?
  * What should the disaster recovery plan for a Hadoop cluster be?

How can Hadoop get better?

Which recent changes to Hadoop mitigate some of these risks? They include improved failover and recovery of the core services, filesystem snapshots and other new features.

Conclusions

Real-world data shows that there are small yet measurable risks to the availability of a Hadoop cluster, and to the data within it. Recent changes to the Hadoop platform will reduce these risks, but an understanding of them, and of the strategies to mitigate them, is still essential.

Steve Loughran

Hortonworks

Steve Loughran is a member of technical staff at Hortonworks, where he works on leading-edge issues with the Hadoop ecosystem, including service failure modes and availability.

Prior to joining Hortonworks, he worked at HP Laboratories on large-scale distributed systems, including cloud computing infrastructures. He is the author of Ant in Action, and is one of the very few UK-based Hadoop committers.
