Building Data Products with Hadoop

Sam Shah (LinkedIn)
Practitioner
Location: Mission City B5

Hadoop is responsible for computing a wide array of data products at LinkedIn, including People You May Know (LinkedIn’s people recommendation service), People Who Viewed This Also Viewed (LinkedIn’s collaborative filtering), Who’s Viewed My Profile?, Career Center, LinkedIn’s job recommendations, and more. These products are immensely successful and extremely data intensive: People You May Know, for example, generates a significant portion of the invitations on LinkedIn and churns through more than 50 TB of data every day.

In this talk, I will detail the pieces of infrastructure that allow us to make this happen (all open sourced), so that attendees can build their own data products. I will also give tips and tricks that we have learned, sometimes painfully, along the way. This talk is geared towards the intermediate Hadoop user who perhaps has a few jobs that compute some data, but wants to learn how to turn them into a productionized process. There will also be some nuggets for advanced users on how LinkedIn deals with big data.

The talk will be divided into four “proverbs,” as follows.

  1. “Tall oaks grow from little acorns.” I will explain Azkaban, our Hadoop scheduler, which takes individual jobs and constructs them into flows that can be scheduled, restarted, and monitored.
  2. “Don’t put the cart before the horse.” Once data is computed in batch, we need to serve and update this data at scale. At LinkedIn, we use Voldemort, which serves the computed data on our website. A common pattern is to re-compute the data in Hadoop and push it periodically, with the served data being read-only. The “read-only” extensions of Voldemort allow building stores offline in Hadoop and serving read-only traffic with low latency and high throughput.
  3. “A stitch in time saves nine.” The talk will also touch on how we are able to quickly iterate on our data models and push new models to production. I’ll also discuss the need for data verification, and for being careful with how your data pipeline is constructed. I will present examples of how we perfunctorily constructed some jobs to our later detriment.
  4. “Half a loaf is better than none.” Now that we have this process, how do we make it faster? The common performance bottleneck, and the one faced by most of LinkedIn’s data products, is intermediate data I/O. I will discuss the various measures we employ to deal with this, such as using Bloom filters for inexact joins, normalizing large keys, avoiding “the curse of the last reducer,” increasing map locality, etc.
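
To give a flavor of the Bloom filter idea in the fourth proverb, here is a minimal single-process sketch in Python. It is an illustration only, not LinkedIn's implementation and not Hadoop code: the `BloomFilter` class, the `bloom_join` helper, the filter sizes, and the sample data are all hypothetical. The point is that a small filter built from the keys of the smaller data set can discard most non-matching rows of the larger data set before they become intermediate data, at the cost of a few false positives that an exact join then removes.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are arbitrary illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from two halves of one digest
        # (double hashing).
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_join(small, large, bloom):
    """Join two lists of (key, value) pairs, pre-filtering `large` with `bloom`."""
    for key, _ in small:
        bloom.add(key)
    # Filter step: drop rows from the large side that cannot possibly match.
    # In a MapReduce setting this would run map-side, before the shuffle.
    survivors = [(k, v) for k, v in large if k in bloom]
    # Exact join on the survivors removes any false positives.
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in lookup.get(k, [])]
    return joined, len(survivors)


# Hypothetical data: a small keyed table and a much larger one.
small = [(1, "a"), (2, "b")]
large = [(i, "view-%d" % i) for i in range(1000)]
joined, kept = bloom_join(small, large, BloomFilter())
print(joined, kept)
```

In this sketch only the rows counted in `kept` would need to be shuffled for the join, which is where the intermediate I/O saving comes from. If a downstream job can tolerate a few spurious rows, the exact step could be dropped and the join left inexact.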

Sam Shah

LinkedIn

Sam Shah is a Senior Software Engineer in the Search, Network, and Analytics Team at LinkedIn, working on applied data products. He is particularly involved in the relevance backends behind “People You May Know,” LinkedIn’s people recommendation service, and LinkedIn’s collaborative filtering system. He holds a Ph.D. from the University of Michigan.

Comments

Anthony Cassandra
02/02/2011 10:45pm PST

I enjoyed this session, though it was less about Hadoop itself and more about the practical aspects of designing, developing and deploying large data analysis processes. Seeing all the real-world constraints laid out on top of the basic data-flow and the complexity this adds is an important, but often ignored or underestimated consideration in live running, real-world systems.
