Building Data Products with Hadoop

Sam Shah (LinkedIn)
Practitioner
Location: Mission City B5
Average rating: ***..
(3.78, 9 ratings)

Hadoop is responsible for computing a wide array of data
products at LinkedIn, including People You May Know (LinkedIn’s
people recommendation service), People Who Viewed This Also
Viewed (LinkedIn’s collaborative filtering), Who’s Viewed My
Profile?, Career Center, LinkedIn’s job recommendations, and
more. These products are immensely successful and extremely data
intensive: People You May Know, for example, generates a
significant portion of the invitations on LinkedIn, churning
through over 50 TB of data every day.

In this talk, I will detail the pieces of infrastructure that
allow us to make this happen (all open sourced), which will allow
an attendee to build their own data products. I will also give
tips & tricks that we have learned, sometimes painfully, along
the way. This talk is geared towards the intermediate Hadoop user
who perhaps has a few jobs that compute some data, but wants to
learn how to put this into a productionized process. There will
also be some nuggets for advanced users on how LinkedIn deals
with big data.

The talk will be divided into four “proverbs,” as follows.

  1. “Tall oaks grow from little acorns.” I will explain Azkaban,
    our Hadoop scheduler, which takes individual jobs and constructs
    them into flows that can be scheduled, restarted, and monitored.
  2. “Don’t put the cart before the horse.” Once data is computed in
    batch, we need to serve and update this data at scale. At
    LinkedIn, we use Voldemort, which serves the computed data on our
    website. A common pattern is to recompute the data in Hadoop and
    push it periodically, with the served data remaining read-only.
    The “read-only” extensions of Voldemort allow building stores
    offline in Hadoop and serving read-only traffic with low latency
    and high throughput.
  3. “A stitch in time saves nine.” The talk will also touch on how
    we are able to quickly iterate on our data models, and push new
    models to production. I’ll also discuss the necessity of data
    verification and of taking care in how your data pipeline is
    constructed. I will present examples of jobs that we constructed
    carelessly, to our later detriment.
  4. “Half a loaf is better than none.” Now that we have this process,
    how do we make it faster? The common performance bottleneck, and
    the one faced by most of LinkedIn’s data products, is
    intermediate data I/O. I will discuss the various measures we
    employ to deal with this, such as using Bloom filters for inexact
    joins, normalizing large keys, mitigating “the curse of the last
    reducer,” and increasing map locality.
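
To illustrate the Bloom-filter idea from the fourth proverb, here is a minimal single-process sketch in Python. It is not LinkedIn's implementation: the class `BloomFilter`, the helpers `prefilter` and `join`, and the filter sizes are all hypothetical names and values chosen for illustration. The idea is that a compact filter built from the keys of the small side of a join lets the large side drop most non-matching records early, which in a Hadoop job would mean less intermediate data reaching the reducers. The filter can return false positives, so an exact check still follows.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per seed from a salted SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def prefilter(large_records, small_keys):
    """Inexact step: keep only records whose key might be on the small side."""
    bloom = BloomFilter()
    for key in small_keys:
        bloom.add(key)
    return [record for record in large_records if record[0] in bloom]


def join(large_records, small_table):
    """Exact join over the survivors, which removes any false positives."""
    survivors = prefilter(large_records, small_table.keys())
    return [
        (key, value, small_table[key])
        for key, value in survivors
        if key in small_table
    ]
```

In a real MapReduce flow the filter would typically be built in one job and shipped to the mappers of the next; that distribution step is outside the scope of this sketch.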

Sam Shah

LinkedIn

Sam Shah is a Senior Software Engineer in the
Search, Network, and Analytics Team at LinkedIn,
working on applied data products. He is
particularly involved in the relevance backends
behind “People You May Know,” LinkedIn’s people
recommendation service, and LinkedIn’s
collaborative filtering system. He holds a Ph.D.
from the University of Michigan.


Comments

Anthony Cassandra
02/02/2011 10:45pm PST

I enjoyed this session, though it was less about Hadoop itself and more about the practical aspects of designing, developing and deploying large data analysis processes. Seeing all the real-world constraints laid out on top of the basic data-flow and the complexity this adds is an important, but often ignored or underestimated consideration in live running, real-world systems.
