Building a Large-scale Data Collection System Using Flume NG

Hari Shreedharan (Cloudera Inc.), Will McQueen (Cloudera Inc.), Arvind Prabhakar (Cloudera), Prasad Mujumdar (Cloudera Inc.), Mike Percy (Cloudera)
Hadoop: Tools & Technology, Regent Parlor (NY Hilton)
Tutorial. Please note: to attend, your registration must include Tutorials.

Hadoop HDFS is typically adopted in situations where traditional storage and database systems are either reaching their limits or have already surpassed them. This usually implies that there are one or more large streams of events that need to be collected, such as log data streams. Flume NG was designed from the ground up to tackle this problem in a straightforward, scalable, and reliable way, and empirical results support the success of its approach.

At a high level, Flume NG has a simple, well-designed architecture consisting of a set of agents, with each agent running any number of sources, channels (event buffers), and sinks. Flume agents can easily be chained across the network to provide a configurable pipeline through which discrete events flow reliably from source (e.g., an application server) to destination (e.g., a Hadoop HDFS cluster). Flume can be configured to support arbitrary data flows, including fan-in (data aggregation) and fan-out (data replication) designs. Such designs follow directly from the generality of the agent-based architecture.
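
To make the architecture concrete, the sketch below shows a minimal single-agent configuration in the Java properties format Flume NG uses; the agent and component names (agent1, src1, ch1, sink1) are our own placeholders.

    # Declare one source, one channel, and one sink for this agent.
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Avro RPC source: accepts events from upstream clients or agents.
    agent1.sources.src1.type = avro
    agent1.sources.src1.bind = 0.0.0.0
    agent1.sources.src1.port = 4141
    agent1.sources.src1.channels = ch1

    # Memory channel: a fast in-memory buffer between source and sink.
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    # Logger sink: writes events to the agent's log, handy for smoke tests.
    agent1.sinks.sink1.type = logger
    agent1.sinks.sink1.channel = ch1

Chaining agents across the network amounts to pointing one agent's Avro sink at another agent's Avro source; fan-in and fan-out are many-to-one and one-to-many arrangements of the same pieces.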

In this tutorial, a group of people closely involved with Flume walk participants through setting up a typical data collection infrastructure using Flume. We first describe the basic architecture of Flume including its design, the transactional semantics it supports for reliability, and the sources, channels, and sinks included with the Flume core. We then move on to a brief description of common data flow architectures, and choose a typical data collection scenario for which we use Flume to do the heavy lifting. Next we come to the main body of this tutorial session, which is a walkthrough of installing, configuring, and tuning a scalable, reliable, and fault-tolerant Flume-based data collection system for storing events into a Hadoop system in real time.
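
As a rough preview of that walkthrough, here is a sketch of a typical two-tier, fan-in layout under our own assumptions: an application-tier agent tails a log file and forwards events over Avro RPC to a collector-tier agent that writes into HDFS. Hostnames, ports, paths, and component names are invented placeholders.

    # --- Application-tier agent: tail a log, forward over Avro RPC ---
    app.sources = tailSrc
    app.channels = memCh
    app.sinks = avroFwd

    app.sources.tailSrc.type = exec
    app.sources.tailSrc.command = tail -F /var/log/app/app.log
    app.sources.tailSrc.channels = memCh

    app.channels.memCh.type = memory

    app.sinks.avroFwd.type = avro
    app.sinks.avroFwd.hostname = collector01.example.com
    app.sinks.avroFwd.port = 4545
    app.sinks.avroFwd.channel = memCh

    # --- Collector-tier agent: aggregate and write into HDFS ---
    collector.sources = avroIn
    collector.channels = fileCh
    collector.sinks = hdfsOut

    collector.sources.avroIn.type = avro
    collector.sources.avroIn.bind = 0.0.0.0
    collector.sources.avroIn.port = 4545
    collector.sources.avroIn.channels = fileCh

    # File channel: buffers events on disk so they survive agent restarts.
    collector.channels.fileCh.type = file
    collector.channels.fileCh.checkpointDir = /var/flume/checkpoint
    collector.channels.fileCh.dataDirs = /var/flume/data

    collector.sinks.hdfsOut.type = hdfs
    collector.sinks.hdfsOut.channel = fileCh
    collector.sinks.hdfsOut.hdfs.path = hdfs://namenode.example.com:8020/flume/events
    collector.sinks.hdfsOut.hdfs.fileType = DataStream
    collector.sinks.hdfsOut.hdfs.rollInterval = 300

Each agent is then started with the flume-ng launcher, naming the agent whose properties it should load, e.g.:

    $ bin/flume-ng agent --conf conf --conf-file collector.properties --name collector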

Throughout this presentation we also cover: (1) how to configure Flume to store data on a secure HDFS cluster, (2) configuration options used to trade off between performance and fault tolerance, (3) Avro support, (4) Flume extension points, plugins, and hooks, (5) Flume compatibility with various versions of Hadoop, (6) performance benchmarks, and (7) general best practices for using Flume NG effectively.
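
To give a flavor of items (1) and (2): writing to a Kerberos-secured cluster is configured on the HDFS sink itself, and the main performance/fault-tolerance trade-off is the choice of channel and batch size. The property keys below are ones the Flume NG HDFS sink and channels expose; the values (and the collector/hdfsOut/fileCh names, carried over from the sketch above) are placeholders.

    # (1) Kerberos credentials for a secure HDFS cluster (placeholder values).
    collector.sinks.hdfsOut.hdfs.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
    collector.sinks.hdfsOut.hdfs.kerberosKeytab = /etc/flume/conf/flume.keytab

    # (2) Throughput vs. durability: larger batches raise throughput at the
    # cost of more events in flight per transaction. The file channel
    # persists its buffer to disk, trading some speed for recovery after a
    # crash; the memory channel is faster but loses buffered events if the
    # agent dies.
    collector.sinks.hdfsOut.hdfs.batchSize = 1000
    collector.channels.fileCh.type = file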

Hari Shreedharan

Cloudera Inc.

Hari is a Software Engineer at Cloudera, where he works on Apache Flume. Previously, Hari was a software engineer on Yahoo! Mail’s metadata indexing and query team. He holds a Master’s degree in Computer Science from Cornell University.

Will McQueen

Cloudera Inc.

Will is a software engineer at Cloudera.

Arvind Prabhakar

Cloudera

Arvind is the PMC Chair of Apache Sqoop and a committer and PPMC member of Apache Flume. A seasoned enterprise software developer, Arvind has worked at Netscape, Sun Microsystems, and Informatica, and is currently at Cloudera.

Prasad Mujumdar

Cloudera Inc.

Prasad is a Software Engineer at Cloudera. He is also a committer on Apache Flume.

Mike Percy

Cloudera

Mike Percy is a Software Engineer at Cloudera. Previously, he worked on Yahoo!’s C.O.R.E team.
