Continuously aggregating, processing and making sense of rapidly generated data from hundreds or sometimes thousands of sources can be inefficient, expensive, stressful and at times flat-out intimidating. This 3-hour hands-on tutorial will begin by talking about the various sources and types of data we can collect such as:
We will then talk about the general architecture of Apache Flume, briefly describing the usage of each of its components as well as their place within the Flume NG architecture.
Then we will walk through various Source components within Flume that can be used to capture the data.
Here we will show participants how to configure the sources to capture, analyze and filter the data in real time.
We will also show code samples on how to create custom sources to capture data from virtually any source compatible with the architecture.
We will then illustrate how to configure the channels within Flume to temporarily store the captured events from the Sources until they can be picked up by the Sinks.
We will go through the advantages and disadvantages of each channel type and how to create a custom channel of your own to suit your needs.
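The source, channel and sink roles described above can be pictured with a small conceptual model. The sketch below is not Flume code and does not use any Flume API: `MemoryChannel`, `run_source`, `run_sink` and the sample log lines are hypothetical names chosen for illustration, and a bounded in-memory queue stands in for a channel. It only shows the general idea of a source filtering events, a channel buffering them up to a capacity, and a sink draining them into a store.

```python
import queue


class MemoryChannel:
    """Bounded in-memory buffer standing in for a channel (conceptual only)."""

    def __init__(self, capacity):
        self._events = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Return False instead of blocking when the channel is full.
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            return False

    def take(self):
        # Return None when there is nothing left to deliver.
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


def run_source(lines, channel, keep):
    """Wrap each accepted line as an event and offer it to the channel."""
    rejected = []
    for line in lines:
        if not keep(line):
            continue  # filtered out before it ever reaches the channel
        event = {"headers": {}, "body": line.encode("utf-8")}
        if not channel.put(event):
            rejected.append(line)  # channel full: the event was not buffered
    return rejected


def run_sink(channel, store):
    """Drain the channel into store, a list standing in for a backend datastore."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event["body"].decode("utf-8"))
        delivered += 1
    return delivered


channel = MemoryChannel(capacity=2)
store = []
rejected = run_source(
    ["ERROR disk full", "DEBUG tick", "ERROR timeout", "ERROR retry"],
    channel,
    keep=lambda line: line.startswith("ERROR"),
)
delivered = run_sink(channel, store)
```

With a capacity of 2, the third accepted event is turned away because the sink has not yet drained the buffer. This is the kind of trade-off (speed versus capacity and durability) that the comparison of channel types is concerned with.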
Once we are done with the channels used for temporary storage of captured events, we will discuss the various sinks available within Flume.
In this tutorial we will show how the events from the channels are picked up and sent to the Sinks.
We will discuss how to configure and use a variety of Sinks including but not limited to the following:
We will also talk about how to create custom sinks to set up centralized storage with virtually any compatible backend datastore.
This section will focus on how to configure the sinks.
Once the data is in the sinks, we will discuss strategies for processing the data stored in HDFS, ElasticSearch and Neo4j.
We will then focus on how to search and query the data stored in ElasticSearch and Neo4j.
A picture is worth 1024 words.
Once the query results are retrieved, we will process and format them in a structure that simplifies presentation.
The processed data will then be visualized using D3.js, SVG and CSS.
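As a rough illustration of what formatting query results for presentation might involve, the sketch below counts hits per value of one field and emits a JSON array of label/value records, a shape that charting code can usually consume directly. The `hits` list, its field names and the `to_chart_series` helper are all assumptions made for this example; they are not the output format of ElasticSearch or Neo4j, nor the tutorial's actual code.

```python
import json
from collections import Counter


def to_chart_series(hits, field):
    """Count hits per value of `field`; return records sorted by descending count."""
    counts = Counter(hit[field] for hit in hits if field in hit)
    # Sort by count (descending), then by label, so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"label": label, "value": value} for label, value in ordered]


# Hypothetical query results; the field names are illustrative only.
hits = [
    {"host": "web-1", "level": "ERROR"},
    {"host": "web-2", "level": "WARN"},
    {"host": "web-1", "level": "ERROR"},
    {"host": "db-1"},  # no "level" field, so it is skipped
]

series = to_chart_series(hits, "level")
payload = json.dumps(series)  # JSON text ready to hand to the browser
```

Keeping the aggregation on the server side means the browser only receives the small summary it needs to draw, rather than every raw hit.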
We will also show how to stream the processed data in realtime to a modern browser using HTML5 WebSockets.
Israel Ekpo is a seasoned software engineer, computer scientist, big data enthusiast and data science practitioner. He uses and/or contributes to a variety of open source projects including but not limited to Apache Lucene, Apache Solr, ElasticSearch, Apache Flume, Mahout, Hadoop, HBase, MongoDB, CouchBase, Neo4j, and Apache Hive.
View a complete list of Strata + Hadoop World 2013 contacts