Hadoop and Hive have enabled companies to run web-scale analytics on very large enterprise data warehouses. In most cases, the data is loaded periodically in batch mode from other sources, such as databases and logs. While a MapReduce system processes large data sets efficiently this way, there is usually significant latency between the time events happen and the time the data, and therefore the business metrics derived from it, becomes available for query.
Morse is a new system built within Facebook Data Infrastructure that enables us to continuously load data from sharded MySQL databases or other data sources into the Hadoop/Hive data warehouse in real time. This not only reduces data delivery time compared with batch processing, but also saves the multiple copies of the full data set otherwise kept during intermediate processing and merging, enables real-time data analytics, and improves system reliability. HBase serves as the underlying storage for incrementally updated tables, and the data is exposed to Hive as external tables for analytic queries. Morse bridges the gap between batch processing in Hadoop and incremental updates in HBase.
This talk will cover the system architecture of Morse and its key design decisions, and discuss empirical results on the efficiency, latency, performance, and reliability of the new data ETL pipeline.
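To make the contrast with periodic batch loads concrete, here is a minimal Python sketch of the general idea of incremental loading: change events from a source are applied as keyed upserts, so the table is queryable at any moment without rewriting the full data set. This is not Morse's implementation. The names `ChangeEvent`, `IncrementalTable`, and `load`, the per-key version number, and the in-memory dictionary standing in for a keyed store such as HBase are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """One change captured from a source (hypothetical shape)."""
    key: str              # primary key of the source row
    row: Optional[dict]   # new row contents; None marks a delete
    version: int          # assumed to increase with each change to a key


class IncrementalTable:
    """In-memory stand-in for a keyed store that accepts upserts."""

    def __init__(self) -> None:
        # key -> (version, row); deleted rows are kept as tombstones
        # so that a late, older event cannot resurrect them.
        self._rows: Dict[str, Tuple[int, Optional[dict]]] = {}

    def apply(self, event: ChangeEvent) -> bool:
        """Apply one event; return False if it is stale and was ignored."""
        current = self._rows.get(event.key)
        if current is not None and current[0] >= event.version:
            return False
        self._rows[event.key] = (event.version, event.row)
        return True

    def snapshot(self) -> Dict[str, dict]:
        """Rows visible to a query right now (tombstones excluded)."""
        return {k: row for k, (_, row) in self._rows.items() if row is not None}


def load(events: Iterable[ChangeEvent]) -> IncrementalTable:
    """Apply a stream of events in arrival order."""
    table = IncrementalTable()
    for event in events:
        table.apply(event)
    return table
```

Because each event touches only its own key, the cost of keeping the table current is proportional to the number of changes rather than to the size of the table, which is the property that a batch reload lacks.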
Jun Fang spent 10+ years on the Microsoft SQL Server engine team, where he worked on both the relational and storage engines and led a development team that built various features in language, runtime, storage, and management. He later joined the Bing Platform team and worked on Cosmos, leading a team that built a large-scale distributed table storage system with transactions. Since joining Facebook in 2012, Jun has worked in Data Infrastructure, building technologies that transformed the data ETL pipeline from daily batch to incremental and real-time.
View a complete list of Strata + Hadoop World 2013 contacts