An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB

Tathagata Das (Databricks), Haoyuan Li (UC Berkeley), Ion Stoica (UC Berkeley), Reynold Xin (Databricks), Sameer Agarwal (UC Berkeley)
Hadoop & Beyond Rhinelander South
Tutorial. Please note: to attend, your registration must include Tutorials on Monday.
Average rating: 4.80 (10 ratings)
Slides:   1-PDF    2-PPTX    3-PPTX    4-PPTX    5-PPT    6-PDF 

This tutorial will introduce BDAS, the Berkeley Data Analytics Stack. BDAS is an open source, next-generation software stack being developed by the UC Berkeley AMPLab in collaboration with several leading technology companies. It aims to tackle two major challenges in data analytics – the need for lower-latency processing (e.g. streaming and interactive queries) and more complex analytics (e.g. graph and machine learning) – while staying compatible with the Hadoop stack.

In this tutorial, we will survey the following components of BDAS and show how each one can be used in real applications:

  • Apache Spark, a high-speed cluster computing system compatible with Hadoop that can run programs up to 100x faster than Hadoop MapReduce thanks to its ability to perform computations in memory. Spark provides concise, high-level APIs in Scala, Python and Java.
  • Spark Streaming, which provides highly scalable, fault-tolerant real-time processing.
  • Shark, a low-latency SQL query engine that is compatible with Apache Hive, but can run queries more than 100x faster than Hive.
  • BlinkDB, an approximate query engine that allows users to trade off latency vs accuracy.
  • Tachyon, an in-memory distributed storage system that provides an HDFS-compatible API.
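
The latency vs accuracy trade-off mentioned for BlinkDB can be illustrated with a minimal sketch in plain Python. This is not BlinkDB's API or its actual algorithm; it only assumes the general idea of answering an aggregate query from a uniform random sample and reporting an error bound. The function name `approximate_mean`, the `sample_fraction` parameter, and the synthetic `latencies_ms` data are all hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is the half-width of an
    approximate 95% confidence interval (normal approximation).
    This is an illustrative sketch, not BlinkDB's implementation.
    """
    rng = random.Random(seed)
    # Keep the sample size between 2 (needed for stdev) and len(values).
    sample_size = min(len(values), max(2, int(len(values) * sample_fraction)))
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error


# Synthetic data standing in for a large table column (hypothetical values).
data_rng = random.Random(42)
latencies_ms = [data_rng.gauss(200.0, 50.0) for _ in range(100_000)]

exact = statistics.fmean(latencies_ms)
for fraction in (0.001, 0.01, 0.1):
    estimate, half_width = approximate_mean(latencies_ms, fraction)
    print(f"fraction={fraction}: {estimate:.2f} +/- {half_width:.2f} (exact {exact:.2f})")
```

A smaller sample touches less data, so it would typically return sooner, but with a wider error bound; a larger sample narrows the bound at the cost of more work.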

Many of the components are already in use in organizations large and small, including Yahoo!, Adobe, Intel, Conviva, Ooyala, Bizo, Baidu, and Alibaba.

Tathagata Das

Databricks

Tathagata Das is a third-year Ph.D. student in the AMPLab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher at Microsoft Research Lab India.

Haoyuan Li

UC Berkeley

Haoyuan Li is a Computer Science Ph.D. candidate in the AMPLab at UC Berkeley, where he works with Prof. Scott Shenker and Prof. Ion Stoica on big data and cloud computing. He leads Tachyon, an open source memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks. He is a founding committer of Apache Spark and a co-creator of Spark Streaming. Before Berkeley, he worked at Conviva and Google, where he co-created the PFPGrowth algorithm, which is included in Apache Mahout. Haoyuan has an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

Ion Stoica

UC Berkeley

Ion Stoica is a Professor of Computer Science at UC Berkeley, where he does research on cloud computing and networked computer systems. Past work includes the Dynamic Packet State (DPS), Chord DHT, Internet Indirection Infrastructure (i3), declarative networks, replay-debugging, and multi-layer tracing in distributed systems. His current research includes resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is the recipient of a SIGCOMM Test of Time Award, the CoNEXT Rising Star Award, the PECASE Award, and the ACM doctoral dissertation award. Ion also co-founded Conviva, a startup to commercialize technologies for large scale video distribution.

Reynold Xin

Databricks

Reynold Xin is an Apache Spark committer and the lead developer for Shark and GraphX, two computation frameworks built on top of Spark. He is also a co-founder of Databricks. Before Databricks, he was pursuing a PhD focusing on large scale data systems in the UC Berkeley AMPLab.

Sameer Agarwal

UC Berkeley

Sameer Agarwal is a final-year Ph.D. student in the AMPLab at Berkeley working on large-scale approximate query processing frameworks. His research interests are at the intersection of distributed systems, databases and machine learning, and he has published over 10 articles in top-tier conferences including NSDI, EuroSys, SIGMOD, VLDB and KDD. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology and was awarded the President of India Gold Medal in 2009. He was supported by the Qualcomm Innovation Fellowship during 2012-13 and is supported by the Facebook Graduate Fellowship during 2013-14.

Comments

Reynold Xin
11/01/2013 9:50pm EDT

Hi Aniket,

Just uploaded all the slides.

Aniket Adnaik
11/01/2013 5:50pm EDT

Is it possible to get the slides for this tutorial (An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB)?

Bill Bejeck
10/28/2013 11:53am EDT

Can the slides from various presentations be made available soon?

Sophia DeMartini
10/27/2013 8:49pm EDT

Hi Saurabh – no, there will be no installations or downloads required for this tutorial. Just come and listen!

saurabh agarwal
10/27/2013 8:36pm EDT

Do we need to download any tutorial or software for the session tomorrow?

Cesar Rojas
10/27/2013 11:45am EDT

Looking forward to attending this session.

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts