An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark, Spark Streaming, and Shark - Part 1

Ion Stoica (UC Berkeley), Matei Zaharia (Databricks), Reynold Xin (Databricks), Shivaram Venkataraman (UC Berkeley), Andy Konwinski (UC Berkeley), Tathagata Das (University of California Berkeley)
Beyond Hadoop Ballroom G
Tutorial
Please note: to attend, your registration must include Tutorials on Tuesday.

This tutorial, the first of a two-part series, introduces BDAS, the Berkeley Data Analytics Stack. BDAS is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab; its current components include Spark, Shark, and Mesos. We will start with Spark, a high-speed cluster computing system that is compatible with Hadoop and can outperform it by up to 100x thanks to its ability to perform computations in memory. Spark provides concise, high-level APIs in both Scala and Java, and is in use at Foursquare, Conviva, Klout, Quantifind, and other companies. We will give an overview of the Spark architecture, typical data analytics workflows (e.g., loading data from HDFS into memory and interactively querying it), and how users are applying Spark. We will also introduce Shark, a port of Apache Hive onto Spark that is compatible with existing Hive warehouses and queries. Shark can answer HiveQL queries up to 100x faster than Hive without modification to the data or queries, and is also open source as part of BDAS.

Ion Stoica

UC Berkeley

Ion Stoica is a Professor of Computer Science at UC Berkeley, where he does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State (DPS), the Chord DHT, the Internet Indirection Infrastructure (i3), declarative networks, replay debugging, and multi-layer tracing in distributed systems. His current research includes resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is the recipient of a SIGCOMM Test of Time Award, the CoNEXT Rising Star Award, the PECASE Award, and the ACM Doctoral Dissertation Award. Ion also co-founded Conviva, a startup to commercialize technologies for large-scale video distribution.

Matei Zaharia

Databricks

Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems, networking, and algorithms for large-scale data processing. He is the lead developer of the Spark programming framework and a committer on Apache Mesos and Apache Hadoop. He received his undergraduate degree from the University of Waterloo in Canada.

Reynold Xin

Databricks

Reynold Xin is a third-year PhD student in the AMPLab at UC Berkeley. He leads the development of the Shark project, which won the Best Demo Award at SIGMOD 2012. He is also the recipient of the inaugural Best Demo Award at VLDB 2011 for his work on the CrowdDB system. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM. His interests include data management systems, distributed systems, and algorithms for large-scale data processing.

Shivaram Venkataraman

UC Berkeley

Shivaram Venkataraman is a second-year PhD student at the University of California, Berkeley, where he works with Mike Franklin and Ion Stoica in the AMPLab. His research interests are in the design of storage systems and analytics platforms for big-data applications. Before coming to Berkeley, he completed his M.S. at the University of Illinois at Urbana-Champaign.

Andy Konwinski

UC Berkeley

Andy Konwinski is a postdoc in the AMPLab at UC Berkeley focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project, which has been adopted by Twitter as its private cloud platform. He also worked with systems engineers and researchers at Google on Omega, their next-generation cluster scheduling system. More recently, he led the AMP Camp Big Data Bootcamp and has been contributing to the Spark project.

Tathagata Das

University of California Berkeley

Tathagata Das is a fourth-year Ph.D. student in the AMPLab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher at Microsoft Research Lab India.

Comments

Roy J.Swagger
07/07/2013 8:50pm PDT

I am wondering: since Shark uses an in-memory model, how does it fit big-data situations, like PB or EB? Thanks

Andy Konwinski
02/27/2013 5:04am PST

Slides are now available at http://ampcamp.berkeley.edu/amp-camp-strata-2013/

Daniel Garcia
02/26/2013 10:41pm PST

Please post the last presentation, on Spark Streaming.

Boris Klots
02/26/2013 1:27pm PST

Also wanted to ask for the preso. So the demand is building up…

Shivaram Venkataraman
02/26/2013 12:18pm PST

Yes. The slides will be available here soon.

Robert Towne
02/26/2013 11:26am PST

Will the slides be available on this page?
