Hands-on training with the newest BDAS components: Learn BlinkDB, MLbase, Spark, Spark Streaming, GraphX, and Shark

Andy Konwinski (Databricks), Sameer Agarwal (UC Berkeley), Tathagata Das (Databricks), Ameet Talwalkar (Databricks), Shivaram Venkataraman (UC Berkeley), Patrick Wendell (Databricks), Reynold Xin (Databricks), Matei Zaharia (Databricks), Joseph Gonzalez (UC Berkeley), Haoyuan Li (UC Berkeley)
Hadoop and Beyond
GA Ballroom K
Tutorial. Please note: to attend, your registration must include Tutorials on Tuesday.
Average rating: ***..
(3.10, 10 ratings)

This tutorial will provide hands-on training for BlinkDB, MLbase, Spark, GraphX, and Shark, components of the Berkeley Data Analytics Stack (BDAS). We will give each audience member access to an EC2 cluster pre-loaded with real-world datasets and walk them through hands-on exercises analyzing these data with those technologies. The exercises will cover brand-new components, including BlinkDB, a distributed query engine that delivers ultra-low-latency answers by returning approximate, error-bounded results. Another new component covered is MLbase, a platform for implementing and consuming machine learning algorithms at scale.

Additionally, we will learn to use the more mature components of the stack, including the Spark and Shark command-line interfaces for ad hoc analysis, which take advantage of Spark’s in-memory caching primitives to speed up queries by an order of magnitude. The lessons will also include Spark Streaming, the real-time component of Spark.

Andy Konwinski

Postdoc, Databricks

Andy Konwinski is a postdoc in the AMPLab at UC Berkeley focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project, which has been adopted by Twitter as its private cloud platform. He also worked with systems engineers and researchers at Google on Omega, their next-generation cluster scheduling system. More recently, he led the AMP Camp Big Data Bootcamp and has been contributing to the Spark project.

Sameer Agarwal

PhD Student, UC Berkeley

Sameer Agarwal is a Ph.D. student in the AMPLab at Berkeley working on large-scale approximate query processing frameworks. His research interests are at the intersection of distributed systems, databases and machine learning. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Guwahati and was awarded the President of India Gold Medal in 2009. He is supported by the Qualcomm Innovation Fellowship during 2012-13 and the Facebook Graduate Fellowship during 2013-14.

Tathagata Das

Lead developer of Spark Streaming and software engineer, Databricks

Tathagata Das is a third-year Ph.D. student in the AMP Lab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher in Microsoft Research Lab India.

Ameet Talwalkar

Consultant, Databricks

Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His work focuses on devising scalable machine learning algorithms, and more recently, on interdisciplinary approaches for connecting advances in machine learning to large-scale problems in science and technology. He obtained a bachelor’s degree from Yale University and a Ph.D. from the Courant Institute at New York University.

Shivaram Venkataraman

PhD student, UC Berkeley

Shivaram Venkataraman is a second-year PhD student at the University of California, Berkeley and works with Mike Franklin and Ion Stoica at the AMP Lab. His research interests are in the design of storage systems and analytics platforms for big-data applications. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign.

Patrick Wendell

UC Berkeley Graduate Student, Databricks

Patrick Wendell is a Ph.D. student working in the U.C. Berkeley AMPLab. His research focuses on large-scale data-intensive computing, and his adviser is Ion Stoica. Before working on BDAS at Berkeley, he contributed to several Hadoop projects, mostly while working at Cloudera. He holds a B.S. in Computer Science from Princeton University.

Reynold Xin

Co-founder, Databricks

Reynold Xin is a PhD student in the AMP Lab at UC Berkeley. He leads the research and development of two open source systems: Shark, an analytical SQL engine that is up to 100X faster than Apache Hive; and SparkGraph, a distributed graph computation engine. He received the Best Demo Award at SIGMOD 2012 and the Best Demo Award at VLDB 2011. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM.

Matei Zaharia

CTO, Databricks

Matei Zaharia started the Spark project at UC Berkeley and is currently CTO of Databricks. He serves as Spark’s vice president at Apache. In spring 2015, he is also beginning an assistant professor position at MIT.

Joseph Gonzalez

Postdoc, UC Berkeley

Joseph is currently a postdoc in the AMPLab at UC Berkeley and co-founder of GraphLab Inc. Joseph received his PhD from the Machine Learning Department at Carnegie Mellon University where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. He is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.

Haoyuan Li

Ph.D. Candidate, UC Berkeley

Haoyuan Li is a Computer Science Ph.D. candidate in the AMPLab at UC Berkeley, where he works with Prof. Scott Shenker and Prof. Ion Stoica on big data and cloud computing. He leads Tachyon, an open source memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks. He is a founding committer of Apache Spark and a co-creator of Spark Streaming. Before Berkeley, he worked at Conviva and Google, where he co-created the PFPGrowth algorithm, which is included in Apache Mahout. Haoyuan has an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.

Comments on this page are now closed.


02/05/2014 1:41pm PST

is there anything we should set up ahead of time to make this tutorial go more smoothly?