Hands-on with BDAS - Learn Spark, Spark Streaming and Shark via Real Data Analysis - Part 2

Matei Zaharia (Databricks), Reynold Xin (Databricks), Andy Konwinski (UC Berkeley), Tathagata Das (Databricks), Patrick Wendell (Databricks)
Beyond Hadoop, Room 204
Tutorial
Please note: to attend, your registration must include Tutorials on Tuesday.

This tutorial follows up on our previous tutorial introducing BDAS, the open-source Berkeley Data Analytics Stack. Attendees will use Spark and Shark, two key components of BDAS, to manipulate a real-world Wikipedia dataset. We will give each audience member access to a Spark/Shark cluster running on EC2 and walk them through hands-on coding examples. Attendees will learn how to use the Spark and Shark command-line interfaces to perform ad hoc analyses that take advantage of Spark’s in-memory caching primitives to speed up queries by an order of magnitude. The lessons will include practice using Spark’s Java and Scala language APIs and Shark’s SQL-like query language. Additionally, attendees will write a more complex standalone Spark program that uses a parallel machine learning algorithm (K-Means clustering) to analyze a real Wikipedia dataset.
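
For readers unfamiliar with K-Means, the following is a minimal single-machine sketch in plain Python, not the tutorial's Spark program and not the Spark API. It is written as a "map" step (assign each point to its nearest center) followed by a per-key "reduce" step (average the points of each center), which is roughly the shape a parallel version would take. The function names, the toy points, and the starting centers are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def closest(point, centers):
    """Index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def kmeans(points, centers, max_iters=20, tol=1e-9):
    """Lloyd's algorithm written as a map step plus a per-key reduce step."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # "Map": tag every point with the index of its nearest center.
        assigned = [(closest(p, centers), p) for p in points]

        # "Reduce by key": accumulate coordinate sums and counts per center.
        sums = {}
        counts = defaultdict(int)
        for idx, p in assigned:
            acc = sums.get(idx)
            sums[idx] = tuple(p) if acc is None else tuple(a + b for a, b in zip(acc, p))
            counts[idx] += 1

        # New center = mean of its points; a center with no points stays put.
        updated = [
            tuple(s / counts[i] for s in sums[i]) if i in sums else centers[i]
            for i in range(len(centers))
        ]

        # Stop once no center moves by more than the tolerance.
        moved = max(dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if moved <= tol:
            break
    return centers


# Toy 2-D data (made up for illustration, not the Wikipedia dataset).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

In a cluster setting the assignment and the per-center sums would be computed across partitions of the dataset, and keeping the points in memory between iterations is where caching would be expected to help, since every iteration rereads the same data.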

Matei Zaharia


Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems, networking, and algorithms for large-scale data processing. He is the lead developer of the Spark programming framework, and also a committer on Apache Mesos and Apache Hadoop. He got his undergraduate degree at the University of Waterloo in Canada.

Reynold Xin


Reynold Xin is a third-year PhD student in the AMPLab at UC Berkeley. He leads the development of the Shark project, which won the Best Demo Award at SIGMOD 2012. He is also the recipient of the inaugural Best Demo Award at VLDB 2011 for his work on the CrowdDB system. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM. His interests include data management systems, distributed systems, and algorithms for large-scale data processing.

Andy Konwinski

UC Berkeley

Andy Konwinski is a postdoc in the AMPLab at UC Berkeley focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project, which has been adopted by Twitter as its private cloud platform. He also worked with systems engineers and researchers at Google on Omega, their next-generation cluster scheduling system. More recently, he led the AMP Camp Big Data Bootcamp and has been contributing to the Spark project.

Tathagata Das


Tathagata Das is an Apache Spark committer and a member of the PMC. He is the lead developer behind Spark Streaming and is currently employed at Databricks. Earlier, he spent time in the AMPLab at UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.

Patrick Wendell


Patrick Wendell is a PhD student working in the UC Berkeley AMPLab. His research focuses on large-scale data-intensive computing, and his adviser is Ion Stoica. Before working on BDAS at Berkeley, he contributed to several Hadoop projects, mostly while working at Cloudera. He holds a B.S. in Computer Science from Princeton University.

