This tutorial will provide hands-on training for BlinkDB, MLbase, Spark, GraphX, and Shark, components of the Berkeley Data Analytics Stack (BDAS). We will provide each audience member access to an EC2 cluster pre-loaded with real-world datasets, and walk them through hands-on exercises analyzing these data using the aforementioned technologies. The exercises will cover brand new components, including BlinkDB, a distributed query engine that delivers ultra-low-latency answers by returning approximate results with error bounds. Another new component covered is MLbase, a platform for implementing and consuming machine learning algorithms at scale.
Additionally, we will learn to use more mature components of the stack, including the Spark and Shark command-line interfaces for ad-hoc analysis, which take advantage of Spark’s in-memory caching primitives to speed up queries by an order of magnitude. The lessons will also cover Spark Streaming, the real-time component of Spark.
Andy Konwinski is a postdoc in the AMPLab at UC Berkeley focused on large-scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project, which has been adopted by Twitter as its private cloud platform. He also worked with systems engineers and researchers at Google on Omega, their next-generation cluster scheduling system. More recently, he led the AMP Camp Big Data Bootcamp and has been contributing to the Spark project.
Sameer Agarwal is a Ph.D. student in the AMPLab at Berkeley working on large-scale approximate query processing frameworks. His research interests are at the intersection of distributed systems, databases and machine learning. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Guwahati and was awarded the President of India Gold Medal in 2009. He is supported by the Qualcomm Innovation Fellowship during 2012-13 and the Facebook Graduate Fellowship during 2013-14.
Tathagata Das is a third-year Ph.D. student in the AMPLab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher at Microsoft Research Lab India.
Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His work focuses on devising scalable machine learning algorithms, and more recently, on interdisciplinary approaches for connecting advances in machine learning to large-scale problems in science and technology. He obtained a bachelor’s degree from Yale University and a Ph.D. from the Courant Institute at New York University.
Shivaram Venkataraman is a second-year Ph.D. student at the University of California, Berkeley and works with Mike Franklin and Ion Stoica at the AMPLab. His research interests are in the design of storage systems and analytics platforms for big-data applications. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign.
Patrick Wendell is a Ph.D. student working in the UC Berkeley AMPLab. His research focuses on large-scale data-intensive computing, and his adviser is Ion Stoica. Before working on BDAS at Berkeley, he contributed to several Hadoop projects, mostly while working at Cloudera. He holds a B.S. in Computer Science from Princeton University.
Reynold Xin is a Ph.D. student in the AMPLab at UC Berkeley. He leads the research and development of two open source systems: Shark, an analytical SQL engine that is up to 100X faster than Apache Hive; and SparkGraph, a distributed graph computation engine. He is a recipient of the Best Demo Award from SIGMOD 2012 and the Best Demo Award from VLDB 2011. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM.
Matei Zaharia is an assistant professor of computer science at MIT, and the initial creator of Apache Spark. He is currently on industry leave to start Databricks, a company commercializing Spark, where he is CTO.
Joseph is currently a postdoc in the AMPLab at UC Berkeley and co-founder of GraphLab Inc. Joseph received his PhD from the Machine Learning Department at Carnegie Mellon University where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. He is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.
Haoyuan Li is a Computer Science Ph.D. candidate in the AMPLab at UC Berkeley, where he works with Prof. Scott Shenker and Prof. Ion Stoica on big data and cloud computing. He leads Tachyon, an open source memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks. He is a founding committer of Apache Spark and a co-creator of Spark Streaming. Before Berkeley, he worked at Conviva and Google, where he co-created the PFPGrowth algorithm, which is included in Apache Mahout. Haoyuan has an M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.