Faster and Smarter Big Data Analysis with BlinkDB, MLbase, GraphX, and Tachyon: New Components of the Berkeley Data Analytics Stack (BDAS)

Sameer Agarwal (UC Berkeley), Tathagata Das (University of California Berkeley), Ali Ghodsi (UC Berkeley), Ion Stoica (UC Berkeley), Ameet Talwalkar (UC Berkeley), Reynold Xin (Databricks), Matei Zaharia (Databricks), Joseph Gonzalez (UC Berkeley)
Hadoop and Beyond
GA Ballroom K
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Average rating: 4.29 (7 ratings)

This tutorial, the first of a two-part big data analysis training series, will present four new cutting-edge components of the Berkeley Data Analytics Stack (BDAS) and provide a brief introduction to the stack as a whole. Under development in the UC Berkeley AMPLab, BDAS currently contains eight components that are tightly integrated with each other and with popular components of the Hadoop ecosystem. We will start by covering Spark (a high-speed cluster computing engine), Spark Streaming (the real-time processing system built on Spark), and Shark (the SQL component on top of Spark). Then we will dive into four newly released components of BDAS. First, BlinkDB is a distributed query engine that provides ultra-low-latency answers via approximate results with bounded error. Second, MLbase is a platform for implementing and consuming machine learning algorithms at scale. Third, GraphX extends the distributed fault-tolerant collections API and interactive console of Spark with a new graph API, which leverages recent advances in graph systems (e.g., Giraph and GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale. Finally, Tachyon is a fault-tolerant distributed in-memory file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and Hadoop MapReduce.

Sameer Agarwal

PhD Student, UC Berkeley

Sameer is a PhD student in the AMPLab at UC Berkeley. He has collaborated with Microsoft researchers on RoPE, an optimizer for parallel executions that has been deployed on production clusters at Microsoft Bing. He completed his undergraduate education in the Department of Computer Science and Engineering at the Indian Institute of Technology, Guwahati in 2009 and was awarded the President of India Gold Medal.

Tathagata Das

Graduate Student, University of California Berkeley

Tathagata Das is a third-year Ph.D. student in the AMPLab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher at Microsoft Research Lab India.

Ali Ghodsi

UC Berkeley

Ali Ghodsi is an Assistant Professor at KTH/Royal Institute of Technology in Sweden and has been a Visiting Researcher at UC Berkeley since 2009. His general interests are in the broader areas of distributed systems and networking. He received his PhD in 2006 from KTH/Royal Institute of Technology in the area of Distributed Computing.

Ion Stoica

Professor, UC Berkeley

Ion Stoica is a Professor of Computer Science at UC Berkeley, where he does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State (DPS), the Chord DHT, the Internet Indirection Infrastructure (i3), declarative networks, replay debugging, and multi-layer tracing in distributed systems. His current research includes resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is the recipient of a SIGCOMM Test of Time Award, the CoNEXT Rising Star Award, the PECASE Award, and the ACM Doctoral Dissertation Award. Ion also co-founded Conviva, a startup to commercialize technologies for large-scale video distribution.

Ameet Talwalkar

UC Berkeley

Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His research focuses on devising scalable machine learning algorithms, and more recently, on interdisciplinary approaches for connecting advances in machine learning to large-scale problems in science and technology. He graduated summa cum laude from Yale University and obtained his Ph.D. at New York University. He was awarded the Janet Fabri Prize for the best doctoral dissertation in NYU’s Computer Science Department, Yale’s undergraduate prize in Computer Science, and a Westinghouse Science Talent Search Scholarship.

Reynold Xin

Co-founder, Databricks

Reynold Xin is a PhD student in the AMPLab at UC Berkeley. He leads the research and development of two open source systems: Shark, an analytical SQL engine that is up to 100x faster than Apache Hive, and SparkGraph, a distributed graph computation engine. He received Best Demo Awards at SIGMOD 2012 and VLDB 2011. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM.

Matei Zaharia

CTO, Databricks

Matei Zaharia is an assistant professor of computer science at MIT, and the initial creator of Apache Spark. He is currently on industry leave to start Databricks, a company commercializing Spark, where he is CTO.

Joseph Gonzalez

Postdoc, UC Berkeley

Joseph is a postdoc in the AMPLab at UC Berkeley and co-founder of GraphLab Inc. He received his PhD from the Machine Learning Department at Carnegie Mellon University, where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. He is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.