This tutorial, the first of a two-part big data analysis training series, presents four new cutting-edge components of the Berkeley Data Analytics Stack (BDAS) and gives a brief introduction to the stack as a whole. Under development in the UC Berkeley AMPLab, BDAS currently contains eight components that are tightly integrated with each other and with popular components of the Hadoop ecosystem. We will start by covering Spark (a high-speed cluster computing engine), Spark Streaming (the real-time processing system built on Spark), and Shark (the SQL component on top of Spark). Then we will dive into four newly released components of BDAS. First, BlinkDB is a distributed query engine that delivers ultra-low-latency answers by returning approximate results with bounded errors. Second, MLbase is a platform for implementing and consuming machine learning algorithms at scale. Third, GraphX extends the distributed, fault-tolerant collections API and interactive console of Spark with a new graph API, which leverages recent advances in graph systems (e.g., Giraph and GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale. Finally, Tachyon is a fault-tolerant distributed in-memory file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and Hadoop MapReduce.
Sameer is a PhD student in the AMPLab at UC Berkeley. He actively collaborated with Microsoft Researchers on RoPE, an optimizer for parallel executions that has been successfully deployed on production clusters at Microsoft Bing. He completed his undergraduate education in the Department of Computer Science and Engineering at the Indian Institute of Technology, Guwahati in 2009 and was awarded the prestigious President of India Gold Medal.
Tathagata Das is a third-year Ph.D. student in the AMPLab at UC Berkeley, working with Scott Shenker and Ion Stoica. He leads the development of the Spark Streaming project. His research interests include datacenter networks and frameworks for large-scale data processing. Before graduate school, he worked as an Assistant Researcher in Microsoft Research Lab India.
Ali Ghodsi is an Assistant Professor at KTH/Royal Institute of Technology in Sweden and has been a Visiting Researcher at UC Berkeley since 2009. His general interests are in the broader areas of distributed systems and networking. He received his PhD in 2006 from KTH/Royal Institute of Technology in the area of Distributed Computing.
Ion Stoica is a Professor of Computer Science at UC Berkeley, where he does research on cloud computing and networked computer systems. Past work includes the Dynamic Packet State (DPS), Chord DHT, Internet Indirection Infrastructure (i3), declarative networks, replay-debugging, and multi-layer tracing in distributed systems. His current research includes resource management and scheduling for data centers, cluster computing frameworks, and network architectures. He is the recipient of a SIGCOMM Test of Time Award, the CoNEXT Rising Star Award, the PECASE Award, and the ACM Doctoral Dissertation Award. Ion also co-founded Conviva, a startup to commercialize technologies for large-scale video distribution.
Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His research focuses on devising scalable machine learning algorithms, and more recently, on interdisciplinary approaches for connecting advances in machine learning to large-scale problems in science and technology. He graduated summa cum laude from Yale University and obtained his Ph.D. at New York University. He was awarded the Janet Fabri Prize for the best doctoral dissertation in NYU’s Computer Science Department, Yale’s undergraduate prize in Computer Science, and a Westinghouse Science Talent Search Scholarship.
Reynold Xin is a PhD student in the AMPLab at UC Berkeley. He leads the research and development of two open source systems: Shark, an analytical SQL engine that is up to 100X faster than Apache Hive; and SparkGraph, a distributed graph computation engine. He received the Best Demo Award at SIGMOD 2012 and the Best Demo Award at VLDB 2011. Before graduate school, he worked on ads infrastructure at Google and distributed databases at IBM.
Matei Zaharia is an assistant professor of computer science at MIT, and the initial creator of Apache Spark. He is currently on industry leave to start Databricks, a company commercializing Spark, where he is CTO.
Joseph is currently a postdoc in the AMPLab at UC Berkeley and a co-founder of GraphLab Inc. He received his PhD from the Machine Learning Department at Carnegie Mellon University, where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. He is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.