We will describe the BigData Top100 List initiative, a new, open, community-based effort for benchmarking big data systems. The BigData Top100 List will rank big data systems according to a well-defined, audited performance metric. The benchmark also provides an accompanying efficiency metric. With “big data” becoming a major force of innovation across enterprises of all sizes, new platforms for managing big data sets are being announced almost weekly, with ever more features. Yet there is currently no means of comparing such platforms. While the performance of traditional database systems is well understood and measured by long-established institutions such as the Transaction Processing Performance Council, there is neither a clear definition of the performance of big data systems nor a generally agreed-upon metric for comparing these systems. This session unveils a community-based effort for defining an end-to-end application-layer benchmark for big data applications, with the ability to easily adapt the benchmark specification to evolving challenges in the big data space. We actively seek community input into this process.
Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team, focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years. He has worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo! and LinkedIn. Currently, he is the Chief Scientist at Greenplum, a division of EMC.
Chaitan Baru is Distinguished Scientist and Associate Director, Data Initiatives at the San Diego Supercomputer Center, University of California San Diego, where he also directs the Center for Large-scale Data Systems Research (CLDS). Baru’s interests are in research and development in the areas of parallel database systems, scientific data management, data analytics, and the challenges of data-driven science and data-driven enterprises. Baru has played a leadership role in a number of national-scale cyberinfrastructure R&D efforts across a wide range of science disciplines from earth sciences to ecology, biomedical informatics, and healthcare. Prior to joining SDSC in 1996, Baru led one of the development teams at IBM for an early UNIX-based shared-nothing database system (DB2 Parallel Edition) and also led a team that produced the first result for an industry-standard decision support benchmark (TPC-D). Over the past year, Baru has led the effort to create a Big Data Benchmarking community, leading to the proposal to create a BigData Top100 List, borrowing benchmarking ideas from the high-performance computing, transaction processing, and database communities.