Big Data and Big Analytics: SciDB is not Hadoop

Paul Brown (Paradigm4 Inc.)
Data
Location: Sutton South

Scientists had been dealing with big data and big analytics for at least a decade before the business world realized it had the same problems and united around ‘Big Data’, the ‘Data Tsunami’, and ‘the Industrial Revolution of data’. Both the science world and the commercial world need a high-performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.

Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management, and analytical requirements of either community. They have the wrong data model (tables or files) or no data model at all. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.

SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:

• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data

• Massively scalable math – non-embarrassingly parallel operations, such as linear algebra on matrices too large to fit in memory, as well as transparently scalable R-, MATLAB-, and SAS-style analytics that require no code for data distribution or parallel computation

• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis

• Uncertainty support – data carry error bars, probability distributions, or confidence metrics that can be propagated through calculations

• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
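
To make the versioning and sparse-storage ideas above concrete, here is a minimal sketch in Python. It is not SciDB's API or implementation: the class `VersionedArray` and its methods `commit` and `read` are hypothetical names invented for illustration. It assumes a two-dimensional array stored sparsely as a mapping from coordinates to values, where each update is appended as a new version instead of overwriting earlier data.

```python
class VersionedArray:
    """Toy append-only sparse array (hypothetical, for illustration only)."""

    def __init__(self):
        # One dict per version, holding only the cells written in that version.
        # Empty cells are never stored, which keeps sparse data compact.
        self._deltas = []

    def commit(self, cells):
        """Append a new version from {(row, col): value}; return its 1-based number."""
        self._deltas.append(dict(cells))
        return len(self._deltas)

    def read(self, coord, version=None):
        """Return the value at coord as of `version` (latest if None), or None if empty."""
        if version is None:
            version = len(self._deltas)
        # Walk backwards from the requested version to the most recent write.
        for delta in reversed(self._deltas[:version]):
            if coord in delta:
                return delta[coord]
        return None


arr = VersionedArray()
v1 = arr.commit({(0, 0): 1.5, (2, 3): 4.0})  # raw data
v2 = arr.commit({(0, 0): 9.9})               # an update; version 1 is kept

latest = arr.read((0, 0))               # value as of the newest version
original = arr.read((0, 0), version=v1)  # earlier value is still readable
```

Because every earlier version stays readable, an analysis can be re-run against exactly the data it originally saw, which is the property the versioning bullet relies on for reproducibility and back-testing.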

We will sketch the design of SciDB and explain how it differs from other proposals and why that matters. We will also share some early benchmarking data and present a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.

Paul Brown

Paradigm4 Inc.

Paul Brown is the Chief Architect for Paradigm4 and SciDB, an open source array database management system designed to scale in support of very large analytic workloads. Prior to Paradigm4, Paul spent a decade working for IBM Research at the Almaden Research Center in San Jose, CA, where he focused on advanced database systems research. Before IBM, Paul worked for 15 years at a number of database companies, all distinguished by the fact that their names started with the letter ‘I’: Ingres, Illustra, and Informix. Paul is the author of several books about database technology, and of a dozen research papers over the last fifteen years covering data analysis and DBMS implementation.

