Big Analytics Beyond the Elephants

Paul Brown (Paradigm4 Inc.)
Data Science, Ballroom CD

Scientists dealt with big data and big analytics for at least a decade before the business world precipitated buzzwords like ‘Big Data’, ‘Data Tsunami’, and ‘the Industrial Revolution of data’ from the strange broth of its marketing solution and came to realize it had the same problems. The scientific and commercial worlds share requirements for a high-performance informatics platform that supports the collection, curation, collaboration, exploration, and analysis of massive datasets.

In this talk we will sketch the design of SciDB, explain how it differs from Hadoop-based systems, SQL DBMS products, and NoSQL platforms, and explain why those differences matter. We will also present benchmarking data and a computational genomics use case that showcase SciDB’s massively scalable parallel analytics.

SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:

• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data

• Massively scalable math – non-embarrassingly parallel operations, such as linear algebra on matrices too large to fit in memory, as well as transparently scalable R-, MATLAB-, and SAS-style analytics that require no code for data distribution or parallel computation

• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis

• Uncertainty support – data carry error bars, probability distributions, or confidence metrics that can be propagated through calculations

• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data
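
As a rough illustration of two of the ideas listed above (sparse array storage and updates that never overwrite earlier data), here is a minimal Python sketch. It is not SciDB's API or storage format: the class `VersionedSparseArray` and its `commit` and `get` methods are hypothetical names invented for this example, and a real system would store chunked, compressed data across many nodes rather than in-memory dictionaries.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, ...]


class VersionedSparseArray:
    """Toy n-dimensional sparse array whose updates create new versions.

    Hypothetical illustration only; this is not SciDB's API.
    """

    def __init__(self, ndim: int) -> None:
        self.ndim = ndim
        # Each version keeps only the cells written by that update (a delta).
        self._deltas: List[Dict[Coord, float]] = []

    def commit(self, cells: Dict[Coord, float]) -> int:
        """Record an update as a new version and return its 1-based number."""
        for coord in cells:
            if len(coord) != self.ndim:
                raise ValueError(f"expected {self.ndim} coordinates, got {coord}")
        self._deltas.append(dict(cells))
        return len(self._deltas)

    def get(self, coord: Coord, version: Optional[int] = None,
            default: float = 0.0) -> float:
        """Read a cell as of `version` (latest if None).

        Cells that were never written return `default`.
        """
        latest = len(self._deltas)
        upto = latest if version is None else version
        if not 0 <= upto <= latest:
            raise ValueError(f"unknown version {version}")
        # Walk backwards from the requested version to the most recent write.
        for delta in reversed(self._deltas[:upto]):
            if coord in delta:
                return delta[coord]
        return default


if __name__ == "__main__":
    arr = VersionedSparseArray(ndim=2)
    v1 = arr.commit({(0, 0): 1.5, (1000, 2000): 7.0})  # only two cells stored
    v2 = arr.commit({(0, 0): 2.5})                     # update, v1 is kept
    print(arr.get((0, 0)))              # latest value
    print(arr.get((0, 0), version=v1))  # value as of the first version
    print(arr.get((5, 5)))              # empty cell
```

Storing each update as a delta keeps earlier versions readable for re-analysis, at the cost of a backwards scan on reads. That trade-off is a choice made for this sketch, not a claim about how SciDB implements versioning.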

Paul Brown

Paradigm4 Inc.

Paul Brown is the Chief Plumber for Paradigm4 and SciDB, an open source array database management system designed to scale in support of very large analytic workloads. Prior to Paradigm4, Paul spent a decade working for IBM Research at the Almaden Research Center in San Jose, CA, where he focused on advanced database systems research. Before IBM, Paul worked for 15 years at a number of database companies, all distinguished by the fact that their names started with the letter ‘I’: Ingres, Illustra, and Informix. Paul is the author of several books about database technology and of a dozen research papers over the last fifteen years covering data analysis and DBMS implementation.
