The Workflow Abstraction

Paco Nathan (Databricks)
Hadoop in Practice
Great America Ballroom K

When we use Data Science to build products (namely, apps for business verticals), we encounter recurring patterns among the common use cases for Big Data. In practice, these patterns involve integrating multiple systems and heterogeneous frameworks. For any given app, we specify the business process and its trade-offs as a combination of many parts. That work typically spans several vendors and open source platforms, in contrast to the notion of “One Size Fits All”. In other words, we rely on blended use cases to build Big Data apps.

For example, MapReduce (e.g., Hadoop) as a compute framework is rarely, if ever, used in isolation. Hadoop-based apps tend to consume data from multiple sources: distributed file systems, key/value stores, document collections, relational databases via JDBC, S3 and other durable grids, and so on. In turn, they produce results that tend to be stored elsewhere, as in the common case of an API consuming from a cache layer.

The workflow, as an abstraction layer, generalizes to a directed acyclic graph (DAG). A DAG specifies a set of endpoints, dependencies, and transformations that tie together the many required parts and subsystems. This analysis leads toward a notion of data access patterns, akin to the design patterns used in software engineering. The resulting pattern language provides a formalism for architectural recipes, best practices, code reuse, and enterprise-scale optimizations.
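
As a rough illustration of a workflow expressed as a DAG, the sketch below wires two sources through a join and an aggregation into a cache sink, then runs the steps in dependency order. It is not taken from the talk and does not use Cascading. The step names, the in-memory `sources` and `sinks` dictionaries, and the sample records are all hypothetical stand-ins for real endpoints, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical endpoints: in-memory stand-ins for real sources and sinks.
sources = {
    "clicks": [("u1", "home"), ("u2", "home"), ("u1", "cart")],
    "users": {"u1": "US", "u2": "DE"},
}
sinks = {}

# Transformations: each receives a dict of its dependencies' results.
def load_clicks(_inputs):
    return sources["clicks"]

def load_users(_inputs):
    return sources["users"]

def join_country(inputs):
    users = inputs["load_users"]
    return [(users[uid], page) for uid, page in inputs["load_clicks"]]

def count_by_country(inputs):
    counts = {}
    for country, _page in inputs["join_country"]:
        counts[country] = counts.get(country, 0) + 1
    return counts

def write_cache(inputs):
    sinks["cache"] = inputs["count_by_country"]
    return sinks["cache"]

# The workflow: step name -> (transformation, names of steps it depends on).
workflow = {
    "load_clicks": (load_clicks, []),
    "load_users": (load_users, []),
    "join_country": (join_country, ["load_clicks", "load_users"]),
    "count_by_country": (count_by_country, ["join_country"]),
    "write_cache": (write_cache, ["count_by_country"]),
}

def run(workflow):
    """Run every step after its dependencies; a cycle raises graphlib.CycleError."""
    graph = {name: deps for name, (_fn, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = workflow[name]
        results[name] = fn({d: results[d] for d in deps})
    return results

results = run(workflow)
```

Keeping the graph as plain data, separate from the functions that do the work, is one way to let the same workflow description be inspected, validated, or handed to a different execution engine.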

This talk examines common use cases in Data Science, leading toward a set of data access patterns. Marketing funnel optimization, which is ubiquitous in e-commerce, is one such use case. The talk proposes a formalism for workflow abstraction, then reviews it in the context of a sample app based on the Cascading open source project.
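
To make the marketing funnel use case concrete, here is a minimal sketch of the kind of aggregation such an app might compute: distinct users per stage and the conversion rate from each stage to the next. It is an assumption-laden illustration, not the sample app from the talk. The stage names and the event log are invented, and the function counts distinct users per stage without checking that a user passed through the earlier stages.

```python
# Hypothetical event log of (user_id, funnel_stage) pairs; invented data.
events = [
    ("u1", "visit"), ("u2", "visit"), ("u3", "visit"), ("u4", "visit"),
    ("u1", "add_to_cart"), ("u2", "add_to_cart"),
    ("u1", "purchase"),
]

# Hypothetical funnel stages, in order from top to bottom.
stages = ["visit", "add_to_cart", "purchase"]

def funnel(events, stages):
    """Return (stage, distinct_users, conversion_from_previous_stage) tuples."""
    users_at = {stage: set() for stage in stages}
    for user, stage in events:
        if stage in users_at:  # ignore events outside the funnel
            users_at[stage].add(user)

    report = []
    prev = None
    for stage in stages:
        n = len(users_at[stage])
        if prev is None:
            rate = None            # the top stage has no previous stage
        elif prev == 0:
            rate = 0.0             # avoid dividing by zero
        else:
            rate = n / prev
        report.append((stage, n, rate))
        prev = n
    return report

report = funnel(events, stages)
```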

Paco Nathan

Data Scientist for Concurrent in SF, and a committer on the Cascading open source project. 10+ years leading innovative data teams, 25+ years in the tech industry overall. Background in math/stats and distributed computing. Expertise in Hadoop, R, AWS, predictive analytics, machine learning, and NLP.
