Schedule: Data sessions

Tyler Bell
Location: Sutton South
Tyler Bell (Factual), Leo Polovets (Factual)

Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented and heterogeneous information we call the Internet. The ~17m physical businesses in the US, for example, are found on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of things requires a combination of distributed data processing and human-based domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.

Philip Kromer
Location: Sutton South
Philip Kromer (Infochimps)

You’ve collected a ton of data and your team is busily crunching numbers and coming to conclusions… but are they the right ones? You can only know with the right context and you can’t get context working in a silo. We invite you to bring the rest of the world into your data warehouse. Don’t worry, it’ll add more value than it takes and instead of working on the data, you can work on your vision.

In this talk, we’ll allay your fears of open data, demonstrate the difference between making decisions with and without context and show you other neat things that happen when you share.

Daniel Tunkelang
Location: Sutton South
Moderated by:
Daniel Tunkelang (LinkedIn)
Panelists:
Andrew Hogue (Foursquare), Breck Baldwin (Alias-i), Evan Sandhaus (New York Times), Wlodek Zadrozny (IBM)

Structured search improves the search experience through the identification of entities and their relationships in documents and queries. This panel will explore the current state of structured and semi-structured search, as well as the open problems in an area that promises to revolutionize information seeking.

The four panelists below work on some of the world’s largest structured search problems, from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding. They work on the data, tools, and research that are driving this field. They are all excellent researchers and presenters, promising to offer an informative and engaging panel discussion, for which I will act as moderator.

Panelists:

  • Andrew Hogue is a Senior Staff Engineer and Engineering Manager in the Search Quality group at Google New York. He has worked on a wide array of projects including question answering, Google Squared, sentiment analysis, local and product search, and Google Goggles. He is interested in the areas of structured data, information extraction, and machine learning, and their applications to search and search interfaces. Prior to Google, he earned an M.Eng. and B.S. in Computer Science from MIT.
  • Breck Baldwin is the President of Alias-i, creators of the popular LingPipe computational linguistics toolkit. He received his Ph.D. in computer science in 1995 from the University of Pennsylvania. In the time between his thesis on coreference resolution and evaluation and founding Alias-i in 1999, Breck worked on DARPA-funded projects through the University of Pennsylvania.
  • Evan Sandhaus works as the Semantic Technologist in The New York Times Research and Development Labs. He is spearheading The New York Times Linked Open Data Strategy and overseeing the release of 1.8 million documents to the computer science research community. Previously, Evan helped to put The New York Times on Google Earth, collaborated with New York University to explore new directions in News Search, and worked to bring The New York Times to Facebook.
  • Wlodek Zadrozny is an IBM Researcher working on natural language applications. Most recently he worked on text sources for Watson (IBM’s Jeopardy! champion) and on applying related DeepQA technology to business problems. His previous work ranged from language processing research to product development and technical planning; in particular, he led the development of interaction systems that used speech, natural language and focused search. Wlodek Zadrozny received a Ph.D. in Mathematics from the Polish Academy of Sciences.

Moderator:

Daniel Tunkelang oversees the data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn’s members. Prior to LinkedIn, Daniel led a local search quality team at Google. Daniel was a founding employee and Chief Scientist of Endeca, a leader in enterprise search and business intelligence that pioneered the use of guided navigation in search applications. He has authored eight patents, written a textbook on faceted search, created the annual workshop on human-computer interaction and information retrieval (HCIR), and participated in the premier research conferences on information retrieval, knowledge management, databases, and data mining (SIGIR, CIKM, SIGMOD, SIAM Data Mining). Daniel holds a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.

Elizabeth Charnock
Location: Sutton South
Elizabeth Charnock (Cataphora)

Whether you believe the hype around Big Data or not, the amount of information accruing throughout large organizations grows every day. And it’s not simply a question of volume; of equal concern is the variety of data. There are emails, IMs, tweets, Facebook updates and the fastest-growing category of data: video. This variety makes it difficult to generate an apples-to-apples comparison of data from a single individual or entity. Combine this with the fact that experts think that there is no such thing as ‘clean’ data, and you have a growing problem.

This is why it is better to focus on understanding digital character. As with individuals, electronic data has ‘character.’ That character helps to disambiguate the relationship between one piece of data and another. This is particularly important because communication is more fragmented than ever, which makes relevance more difficult to ascertain.

Digital character is similar to individual character in the real world, particularly in the sense that character emerges over time. Does one embarrassing photo or comment on Facebook define an individual’s lifetime character? Can’t everyone recollect an email they wish they had never sent? Just as in the real world, digital character requires a large enough body of work to make an accurate character judgment.

Elizabeth Charnock, CEO of Cataphora and author of E-Habits, will discuss the pitfalls of Bad Data, and how it manifests itself in the interaction between a male stripper and a Harvard professor.

Richard McDougall
Location: Sutton South

This talk will address the question of how to enable a much more agile data provisioning model for business units and data scientists. We’re in a mode shift where data unlocks new growth, and almost every Fortune 1000 company is scrambling to architect a new platform to enable data to be stored, shared and analyzed for competitive advantage. Many companies are finding that this shift requires major rethinking of how systems should be architected (and scaled) to enable agile, self-service access to critical data.

In this session we’ll discuss strategies for building agile big-data clouds that make it much faster and easier for data scientists to discover, provision and analyze data. We’ll discuss where and how new technologies (both vendor and OSS) fit into this model.

We will also discuss changes in application architectures as big-data begins to play a role in online applications, incorporating many big-data techniques to deliver consumer-targeted content. This new “real-time” analytics category is growing fast and several new data systems are enabling this shift. We’ll review which players and technologies in the NoSQL community are helping drive this architecture.

Scott Nicholson
Location: Murray Hill Suite A
Scott Nicholson (LinkedIn)

Economists utilize a data analysis toolkit and intuition that can be very helpful to Data Scientists. In particular, econometric methods are quite useful in disentangling correlation and causation, a use case not well-handled by standard machine learning and statistical techniques. This session will cover examples of econometric methods in action, as well as other economics-related insights. Think of it as a crash-course in basic econometric intuition that one receives during a PhD in Economics (I received my PhD from Stanford in 2008).

Why econometrics? The difference between econometrics and statistics is that statistical modeling is more concerned with fit, and econometric modeling is more concerned with properly estimating the coefficients in a regression. Getting the “right” (consistent & unbiased) estimates means that the analyst can more effectively measure how a change in one variable can strongly predict (or cause) a change in the dependent variable. These techniques can help solve problems in social/web data that previously were only solvable using future data collection from randomized multivariate experiments.

To do this, the analyst first develops an intuition for whether or not there is a source of “endogeneity” in the regression. This largely is determined by the relationship between the predictors and the error term in the regression. Once the source of the endogeneity is understood, econometric techniques like fixed/random effects and instrumental variables can be quite useful. The type of data that is collected and available is key to the extent to which the power of these techniques can be used. [I might also go into some other techniques, but these are the most useful]
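
The endogeneity problem and the instrumental-variables remedy described above can be illustrated with a small simulation. This is only a minimal sketch on synthetic data, not the speaker's own analysis: the data-generating process, the variable names (`z`, `u`, `x`, `y`) and the true effect of 2.0 are all assumptions chosen for illustration. An unobserved factor `u` drives both the predictor and the outcome, so the predictor is correlated with the error term; the instrument `z` moves the predictor but is unrelated to `u`.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def ols_slope(x, y):
    """Slope of a one-variable least-squares regression of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Instrumental-variables estimate of the slope using one instrument z."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical data in which an unobserved factor u drives both x and y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)  # instrument: shifts x, unrelated to u
        ui = rng.gauss(0, 1)  # unobserved confounder, ends up in the error term
        xi = zi + ui + rng.gauss(0, 1)
        yi = true_effect * xi + 3.0 * ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
print("OLS estimate:", round(ols_slope(x, y), 2))    # pulled away from 2.0 by u
print("IV estimate: ", round(iv_slope(z, x, y), 2))  # close to 2.0
```

In this setup the plain regression overstates the effect because it attributes part of the confounder's influence to the predictor, while the instrument-based estimate uses only the variation in the predictor that comes from `z`.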

The methods will be presented in a way so that a non-technical person can understand the basic intuition, and also so that a practitioner can apply the methods in the future. Examples will be provided. For panel data econometrics, we will discuss the example of how to identify actions taken early on by a LinkedIn member that are predictive of their future engagement with the product, a problem that is difficult due to the confounding of correlation and causation. For instrumental variables techniques, we will discuss how to use random variation in the weather to say cool things about politics, economics, and web usage.
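
The panel-data idea mentioned above can be sketched in the same spirit. The example below is a hedged illustration on made-up data, not LinkedIn's actual model: each hypothetical member has a fixed, unobserved `trait` that raises both an early action and later engagement, and the coefficient of 1.5 is an assumption. Subtracting each member's own average (the "within" or fixed-effects transformation) removes that constant trait before the slope is estimated.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x, centering both on their overall means."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def demean_within(groups):
    """Subtract each group's own mean from its values and flatten the result."""
    out = []
    for values in groups:
        m = sum(values) / len(values)
        out.extend(v - m for v in values)
    return out


def simulate_panel(members=2000, periods=5, true_effect=1.5, seed=1):
    """Hypothetical panel: one row of repeated observations per member."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(members):
        trait = rng.gauss(0, 1)  # unobserved and constant for this member
        x_row = [trait + rng.gauss(0, 1) for _ in range(periods)]
        y_row = [true_effect * xv + 2.0 * trait + rng.gauss(0, 1) for xv in x_row]
        xs.append(x_row)
        ys.append(y_row)
    return xs, ys


xs, ys = simulate_panel()
pooled = slope([v for row in xs for v in row], [v for row in ys for v in row])
within = slope(demean_within(xs), demean_within(ys))
print("pooled OLS:", round(pooled, 2))              # inflated by the hidden trait
print("fixed effects (within):", round(within, 2))  # close to 1.5
```

The approach only works because each member is observed several times; with a single observation per member there is nothing left after demeaning, which is one reason the type of data collected matters so much.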

In addition to the discussion of applied econometric techniques, there may also be time for economics-related data insights. Currently we are developing unemployment rate prediction models using time-series econometrics as well as indexes to measure changes in the supply/demand for talent across regions and industries.

Location: Sutton South
Paul Brown (Paradigm4 Inc.)

Scientists had dealt with big data and big analytics for at least a decade before the business world came to realize it had the same problems and united around ‘Big Data’, the ‘Data Tsunami’ and ‘the Industrial Revolution of data’. Both the science world and the commercial world share requirements for a high performance informatics platform to support collection, curation, collaboration, exploration, and analysis of massive datasets.

Neither conventional relational database management systems nor Hadoop-based systems readily meet all the workflow, data management and analytical requirements desired by either community. They have the wrong data model – tables or files – or no data model. They require excessive data movement and data reformatting for advanced analytics. And they are missing key features, such as provenance.

SciDB is an emerging open source analytical database that runs on a commodity hardware grid or in the cloud. SciDB natively supports:

• An array data model – a flexible, compact, extensible data model for rich, highly dimensional data

• Massively scale math – non-embarrassingly parallel operations like linear algebra operations on matrices too large to fit in memory as well as transparently scalable R, MATLAB, and SAS style analytics without requiring code for data distribution or parallel computation

• Versioning and Provenance – Data is updated, but never overwritten. The raw data, the derived data, and the derivation are kept for reproducibility, what-if modeling, back-testing, and re-analysis

• Uncertainty support – data carry error bars, probability distribution or confidence metrics that can be propagated through calculations

• Smart storage – compact storage for both dense and sparse data that is efficient for location-based, time-series, and instrument data

We will sketch the design of SciDB and talk about how it’s different from other proposals, and why that matters. We will also put out some early benchmarking data and present a computational genomics use case that showcases SciDB’s massively scalable parallel analytics.

Monica Rogati
Location: Murray Hill Suite A
Monica Rogati (LinkedIn)

How do data infrastructure, insights and products change when your user base grows by orders of magnitude? When should you move your user-facing data product off your laptop? (hint: now!) Does your data offer insights about the world at large, or is it just mirroring your early adopters? In this talk, I will share some of the data scaling lessons we’ve learned at LinkedIn, recount war stories (and close calls!) and document the evolution of the data scientist.

Chris van der Walt
Location: Sutton North
Chris van der Walt (United Nations Global Pulse), Dane Petersen (Adaptive Path), Sara Farmer (UN Global Pulse)

Global Pulse is a United Nations innovation initiative that is developing a new approach to crisis impact monitoring. One of the key outputs of the project is HunchWorks, a place where experts can post hypotheses—or hunches—that may warrant further exploration and then crowdsource data and verification. HunchWorks will be a key global platform for rapidly detecting emerging crises and their impacts on vulnerable communities. Using it, experts will be able to quickly surface ground truth and detect anomalies in data about collective behavior for further analysis, investigation and action.

The presentation will open with an introduction by Chris van der Walt (Project Lead, Global Pulse) to the problem that HunchWorks is being designed to address: How to detect the emerging impacts of global crises in real-time? A short discussion of the design thinking behind HunchWorks will follow plus an overview of the HunchWorks feature set.

Dane Petersen (Experience Designer, Adaptive Path) will then discuss some of the complex user experience design challenges that emerged as the team started to wrestle with developing HunchWorks and the approaches used to address them.

Sara Farmer (Chief Platform Architect, Global Pulse) will follow up with a discussion of the technology powering HunchWorks, which is based on autonomy, uncertain reasoning and human-machine team theories, and is designed to allow users and automated tools to work collaboratively to reduce the uncertainty and missing data issues inherent in hunch formation and management.

The presentation will conclude with 10 minutes of Q&A from the audience.

Justin Moore
Location: Murray Hill Suite A
Peter Sirota (Amazon Web Services), Justin Moore (Facebook)

By pairing the elasticity and pay-as-you-go nature of the cloud with the flexibility and scalability of Hadoop, Amazon Elastic MapReduce has brought Big Data analytics to an even wider array of companies looking to maximize the value of their data. Each day, thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size—from University students to Fortune 50 companies—exposing the Elastic MapReduce team to an unparalleled number of use cases. In this session, we will contrast how three of these users, Amazon.com, Yelp, and Etsy, leverage the marriage of Hadoop and the cloud to drive their businesses in the face of explosive growth, including generating customer insights, powering recommendations, and managing core operations.

Ben Gimpert
Location: Sutton South
Ben Gimpert (Altos Research)

Most people in our community are accustomed to thinking of a “model” as the end result of a properly functioning big data architecture. Once you have an EC2 cluster reserved, after the database is distributed across some Hadoop nodes, and once a clever MapReduce machine learning algorithm has done its job, the system spits out a predictive model. The model hopefully allows an organization to conduct its business better.

This waterfall approach to modeling is embedded in the hiring process and technical culture of most contemporary big data organizations. When the business users sit in one room and the data scientists sit in another, we preclude one of the most important benefits of having on-demand access to big data. Models themselves are powerful exploratory tools! However, data sparsity, non-linear interactions and the resultant model’s quirks must be interpreted through the lens of domain expertise. All big data models are wrong but some are useful, to paraphrase the statistician George Box.

A data scientist working in isolation could train a predictive model with perfect in-sample accuracy, but only an understanding of how the business will use the model lets her balance the crucial bias / variance trade-off. Put more simply, applied business knowledge is how we can assume a model trained on historical data will do decently with situations we have never seen.

Models can also reveal predictors in our data we never expected. The business can learn from the automatic ranking of predictor importance with statistical entropy and multicollinearity tools. In the extreme, a surprisingly important variable that turns up during the modeling of a big data set could be the trigger of an organizational pivot. What if a movie recommendation model reveals a strange variable for predicting gross at the box office?

My presentation introduces exploratory model feedback in the context of big (training) data. I will use a real-life case study from Altos Research that forecasts a complex system: real estate prices. Rapid prototyping with Ruby and an EC2 cluster allowed us to optimize human time, but not necessarily computing cycles. I will cover how exploratory model feedback blurs the line between domain expert and data scientist, and also blurs the distinction between supervised and unsupervised learning. This is all a data ecology, in which a model of big data can surprise us and suggest its own future enhancement.

Ryan Boyd
Location: Murray Hill Suite A
Ryan Boyd (Google), Chris Schalk (Google)

Google is a Data business: over the past few years, many of the tools Google created to store, query, analyze, and visualize its data have been exposed to developers as services.

This talk will give you an overview of Google services for Data Crunchers:
  • Google Storage for developers: get your data in Google Cloud
  • BigQuery: fast interactive queries on terabytes of data
  • Prediction API: Machine Learning made easy
  • Google App Engine: platform as a service to build web apps or expose APIs
  • Visualization API: many cool visualization components
  • Fusion Tables: collaborate and visualize your data on a Map
  • Google Public Data Explorer, to expose and visualize public data
  • Services that have not been announced as of the writing of this proposal but may be available when the conference happens:-)

Justin Moore
Location: Sutton South
Justin Moore (Facebook)

Foursquare stores and processes everything from check-ins to screen views using a combination of home grown and open source tools. This talk covers an overview of our stack, highlighting specific examples of how, and why, it grew to what it is today and continues with the many ways that this infrastructure is employed.

One such example is our data-driven product development with the recently launched recommendations engine, named “Explore.” Explore recycles past check-in data into signals like venue similarity and time-sensitive popularity measures, resulting in intelligent recommendations building upon past user behavior as well as social and bookmarking features.

This talk takes a closer look at how Explore, and other features, emerged from our data analysis as well as the iterative process of monitoring and improvement that is critical for making such features a success.

Dwight Merriman
Location: Sutton North
Dwight Merriman (10gen)

When I was CTO of DoubleClick, we scaled to serve 400,000 ads/second. We developed and used many custom data stores long before “nosql” was a buzzword. Over the years, I’ve seen companies I’ve worked with struggle with both scalability and agility. Writing the first lines of MongoDB code in 2007, we drew upon these experiences building large scale, high availability, robust systems. We wanted MongoDB to be a new kind of database that tackled the challenges we were trying to solve at DoubleClick.

This session will focus on internet infrastructure scaling and also cover the history and philosophy of MongoDB.

Ken Farmer
Location: Sutton South
Ken Farmer (IBM)

While most of the focus within data science is on the rapid analysis of vast volumes of data, the hardest part of most solutions is the data acquisition, movement, transformation, and loading – the “data logistics”.

Since the early days of data mining and data warehousing (in the late 80s and early 90s) it has been understood that 90% of the effort of these projects will be spent on data acquisition, cleansing, transformation and consolidation. The challenges include:

  • undocumented source systems
  • source systems that change business rules without notice
  • source systems that cannot handle frequent extracts of data without encountering concurrency problems
  • source system constraints on languages, network connections, and products
  • the management of thousands of daily processes
  • the management of data logistics code that manages dozens of feeds
  • the rapid loading of data into the consolidated server – without impacting concurrency or creating temporary data inconsistencies

The data warehousing domain refers to data logistics as “ETL” for Extract, Transform, and Load. Some best practices and methods have developed to address these challenges, but little effort has been put into reusable patterns – more effort has gone into mostly commercial products. But in spite of a lack of formal patterns, a sense of what works and what doesn’t work has emerged – and can be read “between the lines” if someone knows what to look for.

This presentation will describe what the challenges look like when trying to deliver data insights that out of necessity span many sets of data. It will explain these in both business and technical terms, and will then proceed to address some of the common solutions – and their strengths and weaknesses.
