An introduction to the idea that access to big data in many countries, especially Argentina, is still a work in progress and somewhat politicized. Despite that, media outlets such as the newspaper La Nacion are working with developers and data visualization experts to address the lack of transparency and accountability.
Big data gives us a powerful new way to see patterns in information - but what can't we see? When does big data not tell us the whole story? This talk opens up the question of the biases we bring to big data, and how we might work beyond them.
This session is an overview of Apache Drill, another big data system inspired by a Google white paper.
An Introduction to the Berkeley Data Analytics Stack (BDAS) Featuring Spark, Spark Streaming, and Shark - Part 1 Presentation
An introduction to Spark and Shark, two components of the open-source Berkeley Data Analytics Stack (BDAS) in development at UC Berkeley. Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x. Shark is a port of Apache Hive onto Spark that is fully compatible with, and up to 100x faster than, Hive.
Crunch 40 years' worth of daily global satellite data at the push of a button, perform spatial analyses on GBs of your own GIS data and securely share the results privately or publish to 1B Google Earth users. This talk will focus on how what was once the realm of a few is now easily and intuitively accessible from the comfort of your Chrome browser.
IBM and the University of Oxford partnered in late 2012 to explore and understand how organizations have really begun leveraging big data to create competitive advantage in the marketplace. The joint study, based on a survey of more than 1100 business and IT executives, combines executive interviews and case studies to establish benchmarks that define the big data era ahead.
The promise of big data is to enable business transformation through new and powerful insights. Such transformations are mandated by the executive team, but how does that work with IT? What does big data success look like, and how do enterprises get there? Discover why big data is a business-owned problem, and how the relationship between IT and business must change.
At Strata 2012 in New York, we discussed the hazards of curbing big data inferences by defining a new category of thoughtcrime. After all, acting on thoughts might constitute a crime, but thoughts, in isolation, cannot be criminal. It's time to go deeper. Let's create and evaluate a predictive criminal model that highlights where the sensitivities lie, both technically and ethically.
Data science for consumer internet products relies on our ability to effectively analyze and understand ubiquitous computing in terms of a holistic product experience, as individuals consume and create data on mobile and desktop devices in their day-to-day lives. I'll talk about mobile data science challenges — from product development to data-driven decision making.
As big data makes inroads into all aspects of society, how governments
regard the technology will be critical for its success. If the past is
a guide, the state will embrace big data for its own uses (both good
and ill). It will recognize that its authority is threatened and lash
In this keynote, we will explore some of the challenges of big data operating in a truly global context.
Factual believes that some data problems are bigger than any one company. This talk describes how Factual combines both machines and other (human) data communities to their best effect, within the context of similar data-centric, community-driven applications.
In this talk, we present the broad data challenge and discuss potential starting points for solutions. We illustrate these approaches using data from a "meta-catalog" of over 1,000,000 open datasets that have been collected from about two hundred governments from around the world.
This talk dives into the extreme details of building recommendation platforms. It covers the end-to-end architecture and design of such a system and the various ML algorithms to be used, along with their details. It also covers solutions to commonly seen recommendation patterns and detailed use cases along with their solutions.
The Infrastructure team at StumbleUpon leverages state-of-the-art tools and technologies to build platforms that enable us to collect, categorize, organize, store, and analyze huge volumes of data. The platform is so fast and robust that it adds minimal latency to the site. Timely collection and analysis of data helps data scientists, analysts, and executives make the best decisions and validate them.
In this session we’ll first discuss our experience extending Hadoop development to new platforms & languages and then discuss our experiments and experiences building supporting developer tools and plugins for those platforms.
In this talk, Susan Etlinger will discuss how organizations are
addressing the challenges of social data--technological,
organizational and cultural--and what it can teach us on the road to
Many companies have figured out how to generate incremental value through the use of recommendation engines. As such, the underlying algorithms are considered a valuable asset. But what happens when a company’s entire business model rests on its ability to get relevant products in front of the customer? When this happens you see a massive commitment to algorithms, data, and data scientists.
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The tutorial emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
At Strata RX, we announced the release of DocGraph, the largest open named social graph data set that we know of. This data set included links between doctors who commonly team together in the Medicare dataset.
Since then, we have added tremendous depth to the data by crowdfunding the acquisition of doctor credentialing data. Come learn how healthcare works under the cover.
Everyone is looking for ways to define data as an asset that can be monetized. But data itself will never move the needle for the Fortune 1000. Data is a means to an end. The end is not just insight, or knowledge, or brief moments of wisdom (when marveling at gorgeous data visualizations). The end we seek is wise action.
Data science has created quite the movement in the data world, yet confusion between data science and analytics still remains across the enterprise. Rather than approach the subject by talking about semantic differences between the two, we will discuss the topics as they relate to solving problems, how businesses are approaching them, and what you can start doing with data science.
This session explores applications of Shneiderman’s mantra for visual data analysis (overview first, zoom and filter, then details-on-demand) as a framework in the context of three complex analytical applications at Wells Fargo: (1) Analytics process, (2) Interactive meeting facilitation and (3) Dashboard design.
In this talk I will discuss the realities of human productivity bottlenecks in data analysis, and give an overview of research and product directions for addressing this critical bottleneck in a substantive way.
How Airbnb was able to quickly spin big data into a meaningful response to Super Storm Sandy.
How software can transform human lives by bringing intelligence to wherever big data lives.
The emergence of Apache Hadoop over the past few years has required organizations
to completely rethink architectures that have been in place for decades. And with
changes in the underlying data fabric, come ripple effects, and often bottlenecks,
that impact all levels of an organization both business and technical.
Electronic discovery has transformed the way cases are litigated. Gone are the days of manual review, where litigators spent days poring over emails, messages, and documents. Today's e-discovery technologies mine through vast troves of information, looking for the needle in the proverbial haystack that will blow a case wide open.
Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing.
Visualization is a powerful way to understand data, but today building the right data set and accompanying data visualization requires sophisticated programming skills. We discuss an approach to a unified language describing both visualization and database queries. This approach could be used by both programmers and business users, accelerating data exploration and speeding time to insight.
Most stable systems rely on feedback - from central heating to industrial plants and biological organisms. This introductory talk will explain what feedback is, why it is relevant to enterprise software development, and how to apply it to some typical problems arising in business and technical situations.
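As a rough illustration of the feedback idea in the abstract above, the following minimal sketch simulates a proportional feedback loop: at each step the system measures how far it is from a target and corrects a fraction of that error. The function name, the gain value, and the numbers are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
def simulate_feedback(setpoint, initial, gain, steps):
    """Proportional feedback: each step corrects a fraction of the current error.

    All names and parameters are illustrative assumptions, not the talk's method.
    """
    value = initial
    history = [value]
    for _ in range(steps):
        error = setpoint - value   # measure the deviation from the target
        value += gain * error      # apply a correction proportional to the error
        history.append(value)
    return history


# Example: a value starting at 10.0 is steered toward a target of 20.0.
history = simulate_feedback(setpoint=20.0, initial=10.0, gain=0.5, steps=10)
print(history[-1])
```

With a gain between 0 and 1 the remaining error shrinks by a constant factor at every step, which is the stabilizing behavior the abstract attributes to feedback; a gain above 2 would make this simple loop diverge instead.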
This talk discusses the broad design considerations necessary for effective visualizations. Attendees will learn what's required for a visualization to be successful, gain insight for critically evaluating visualizations they encounter, and come away with new ways to think about the visualization design process.
How must big companies evolve in order to realize big value from big data? Investing in data, technology and data scientists is just a first step.
Hadoop and SAP HANA are taking the world by storm. SAP HANA is the fastest growing commercial database in the market, being adopted by the world’s top enterprises for real-time analytics and applications.
This hands-on tutorial teaches you how to use Hive, a high-level, data warehouse tool for Hadoop. Hive provides a SQL-like query language, HiveQL, that is easy to learn for people with prior SQL experience, making Hive attractive for data warehousing teams. Hive leverages the power of Hadoop for working with massive data sets without requiring expertise in MapReduce programming.
The excitement about Big Data stems from the results: the impact on revenue, the decrease in costs, the Big gains in competitive advantage that result from Hadoop and HBase applications. This keynote provides insights into how the combination of scale, efficiency and analytic flexibility creates the power to expand the applications for Hadoop to transform companies as well as entire industries.
Hadoop is the engine powering the Big Data era, an unstoppable force boasting
massive investments and a rich ecosystem. But this is only the beginning: Hadoop
has the potential to reach beyond Big Data and become the Foundation for Change,
catalyzing new levels of business productivity and transformation.
Hadoop will become the Foundation for Change.
C. Aaron Cois
(Carnegie Mellon University, Software Engineering Institute),
(Carnegie Mellon University, Software Engineering Institute)
In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we were able to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real time monitoring and analytics.
This session will demonstrate to attendees how easy it is to crowdsource identity theft to commit fraud and make money. We will look at which segments of the population are easy targets for large scale identity fraud. Attendees will be given methodologies to combat this type of fraud leveraging Big Data and various technologies.
Designing for human fault-tolerance leads to important conclusions on the fundamental ways data systems should be architected.
The Cloudera Impala project is for the first time making scalable parallel database technology, which is the underpinning of Google's Dremel as well as that of commercial analytic DBMSs, available to the Hadoop community.
In this talk, we'll examine compelling, real-world examples that offer a blueprint for integrating big data technologies, delivering rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.
Julia is a new mathematical programming language that is scalable, high-performance, and open source. Julia is fast, approaching and often matching the performance of C/C++, easy to learn, and designed for distributed computation. This session will demonstrate some of the special capabilities of Julia and give you the tools you need to get started using this exciting technical computing language.
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems. No programming experience is required.
Everyone wants to predict the future; fame and fortune follow those who succeed. I cover the basics of forecasting including tips, tricks, and best practices, and how forecasting differs from prediction analysis. I walk through simple examples using R and link to several resources to put you on the path to becoming the next Nostradamus.
As more industries adopt data-driven policies, people untrained in the formal analysis of data are finding themselves staring at a spreadsheet and asking what they did to deserve it. In this tutorial, two of Kaggle’s top data scientists will walk attendees through the basics of solving an analytics challenge, from defining the problem, to performing basic analysis, to visualizing the output.
Data science efforts can be derailed for many reasons. We highlight common pitfalls on planning & executing data science: the optimal organizational mindsets, the technical considerations, and what constitutes the diverse skills of practitioners. This talk is based on the upcoming Bad Data Handbook as well as a survey and analysis of a few hundred Data Science practitioners from around the world.
The majority of the world's data is now unstructured, non-English text. How can we extract useful information from it? Many of our assumptions about English do not carry over to other languages. This talk will give a high-level overview of how languages vary, what current language technologies can (and cannot) achieve, and how we can process and visualize this information at scale.
This panel will share insights on how K-16 education can benefit from developments in Big Data ecosystems.
In today’s data-driven age, healthcare is transitioning from opinion-based decisions to informed decisions based on data and analytics. Analyzing the data reveals trends and knowledge that may run contrary to our assumptions, causing a shift in ultimate decisions that in turn will better serve both patients and healthcare enterprises.
Learn how LinkedIn endorsements used data mining techniques to develop a viral social tagging and reputation system.
The majority of data we consume today is presented in lists, one-dimensional orderings that limit the user’s ability to understand context or perform strategic analyses. For unstructured data, we need to re-imagine what types of visualisations enable exploration in the way that geographic maps can.
This talk will discuss Rest Devices’ proprietary low-cost sensor technology, its use of and vision for big biometric data, and the need for design integration in all facets of product development, be it software or hardware.
Code for America fellows have been tackling not only the promise of
data in America’s cities, but the reality of the challenges, for the
past two years. In February 2013, six new fellows will be working on
our hardest problem yet: using data to unclog the criminal justice
system in Louisville and New York City. If the public sector can
innovate using data, the results benefit us all.
Hear from MailChimp’s Chief Scientist John Foreman as he dishes on dirty data and demonstrates the latest in MailChimp’s anti-abuse artificial intelligence. MailChimp sends 3 billion emails a month for their millions of users, and they can't afford to let a drop of spam go out. Learn how the company is using cutting edge NoSQL solutions and predictive models to leave the bad guys out in the cold.
Rachel Schutt, Senior Research Scientist at Johnson Research Labs, will discuss her Columbia Data Science course: her motivations for teaching it, how she designed the curriculum, how the NYC tech community was involved, and what impact, if any, she had on her students. She thought about the course as testing the hypothesis: It is possible to incubate awesome data science teams in the classroom.
Prepare for the coming zombie apocalypse or subjugation by our vampire overlords by tracking the spread of these threats and understanding the characteristics of the populations already infected, using a combination of social media analytics and classic market research cluster analysis. Learn about new methods for unpacking consumer conversations and tracking true attitudinal consumer segments.
Cloudera, the standard for Apache Hadoop in the enterprise, empowers data-driven enterprises to Ask Bigger Questions™ and get bigger answers from all their data at the speed of thought. Cloudera Enterprise, the platform for Big Data, enables organizations to easily derive business value from structured and unstructured data to achieve a significant competitive advantage.
With the growth in volume and velocity of data, businesses need a scalable solution alongside batch processing to process events on the fly and provide real time insights. In this session, we will describe how we used Storm to analyze network data to detect causes of network performance degradation.
How do you deploy real-time predictive models to production environments? This talk describes a five-stage process which begins with data distillation and ends with real-time in-database model scoring. We'll discuss the technologies used at each stage, and share some best practices for development and implementation of real-time models.
Learn how LivePerson and Zoomdata perform stream processing and visualization on mobile devices of structured site traffic and unstructured chat data in real time for business decision making. Technologies include Kafka, Storm, and d3.js for visualization on mobile devices. Byron Ellis, Data Scientist for LivePerson, will join Justin Langseth of Zoomdata to discuss and demonstrate the solution.
To kick off the Big Data for Enterprise IT Day, we present two views of big data. Is it truly something new, or just an evolution of what we have already? Join us for an interesting and entertaining talk that will help frame your thinking on big data.
Khaled El Emam
(Children's Hospital of Eastern Ontario - Research Institute & University of Ottawa)
There are often privacy and confidentiality concerns with putting sensitive personal information about employees or customers on the cloud. Secure computation methods allow the release of encrypted data while still performing complex data analytics on that encrypted data. This presentation will describe how secure analytics work and give examples of their application in the healthcare context.
In many modern web and big data applications the data arrives in a streaming fashion and needs to be processed on the fly. Due to the size of data, the computations need to be done incrementally, and hence sketches of data are used that take a small amount of memory but allow for fast updates and queries. We will present the techniques to design these sketches and provide clarifying examples.
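One well-known sketch of the kind described above is the Count-Min sketch, which estimates item frequencies in a stream using a fixed-size table. The minimal version below is an illustrative assumption about what such a structure can look like, not the presenters' implementation; the class name, the default `width` and `depth`, and the use of salted SHA-256 as the hash family are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter with fixed memory (illustrative only)."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        # depth rows of width counters; memory does not grow with the stream.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Incremental update: touch exactly one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for word in ["spark", "hadoop", "spark", "hive", "spark"]:
    sketch.add(word)
print(sketch.estimate("spark"))
```

The estimate never falls below the true count and may exceed it when items collide; widening the table reduces the overestimate, and adding rows reduces the chance of a bad estimate, at the cost of more memory and slightly slower updates.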
I will discuss how a wearable sensing platform, the Sociometric Badge, allows us to measure and analyze human behavior in the real-world, particularly in the workplace. We’ll discuss how we use the badges to recognize concepts such as persuasiveness and social support and how we have used the badges in real companies to drive organizational change and put hard numbers behind management methods.
This talk is about the emergence of a new class of analytic databases based on principles first popularized by Google Dremel. These systems have been designed with the goal of enabling real-time SQL on Hadoop, while also supporting schema-on-read, semi-structured data, and pluggable storage engines. In this talk we will explain the novel architectural features that make these goals a reality.
Privacy laws governing a company’s obligations on data collection, use, and disclosure are changing rapidly. Failing to understand how the laws affect a company’s personal data assets can result in media exposés, regulatory investigations, Congressional hearings and lawsuits. This session will provide guidance on “privacy by design” compliance and practical tips to avoid becoming a target of scrutiny.
Opposites attract, and that’s the case with Hadoop and analytic databases. Both have a role to play in your Big Data projects. This session explores the various approaches to cementing the bond between Hadoop and your analytic database, how SAP customers are integrating Hadoop into BI and advanced analytic environments, and why you’ll want to do that too.
We will describe the BigData Top100 List initiative, a new, open, community-based effort for benchmarking big data systems.
For centuries, business has been about scale. Business students are taught that economies of scale are the only long-term sustainable advantage, because with scale you can control markets, set prices, own channels, influence regulators, and so on. But thanks to software and big data, scale’s importance is waning.
In this talk, I will introduce the IPython Notebook, an open-source, web-based interactive computing environment for Python and other languages. By enabling the data scientist to build documents that combine code, text, formulas, visualizations, images and video the Notebook creates a foundation for data science that is interactive, repeatable, documented and sharable.
Just the basics: you've probably heard about data mining and think you need a PhD to do it. Clever stuff with numbers. Predictions. Clusters. Algorithms. The 9 Laws explains the why of the basic steps you can take to be successful as a data miner, and shows that this is primarily a business discipline, not a branch of computer science.
Data science can power incredible innovation, but the most important insights typically aren't known ahead of time. This makes it challenging to manage schedules, expectations, and goals.
At Decide, data science is core to our product. This talk will share lessons learned from both sides, and provide the audience with strategies to improve process and communication in their own teams.
The Victory Lab presents a secret history of modern American politics, pulling back the curtain on the tactics and strategies used by some of the era's most important figures, including Barack Obama and Mitt Romney, with iconoclastic insights into human decision-making, marketing and how analytics can put any business on the road to victory.
All is quiet on the log file front, yet the system is down. What next?
Three parts practical know-how (“here’s my toolbox”) and one part position paper (“must-haves for comprehensibility”), this talk will cover the tricks of the trade for debugging distributed systems. Motivated by experience gained diagnosing Hadoop, we’ll dig into the JVM, Linux esoterica, and outlier visualization.
Billions of mobile phones worldwide leave vast volumes of geolocated data traces on the networks of operators. We present Smart Steps, a product created by Telefonica to provide insights to retailers on footfall volumes and trends across entire countries, turning these billions of data points into information that enables businesses to make decisions such as where to open a shop or which opening times to set.
Opower, the global leader in the field of energy information and analysis, works with 80 utility companies worldwide to give families context, insights, and advice about how to save energy. With access to an unprecedented (and still growing) amount of energy data—currently drawn from 50 million US homes—Opower is uncovering unique trends in how people are using energy at home.
More than ever before, students are using the Internet to study, leaving behind a trail of valuable data. How can we leverage this data to improve education?
Learn how Neustar has expanded its data warehouse capacity and agility for data analysis, reduced costs, and enabled new data products. Discuss challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products integrating multiple big data sets.
HBase is one of the more popular open source NoSQL databases that have cropped up over the last few years. Building applications that use HBase effectively is challenging. This tutorial is geared towards teaching the basics of building applications using HBase and covers concepts that a developer should know while using HBase as a backend store for their application.
From markup languages like SVG to OpenGL based APIs like WebGL, the browser provides several ways for creating visualizations. In this talk we'll show some web based visualizations we worked on for different projects and for Twitter, and show what standards were used to create them. We'll dissect each example showing what was used not only for rendering but also for data handling and interaction.
In this talk, EA CTO Rajat Taneja will dive into the challenges and complexities facing the gaming industry, explain how to harness the power of data, and share examples of how technologies like machine learning and predictive analytics have been put in place to improve the customer experience.
Extending much of the hard work around big data, we'll focus on how we take all these powerful tools and empower organizations to drive actionable decisions and strategies from data. We'll share what we've found exploring how human psychology, collaborative dynamics, gamification, and design can be utilized not only to improve what we're doing now, but also to drive where we are going.
From politicians to marketers, everyone tries to influence. Analytics of traditional as well as social media data has made it easier to spot deliberate attempts to skew public opinion. The talk will give insights into new measurements by analyzing large events such as the London Olympics. Those measures will help to unmask increasingly sophisticated attempts at fake influence.
Microsoft keynote, featuring Dave Campbell, Vice President of Product Development for the SQL Server product suite.