Large-scale web mining

Ken Krugler (Scale Unlimited)
Data Science
Location: Ballroom E

This tutorial will teach attendees about the key aspects of scalable web mining, via six modules:

1. Introduction

- Why web data is valuable
- Key challenges to web crawling
- Realistic definitions for success

2. Focused Web Crawling

- Reducing time & cost by focusing the crawl
- Approaches to classifying and scoring pages
- Solutions for scalable web crawling

3. Structured Data Extraction

- Data mining essentials
- Structured text extraction
- Automated vs. manual extraction

4. Analyzing the Data

- Making it searchable
- Finding "interesting" text
- Machine learning with Mahout

5. Barriers to Success

- Polite crawling versus deep crawling
- Spam, splogs, honeypots and nasty webmasters
- Ajax, robots.txt and Facebook

6. Examples and Summary

- Hotel reviews
- Music pages
- SEO analysis

Ken Krugler

Scale Unlimited

Veteran developer and entrepreneur with 25+ years of experience. Founder and President of TransPac Software, a 20-year leader in internationalization, mobile devices, and search consulting. Founder and CTO of Krugle, a vertical search engine and enterprise appliance for code and technical information. Co-founder of the Bixo web mining project. Committer for the Apache Tika project. Author and speaker on vertical search and web mining.
