This tutorial teaches attendees the key aspects of scalable web mining in six modules:
1. Introduction
- Why web data is valuable
- Key challenges to web crawling
- Realistic definitions for success
2. Focused Web Crawling
- Reducing time & cost by focusing the crawl
- Approaches to classifying and scoring pages
- Solutions for scalable web crawling
3. Structured Data Extraction
- Data mining essentials
- Structured text extraction
- Automated vs. manual extraction
4. Analyzing the Data
- Making it searchable
- Finding "interesting" text
- Machine learning with Mahout
5. Barriers to Success
- Polite crawling versus deep crawling
- Spam, splogs, honeypots and nasty webmasters
- Ajax, robots.txt and Facebook
6. Examples and Summary
- Hotel reviews
- Music pages
- SEO analysis
Veteran developer and entrepreneur with 25+ years of experience. Founder and President of TransPac Software, a 20-year leader in internationalization, mobile devices, and search consulting. Founder and CTO of Krugle, a vertical search engine and enterprise appliance for code and technical information. Co-founder of the Bixo web mining project. Committer for the Apache Tika project. Author and speaker on vertical search and web mining.