When it comes to natural language processing, general-purpose APIs and generic models are often far less accurate than you need, and sometimes the APIs you need don’t exist at all. Either way, you can use “corpus bootstrapping” to create custom models and APIs. Corpus bootstrapping is a method of rapidly producing a custom corpus for training highly accurate natural language processing models. For example, suppose you want to do sentiment analysis on Spanish text, but you can only find APIs and models for English. Or you want to extract phrases that are not exactly noun phrases. Or you want to classify text, but no existing corpus has the categories you’re interested in. All of these problems can be solved by iterating your way to a custom corpus for training custom models.
This talk will cover:
Code examples will be in Python using NLTK.
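To give a rough idea of what "iterating your way to a custom corpus" might look like, here is a minimal sketch in Python with NLTK. It is not taken from the talk: it assumes a simple self-training loop, in which a classifier trained on a small hand-labeled seed labels new text and only its most confident guesses are added to the corpus. The names `bag_of_words` and `bootstrap_corpus`, the `threshold` and `max_rounds` parameters, and the tiny Spanish examples are all hypothetical and chosen for illustration.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(text):
    """Turn a text into the featureset dict that NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}


def bootstrap_corpus(seed, unlabeled, threshold=0.9, max_rounds=5):
    """Grow a labeled corpus from a small seed (hypothetical helper).

    seed      -- list of (text, label) pairs labeled by hand
    unlabeled -- list of raw texts
    Returns (corpus, pending): the grown corpus and the texts that
    never reached the confidence threshold.
    """
    corpus = list(seed)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        if not pending:
            break
        # Retrain on everything labeled so far.
        classifier = NaiveBayesClassifier.train(
            [(bag_of_words(text), label) for text, label in corpus]
        )
        still_pending = []
        for text in pending:
            dist = classifier.prob_classify(bag_of_words(text))
            label = dist.max()
            if dist.prob(label) >= threshold:
                # Confident guess: accept it into the corpus.
                corpus.append((text, label))
            else:
                still_pending.append(text)
        if len(still_pending) == len(pending):
            break  # nothing new was accepted, so stop iterating
        pending = still_pending
    return corpus, pending


if __name__ == "__main__":
    # Made-up seed data for Spanish sentiment, for illustration only.
    seed = [
        ("me encanta este lugar", "pos"),
        ("excelente comida y servicio", "pos"),
        ("odio este lugar", "neg"),
        ("servicio terrible y lento", "neg"),
    ]
    unlabeled = ["comida excelente", "lugar terrible", "abierto los lunes"]
    corpus, pending = bootstrap_corpus(seed, unlabeled, threshold=0.7)
    print(len(corpus), "labeled texts;", len(pending), "still unlabeled")
```

In practice you would probably review the accepted labels by hand between rounds, since a self-training loop can reinforce its own mistakes; the texts left in `pending` are natural candidates for manual labeling.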
Jacob is the cofounder & CTO of Weotta and the author of Python Text Processing with NLTK 2.0 Cookbook. He blogs at Streamhacker and has created both the NLTK Demos & APIs and NLTK-Trainer.