Automatic Categorization of Blog Entries
I just went through a couple of years of posts, tagging some of those written before I introduced categories to Leonardo (in particular, the ones about Alibi Phone Network). It occurred to me that this categorization could be a job for machine learning.
I've talked before about using techniques like Bayesian classification to identify the posts in an aggregator that a reader is more likely to be interested in. But it didn't occur to me until now that similar techniques could suggest which existing categories an uncategorized entry belongs in.
At some stage in the near future, I'll try a quick implementation in Leonardo and see how it goes.
Comments (2)
James Tauber on Nov. 19, 2007:
Combine my studies, hobbies and work in one activity? That sounds too sensible.
Note I'm not talking about POS tagging. Just category tagging. A naïve Bayesian classifier might work fine: if a blog entry mentions "python", there's an excellent chance I want to tag it "python". If it mentions "camera" and/or "lens", then it's probably either "photography" or "filmmaking".
I guess I'm not really even thinking of "automatic" categorization (despite the title of the post) -- more just suggesting likely categories as a usability aid.
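As a rough illustration of that kind of suggestion, here is a minimal multinomial naive Bayes sketch in Python with add-one smoothing. It is not Leonardo code: the `CategorySuggester` class, the `tokenize` helper and the six training entries are all made up for the example, and a real version would train on the entries that already have categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class CategorySuggester:
    """Hypothetical helper: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.entry_counts = Counter()            # category -> number of entries
        self.vocabulary = set()

    def train(self, text, category):
        """Record the words of one already-categorized entry."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.entry_counts[category] += 1
        self.vocabulary.update(words)

    def suggest(self, text, limit=2):
        """Return up to `limit` categories, most likely first."""
        words = tokenize(text)
        total_entries = sum(self.entry_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for category, n_entries in self.entry_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log P(category) + sum of log P(word | category)
            score = math.log(n_entries / total_entries)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return sorted(scores, key=scores.get, reverse=True)[:limit]


# Invented training entries, standing in for posts that already have categories.
suggester = CategorySuggester()
suggester.train("Wrote a python script to parse log files", "python")
suggester.train("New python decorators for the web framework", "python")
suggester.train("Bought a new lens for my camera", "photography")
suggester.train("Shooting portraits with a fast lens and natural light", "photography")
suggester.train("Camera moves and lens choices for the short film shoot", "filmmaking")
suggester.train("Editing the film and recording the score", "filmmaking")

print(suggester.suggest("a python generator trick", limit=1))
print(suggester.suggest("which camera lens should I buy", limit=2))
```

Returning a short ranked list rather than a single answer matches the "suggestion" framing: the person editing the entry still picks, the classifier just puts the likely categories first.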
Last Modified: Nov. 18, 2007
Author: James Tauber
Doug Napoleone on Nov. 19, 2007:
Ever consider working on these types of problems for a living? Just curious. It's not like the company I work for in Burlington, MA (www.nuance.com) is hiring, or my group is looking for a developer.
NOTE: for tag learning, trigrams + POS (part of speech) is a good way to go about it, as the context a word appears in can be more important than the word itself. The POS information can be key to picking out the noun in the trigram, and can be used as a filter for common words (is, the, etc.).
Granted, the easiest (but not very accurate) way to do machine-learned POS tagging is an NBC (naïve Bayesian classifier). There are public POS tag databases out there, many of them part of various NIST projects.
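For a sense of what the trigram features mentioned above look like, here is a small Python sketch. It leaves out the POS tagger entirely and uses a hand-picked stop list as a crude stand-in for the "filter for common words" step; `STOP_WORDS`, `trigrams` and `content_trigrams` are hypothetical names for this example, and treating the middle word as the one being described by its neighbours is an assumption, not something the comment specifies.

```python
import re

# Hypothetical hand-picked stop list; a POS tagger would identify these
# function words by their tags instead.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def trigrams(text):
    """Return every run of three consecutive word tokens as a tuple."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:], words[2:]))


def content_trigrams(text):
    """Keep only trigrams whose middle word is not a stop word."""
    return [t for t in trigrams(text) if t[1] not in STOP_WORDS]


print(content_trigrams("The new lens is sharp"))
# [('the', 'new', 'lens'), ('new', 'lens', 'is')]
```

Each surviving trigram could then be counted as a feature in place of a single word, so that "lens" next to "new" and "lens" next to "contact" are treated differently.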