James Tauber : Bayesian Classification of Pages on This Site

A year ago, in Automatic Categorization of Blog Entries, I talked about automatically categorizing (or at least suggesting categories for) blog posts using a Bayesian classifier.

I finally decided to give it a go, using Reverend.

To train it, all I basically did was:

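from reverend.thomas import Bayes
from leonardo.models import Page

guesser = Bayes()

# Train once per (category, page) pair, so a page with several
# categories contributes its content to each of those pools.
for page in Page.objects.all():
    for category in page.categories.all():
        guesser.train(category.term, page.content)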

Let's pick 10 random blog entries and see how it goes guessing them:

Pinax Progress III (nothing conclusive)
The Big 040 (personal 38.7%) CORRECT!
TeX for Leonardo (leonardo 42.8%) CORRECT!
Disclosure: Trying Out Google Analytics (nothing conclusive)
Demokritos 0.1.0 Released (nothing conclusive)
Programming Languages I've Learned In Order (software craftsmanship 44.8%) CORRECT!
New Site Look (nothing conclusive)
Implementing the Unicode Collation Algorithm in Python (nothing conclusive)
Leopard After Two Weeks (os x 43.5%) CORRECT!
Film Project Update: A Little More Editing (nothing conclusive)

By "nothing conclusive" I mean that the highest guess was less than 2%. It is interesting that guesses were either < 2% or were around 40% and, in the latter case, they were always correct. So at least no false positives. I wonder what the reasons for the false negatives were, though.
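The "nothing conclusive" cutoff can be sketched roughly as follows. This is only an illustration, not the code used for the run above: the function names `conclusive_guess` and `describe` are hypothetical, and it assumes the guesses arrive as a list of (category, probability) pairs with probabilities between 0 and 1, which is the shape I believe Reverend's `guess()` returns.

```python
def conclusive_guess(guesses, threshold=0.02):
    """Return the top (category, probability) pair, or None if it falls below threshold.

    `guesses` is assumed to be a list of (category, probability) pairs.
    """
    if not guesses:
        return None
    best = max(guesses, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


def describe(title, guesses, actual_categories, threshold=0.02):
    """Format one result line in the style of the list above."""
    best = conclusive_guess(guesses, threshold)
    if best is None:
        return "%s (nothing conclusive)" % title
    category, probability = best
    verdict = " CORRECT!" if category in actual_categories else ""
    return "%s (%s %.1f%%)%s" % (title, category, probability * 100, verdict)
```

Passing `threshold=0.05` instead would give the stricter 5% cutoff.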

Next I tried it against all pages (that had a category). There were 284 cases where no prediction over 5% was made. But of the 288 cases where a prediction over 5% was made, 287 were correct. In only 1 case was a wrong prediction over 5% made. And it was simply that the classifier thought the "Poincare Project" page should have been tagged "poincare project" :-)

So the precision was basically 100% (287 of 288 predictions) but the recall was only about 50% (287 of 572 categorized pages).
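For anyone who wants to check the arithmetic, here is a small sketch that derives both figures from the counts above. It assumes each categorized page counts as exactly one case, so a page with no prediction over 5% is a miss for recall; the helper name `precision_recall` is just for illustration.

```python
def precision_recall(correct, predicted, total):
    """Precision = correct / predictions made; recall = correct / all cases."""
    return correct / predicted, correct / total


no_prediction = 284   # pages where nothing scored over 5%
predictions = 288     # pages where something scored over 5%
correct = 287         # predictions that matched a real category

total = no_prediction + predictions
precision, recall = precision_recall(correct, predictions, total)

print("precision %.1f%%, recall %.1f%%" % (precision * 100, recall * 100))
# prints: precision 99.7%, recall 50.2%
```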