A year ago, in Automatic Categorization of Blog Entries, I talked about automatically categorizing (or at least suggesting categories for) blog posts using a Bayesian classifier.
I finally decided to give it a go, using Reverend.
To train it, all I basically did was:
    from reverend.thomas import Bayes
    from leonardo.models import Page

    guesser = Bayes()

    # train on every (category, content) pair of every page
    for page in Page.objects.all():
        for category in page.categories.all():
            guesser.train(category.term, page.content)
Let's pick 10 random blog entries and see how it goes guessing them:
By "nothing conclusive" I mean that the highest guess was less than 2%. It is interesting that guesses were either below 2% or around 40% and, in the latter case, they were always correct. So at least there were no false positives. I wonder what the reason for the false negatives was, though.
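To make the 2% cut-off concrete, here is a rough sketch of how a list of guesses could be reduced to either a single prediction or "nothing conclusive". It assumes the guesses arrive as (category, score) pairs with scores between 0 and 1, which is my understanding of what Reverend's guess() hands back. The helper name `best_guess`, the category names and the scores are all made up for illustration; they are not the actual results.

```python
def best_guess(guesses, threshold=0.02):
    """Return the top (category, score) pair, or None if nothing is conclusive.

    `guesses` is assumed to be a list of (category, score) pairs,
    with each score between 0 and 1.
    """
    if not guesses:
        return None
    category, score = max(guesses, key=lambda pair: pair[1])
    if score < threshold:
        return None  # highest guess is below the cut-off
    return category, score


# Made-up example data, not real classifier output.
example_conclusive = [("python", 0.41), ("django", 0.012)]
example_inconclusive = [("python", 0.015), ("django", 0.009)]

print(best_guess(example_conclusive))    # top guess is above the cut-off
print(best_guess(example_inconclusive))  # None
```

Raising the threshold trades recall for precision: fewer pages get a prediction, but the predictions that survive are more likely to be right.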
Next I tried it against all pages (that had a category). There were 284 cases where no prediction over 5% was made. But in the 288 cases where a prediction over 5% was made, in 287 cases the prediction was correct. In only 1 case was a wrong prediction over 5% made. And it was simply that the classifier thought poincare project should have been tagged "poincare project" :-)
So the precision was basically 100% (287 of 288 predictions correct) but the recall was only about 50% (287 of 572 pages).
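The arithmetic behind those two figures can be checked with a few lines. This is a simplification that assumes each page counts as one case with one expected prediction, so recall is just correct predictions over all pages that had a category; the variable names are mine.

```python
# Counts taken from the run over all categorized pages.
no_prediction = 284  # no guess over 5%
predicted = 288      # some guess over 5%
correct = 287        # guesses over 5% that were right

total = no_prediction + predicted  # 572 pages with a category

# precision: of the predictions made, how many were right
precision = correct / predicted
# recall: of all the pages, how many got a right prediction
recall = correct / total

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints "precision = 99.7%, recall = 50.2%"
```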
The original post had 6 comments I'm in the process of migrating over.