James Tauber

Bayesian Classification of Pages on This Site

A year ago, in Automatic Categorization of Blog Entries, I talked about automatically categorizing (or at least suggesting categories for) blog posts using a Bayesian classifier.

I finally decided to give it a go, using Reverend.

To train it, all I basically did was:

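from reverend.thomas import Bayes
from leonardo.models import Page

guesser = Bayes()

# Train on every (category, page) pair, so a page with several
# categories contributes its content to each of them.
for page in Page.objects.all():
    for category in page.categories.all():
        guesser.train(category.term, page.content)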

Let's pick 10 random blog entries and see how it goes guessing them:

By "nothing conclusive" I mean that the highest guess was less than 2%. It is interesting that guesses were either below 2% or around 40% and, in the latter case, they were always correct. So at least there were no false positives. I wonder what the reason for the false negatives was, though.

Next I tried it against all pages (that had a category). There were 284 cases where no prediction over 5% was made. But in the 288 cases where a prediction over 5% was made, in 287 cases the prediction was correct. In only 1 case was a wrong prediction over 5% made. And it was simply that the classifier thought poincare project should have been tagged "poincare project" :-)
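The post does not show how the tally above was computed, so the following is only a minimal sketch of one way to do it. It assumes each case is a set of actual categories plus a list of (category, score) guesses, and that only the top guess counts. The `tally` function and the sample data are hypothetical, not the code or results from this site.

```python
def tally(cases, threshold=0.05):
    """Count outcomes for (actual_categories, guesses) pairs.

    `guesses` is a list of (category, score) tuples, as a classifier
    might return; only the highest-scoring guess is considered.
    """
    counts = {"no_prediction": 0, "correct": 0, "wrong": 0}
    for actual, guesses in cases:
        # Pick the best guess, if there is one at all.
        best = max(guesses, key=lambda g: g[1], default=None)
        if best is None or best[1] <= threshold:
            counts["no_prediction"] += 1
        elif best[0] in actual:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Hypothetical data, not the real results from this site.
cases = [
    ({"python"}, [("python", 0.41), ("django", 0.01)]),
    ({"mathematics"}, [("python", 0.015)]),
    ({"this_site"}, [("python", 0.38)]),
    ({"python"}, []),
]
print(tally(cases))
```

With the real classifier, the guesses for each page would come from its own guessing call rather than from hard-coded lists.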

So the precision was basically 100% (287 of 288), but the recall was only about 50%.
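As a quick check of those figures, the arithmetic can be written out from the counts given above. This sketch assumes that every one of the 572 cases should ideally have received a correct prediction, which is what makes the 284 inconclusive cases count against recall.

```python
correct = 287        # predictions over 5% that were right
wrong = 1            # predictions over 5% that were wrong
no_prediction = 284  # cases with no prediction over 5%

# Precision: of the predictions actually made, how many were right?
precision = correct / (correct + wrong)

# Recall: of all cases, how many received a right prediction?
# (Assumes each case had a correct answer available to find.)
recall = correct / (correct + wrong + no_prediction)

print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
```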

Categories: python, this_site, mathematics

Comments (6)

nnp on Nov. 29, 2008:

That's pretty interesting. What did you use as training data? Do the posts that were inconclusive look like they don't fit in with any of the categories given?

I must play around with Reverend a bit myself when I get some free time today. It seems like a pretty cool project. It might be interesting to see how the classifiers in the NLTK[1] do in comparison.

[1] http://nltk.org/doc/api/nltk.classify-module.html

Orestis Markou on Nov. 29, 2008:

Very interesting. Keep in mind that to evaluate the performance, you have to train on different data from the data you evaluate on. E.g.:

With 100 articles, use 90 for training and the remaining 10 for evaluation.

Another approach is to train using 99 articles, use 1 for evaluation, and do that for *every* article. Then take the average of all the results.

I don't have a link ready since I did all this in uni, but I'm sure you can find something out there in the internets.
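The second approach described above is usually called leave-one-out cross-validation. A minimal sketch follows, assuming articles are (category, text) pairs with one category each. `WordCountClassifier` is a toy stand-in written for this example, not Reverend, and the sample articles are made up.

```python
from collections import Counter, defaultdict


class WordCountClassifier:
    """Toy stand-in for a real Bayesian classifier (not Reverend)."""

    def __init__(self):
        self.words = defaultdict(Counter)

    def train(self, category, text):
        self.words[category].update(text.lower().split())

    def guess(self, text):
        tokens = text.lower().split()
        scores = {
            category: sum(counts[t] for t in tokens)
            for category, counts in self.words.items()
        }
        # Highest-scoring category; None if nothing has been trained.
        return max(scores, key=scores.get) if scores else None


def leave_one_out(articles, make_classifier):
    """Train on all articles but one, test on that one, repeat for each."""
    hits = 0
    for i, (category, text) in enumerate(articles):
        classifier = make_classifier()  # fresh classifier for every fold
        for j, (other_category, other_text) in enumerate(articles):
            if j != i:
                classifier.train(other_category, other_text)
        if classifier.guess(text) == category:
            hits += 1
    return hits / len(articles)


# Hypothetical (category, text) pairs for illustration only.
articles = [
    ("python", "python django code import"),
    ("python", "python code function import"),
    ("mathematics", "proof theorem lemma algebra"),
    ("mathematics", "theorem proof geometry algebra"),
]
print(leave_one_out(articles, WordCountClassifier))
```

Building a fresh classifier for each held-out article is the important part: the article being evaluated must never have been seen during training.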

James Tauber on Nov. 29, 2008:

Orestis, you're quite right: I should run it again using your second approach.

Another thing to do is to include the title, not just the content, perhaps in a way that distinguishes words in the content from words in the title to allow it to weight the latter differently.

Simon on Nov. 29, 2008:

Maybe this is a good way to put more weight to the title?
"%s %s %s" % (page.title, page.title, page.content)

James Tauber on Nov. 29, 2008:

I wasn't going to explicitly give it more weight, just allow it to distinguish. I was thinking of prefixing each title word, giving title:Bayesian title:Classification title:of title:Pages title:on title:This title:Site or similar.
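The prefixing idea above might look something like this. It is only a sketch: `tag_title_words` is a hypothetical helper, and it assumes the classifier is simply handed the resulting string in place of the page content.

```python
def tag_title_words(title, content, prefix="title:"):
    """Return training text in which title words carry a distinguishing prefix."""
    tagged = [prefix + word for word in title.split()]
    return " ".join(tagged) + " " + content


text = tag_title_words(
    "Bayesian Classification of Pages on This Site",
    "I finally decided to give it a go",
)
print(text)
```

Whether the prefix survives depends on how the classifier tokenizes its input. A tokenizer that splits on punctuation would separate `title:` from the word, so a different separator or a custom tokenizer may be needed.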

Simon on Nov. 29, 2008:

Ha silly me! Yes that would make more sense. I guess I have been focusing on search too much lately.

Are you leaving the case as is on purpose? You may want to make it all lowercase. You're not detecting spam, so whether a page is using CamelCase or not probably shouldn't determine which category it goes into.

Created: Nov. 29, 2008
Last Modified: Nov. 29, 2008
Author: James Tauber