James Tauber : James Tauber's Blog 2005/01


Leonardo 0.5 Beta 1 Released

I've released version 0.5b1 of my Python blog/wiki/CMS software Leonardo. It's available at: http://jtauber.com/2005/leonardo/leonardo-0.5b1.tgz

There will be a handful more minor enhancements before a release candidate, but I'm keen to get 0.5 wrapped up so I can start on trackbacks, comments and categories, which will be the main themes of 0.6.

by : Created on Jan. 31, 2005 : Last modified Feb. 8, 2005 : (permalink)

Film Project Update: Festival Announcements Soon

According to Matt Dentler's Blog, SxSW will be announcing the short film program on February 14th so I guess I'll know by then at the latest whether Alibi Phone Network got in.

The other 17 festivals we've submitted to so far will be sending out notifications at various times over the next few months.

No acceptances yet but no rejections yet either so there's hope.

by : Created on Jan. 30, 2005 : Last modified Feb. 8, 2005 : (permalink)

jtauber, jtauberer, jtauberest

Recently a Technorati search for "Tauber" I subscribe to came up with various references to a Joshua Tauberer who won the Technorati Developers Contest for GovTrack.us.

Well, I discovered from Language Log that Joshua Tauberer is a doctoral student in linguistics at UPenn.

We actually have common interests within linguistics: formal syntax, dependency grammar, etc.

I got in contact with Josh and he pointed out how amusing it would be if we co-authored a paper. It's not entirely out of the question.

by : Created on Jan. 29, 2005 : Last modified Feb. 8, 2005 : (permalink)

Poincare Project: More on Separation Axioms

We previously defined types of topological spaces called T_0, T_1 and T_2 spaces. I've tried below to capture the distinction between them informally with a diagram.

[Diagram: differences between T_0, T_1 and T_2 spaces]

Recall that a topological space is:

T_0 iff for every pair of distinct points x, y there is an open set U such that either x is in U but y is not or y is in U but x is not.
T_1 iff for every pair of distinct points x, y there are open sets U and V such that x is in U but y is not and y is in V but x is not.
T_2 iff for every pair of distinct points x, y there are open sets U and V such that x is in U, y is in V and U and V are disjoint.
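For a finite space these three conditions can be checked by brute force. The following is only a sketch: the function names and the representation (a list of points plus a list of frozensets for the open sets) are my own assumptions for illustration, and the code does not verify that the open sets actually form a topology.

```python
from itertools import combinations


def is_t0(points, opens):
    """Some open set contains exactly one of x, y, for every distinct pair."""
    return all(
        any((x in u) != (y in u) for u in opens)
        for x, y in combinations(points, 2)
    )


def is_t1(points, opens):
    """Each of x, y lies in an open set that misses the other."""
    return all(
        any(x in u and y not in u for u in opens)
        and any(y in v and x not in v for v in opens)
        for x, y in combinations(points, 2)
    )


def is_t2(points, opens):
    """x and y lie in disjoint open sets U and V."""
    return all(
        any(x in u and y in v and not (u & v) for u in opens for v in opens)
        for x, y in combinations(points, 2)
    )


# Illustrative two-point spaces (assumed examples).
two_points = ["a", "b"]
nested = [frozenset(), frozenset("a"), frozenset("ab")]
discrete = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]

print(is_t0(two_points, nested), is_t1(two_points, nested), is_t2(two_points, nested))
print(is_t0(two_points, discrete), is_t1(two_points, discrete), is_t2(two_points, discrete))
```

One caveat about experimenting this way: a finite T_1 space is always discrete, and so automatically T_2, which means a finite example can never show the gap between T_1 and T_2.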

UPDATE: next post

by : Created on Jan. 29, 2005 : Last modified Feb. 8, 2005 : (permalink)

Happy Birthday Denise Tauber

Today is my mother's 56th birthday.

She taught me about family and about love. She taught me to listen to my conscience. She has always been there for me. Over the last few years, she's become a really good friend.

I love you mum. Happy Birthday!

by : Created on Jan. 27, 2005 : Last modified Feb. 8, 2005 : (permalink)

BetaCode to Unicode in Python

BetaCode is a common ASCII transcription for Polytonic Greek. I've been dealing with it for around twelve years. (As an aside, back in 1994, I designed a METAFONT for Polytonic Greek that enabled one to use BetaCode in TeX—I typeset my self-published Index to the Greek New Testament with it).

For the last six years, my preference has been to use Unicode, so I wrote a program (initially in Java but then in Python) that used a Trie to represent the multiple BetaCode characters that can map to a single pre-composed Unicode character.
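To illustrate the idea (this is not the code from beta2unicode.py), here is a minimal sketch of a trie-driven, longest-match converter. The class and function names are made up for this example, and the five lowercase conversion pairs are just a tiny illustrative subset of BetaCode.

```python
class Trie:
    """Maps key strings to values and supports longest-prefix lookup."""

    def __init__(self):
        self.root = {}

    def add(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = value  # the None key marks "a complete key ends here"

    def longest_match(self, text, start):
        """Return (value, end_index) for the longest key at text[start:], or None."""
        node = self.root
        best = None
        i = start
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                best = (node[None], i)
        return best


def convert(text, trie):
    """Replace every longest match; pass unknown characters through unchanged."""
    out = []
    i = 0
    while i < len(text):
        match = trie.longest_match(text, i)
        if match is None:
            out.append(text[i])
            i += 1
        else:
            value, i = match
            out.append(value)
    return "".join(out)


# Illustrative subset only: alpha, alpha + smooth breathing,
# alpha + smooth breathing + acute, lambda, phi.
trie = Trie()
for beta, uni in [
    ("a", "\u03b1"),
    ("a)", "\u1f00"),
    ("a)/", "\u1f04"),
    ("l", "\u03bb"),
    ("f", "\u03c6"),
]:
    trie.add(beta, uni)

print(convert("a)/lfa", trie))
```

The point of the trie is that "a", "a)" and "a)/" share a path, so the converter can keep consuming diacritic characters and still fall back to the longest key that actually completed.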

I've had a version available on this site since 2002, but I've now updated it to what I've been using for my most recent work. You can download it at http://jtauber.com/2004/11/beta2unicode.py

At some stage I'll better factor out the conversion pairs so the code is useful for other conversions. The Trie code might be useful for other contexts too.

(Also see Ricoblog's Converting Greek Beta Code into Normalized Unicode.)

by : Created on Jan. 27, 2005 : Last modified Feb. 8, 2005 : (permalink)

Poincare Project: Separation Axioms

The definition of topological spaces is very general and allows for some rather unusual spaces that have properties quite different from R^n. Put another way, there are some intuitive properties one might expect of a topological space which turn out to not necessarily be true from definition.

For example, there is nothing in the definition which says that two points can't be in exactly all the same open sets. However, two points in exactly all the same open sets are topologically indistinguishable from one another. Topologically, they are the same point.

An example is the space {a,b,c} with the topology {{}, {a,b}, {a,b,c}} where there is no topological distinction between a and b.
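For a finite space, this kind of indistinguishability can be found mechanically by grouping points according to the open sets that contain them. The sketch below uses hypothetical function names and represents open sets as frozensets.

```python
def open_sets_containing(point, opens):
    """The collection of open sets that contain the given point."""
    return frozenset(u for u in opens if point in u)


def indistinguishable_classes(points, opens):
    """Group points that lie in exactly the same open sets."""
    classes = {}
    for p in points:
        classes.setdefault(open_sets_containing(p, opens), []).append(p)
    return sorted(sorted(group) for group in classes.values())


points = ["a", "b", "c"]
opens = [frozenset(), frozenset("ab"), frozenset("abc")]
print(indistinguishable_classes(points, opens))
```

Any group with more than one member is a set of points the topology cannot tell apart.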

Even though the definition allows it, we will be restricting ourselves to topological spaces where every pair of distinct points is topologically distinct. In other words, for any two distinct points, there is an open set containing one but not the other. Such spaces are called T_0 spaces.

Furthermore, we will be dealing with spaces such that, for any two points, each is in an open set that the other one isn't. Such spaces are called T_1 spaces. The definition sounds very similar to T_0 but it is slightly more restrictive. T_0 only requires that one of the points is in an open set the other one isn't. T_1 requires this to be true of both points in the pair. Clearly all T_1 spaces are T_0 spaces.

A space that is T_0 but not T_1: {a,b} with the topology {{}, {a}, {a,b}}.

A further restriction turns out to be necessary in order to guarantee some of the intuitive characteristics of things like the real numbers.

In a T_1 space, we require that for any two distinct points x and y, x is in an open set that y isn't (let's call it U) and y is in an open set that x isn't (let's call it V). There is nothing that says U and V are disjoint. They can intersect (as long as neither x nor y is in that intersection).

If a disjoint U and V exist for each pair of points, then we have what is called a T_2 space.

It may seem an arbitrary restriction to go from T_1 to T_2, but this additional requirement turns out to be necessary for a space to admit a metric, and it guarantees that a sequence has at most one limit.

The additional axioms defined for T_0, T_1 and T_2 spaces say something about how separated the points have to be. For this reason they are referred to as separation axioms. There are more (T_3, T_4, etc.) but, for our purposes, it is the T_2 axiom (and the T_1 and T_0 axioms that it implies) that is important.

T_2 spaces are also called Hausdorff spaces. From this point, pretty much all the topological spaces we deal with will be T_2/Hausdorff spaces. One cute way to remember the meaning of Hausdorff spaces is to think that, for any pair, each point is "housed off" from the other (i.e. is in an open set disjoint from at least one of the open sets the other point is in).

UPDATE: next post

by : Created on Jan. 27, 2005 : Last modified Aug. 10, 2007 : (permalink)

Upgraded Leonardo on this site

I've upgraded Leonardo on this site to an early beta of 0.5. Apologies if I've inadvertently broken anything.

New features that impact you, the reader:

new blog calendar on left (thanks to Bryan Lawrence)
blog page and atom feed only show latest 20 entries (thanks to Min Sik Kim)
feed supports 304 Not Modified

The last two should greatly reduce my bandwidth usage (and help yours too).
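For anyone curious how the 304 part works, here is a rough sketch of the decision a server makes on a conditional GET. It is not Leonardo's code: the function name is hypothetical, it is written for Python 3, and it only looks at the If-Modified-Since header (not ETags).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def feed_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, else 200."""
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop microseconds.
            if last_modified.replace(microsecond=0) <= since:
                return 304
    return 200


# Hypothetical timestamp for the newest entry in the feed.
newest = datetime(2005, 1, 23, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(newest, usegmt=True)  # what a client would echo back

print(feed_status(newest, header))
print(feed_status(newest))
```

When the answer is 304 the server sends headers only, which is where the bandwidth saving comes from.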

by : Created on Jan. 23, 2005 : Last modified Feb. 8, 2005 : (permalink)

Poincare Project: More on Compactness

Previously, we defined what it means for a topological space to be compact. The definition ("every open covering has a finite subcovering") is precise but hard to get an intuitive understanding of (well, it was hard for me).

I found it helpful to have some examples of well-known spaces and whether they are compact or not:

the real numbers (under the order topology) is NOT compact.
any open interval of the real numbers is NOT compact.
any closed, bounded interval of the real numbers, such as [0, 1], IS compact.
a circle IS compact.
a sphere IS compact.
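To see the definition at work on one of the non-compact examples, here is a standard open covering of the open interval (0, 1) that has no finite subcovering. The choice of covering is mine, offered as one worked illustration.

```latex
U_n = \left(\tfrac{1}{n},\, 1\right) \quad (n = 2, 3, 4, \dots),
\qquad \bigcup_{n \ge 2} U_n = (0, 1).

\text{For any finite subfamily } U_{n_1}, \dots, U_{n_k} \text{ with } N = \max_i n_i :
\qquad \bigcup_{i=1}^{k} U_{n_i} = \left(\tfrac{1}{N},\, 1\right) \not\ni \tfrac{1}{N+1}.
```

Every finite subfamily misses a point near 0, so no finite subcovering exists. Adding the endpoints changes the picture: any open covering of [0, 1] must include a set containing 0, and that set swallows all the small points at once.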

There is an informal sense in which non-compact spaces keep on going, whereas compact spaces stop (or return you to where you started).

Within the context of the Poincaré Conjecture, we will largely be narrowing the spaces we are interested in to those that are compact.

UPDATE: next post

by : Created on Jan. 22, 2005 : Last modified Aug. 9, 2007 : (permalink)

Tim on Tags

Tim Bray has a great post on tags. Some of the topics:

technorati tags versus using dc:subject
URIs for tags and rolling your own versus using someone else's
tagging the html page (with link) versus other techniques which will include the tag in the feed entry
the value of specifying a scheme to ground the meaning of a category / tag

plus some great closing questions, none of which I have answers for myself, except maybe why I think I need categories in Leonardo.

Here are two user stories I wrote on the Leonardo mailing list:

Story #1: Albert occasionally says some good things about FOO on his blog and Planet FOO is interested in aggregating them. However, they don't want to aggregate Albert's non-FOO posts so they'd like a feed just of his FOO topics.

Story #2: Betty is working on a project and provides updates on her blog. She'd like to have a page that just contains the entries relating to this project.

by : Created on Jan. 20, 2005 : Last modified Feb. 8, 2005 : (permalink)

Comparison with A-List Blogger

Not sure if he would call himself an A-List Blogger, but Jeremy Wright's blog is certainly right up there. He's just published some stats so I thought it would be interesting to do a comparison between him and someone whose blog is a little further down the Long Tail (me!)

PubSub Rank: 441 versus 43,827

Technorati Rank: 1,458 versus 34,528

Monthly visitors: 208,000 versus either 8,000 (unique IP) or 36,000 (visits)

Monthly pageviews: 510,000 versus 92,000

But here is an interesting one that I looked up directly:

Bloglines subscribers: 207 versus 144

Which raises the interesting question: why does Jeremy have so few Bloglines subscribers given his other stats?

by : Created on Jan. 20, 2005 : Last modified Feb. 8, 2005 : (permalink)

Nerd God

Okay, so according to this test I'm a Nerd God with a score of 95.

by : Created on Jan. 19, 2005 : Last modified Feb. 8, 2005 : (permalink)

DATR, MorphGNT, RDF and Python

I've been revisiting DATR, the lexical knowledge representation language, as a possible format for the next generation of MorphGNT. I was previously considering developing my own RDF/graph-based format but I suddenly remembered DATR from my student days and it makes a lot more sense to use it rather than try to build my own.

Looking at DATR material, I haven't seen anything more recent than 1998 so I'm not sure if it's still the state-of-the-art. It's a natural fit for some kind of RDFization, something I'm sure I'll eventually end up doing if someone hasn't already.

Of course, I'll have to write Python code to manipulate DATR. Again, unless some already exists. But I'm almost hoping not as I love implementing specs, especially using test-driven development.

UPDATE 2005-04-19: Now see DATR in Python

by : Created on Jan. 19, 2005 : Last modified April 19, 2005 : (permalink)

More on Lost CGI Environment Variables in Python 2.4

I previously mentioned problems a user was having running Leonardo under Python 2.4 on Windows and that I'd narrowed it down to CGIHTTPServer and the environment not getting populated.

Looks like others have had the same problem.

Pierre-André Côté pointed me to this bug report and fix at SourceForge and Markus Schramm suggested subclassing CGIHTTPServer with this workaround:

# There seems to be a bug in Python 2.4.0, that I could reproduce under
# Win98SE and WinXP Home SP1a (Python 2.3.4 works OK for both systems).
#
# The CGI variables are set inside the current process. Normally the
# CGI script will be executed in a new subprocess, but without this
# workaround the variables are not accessible there.
#
# Expected reason (some additional tests done):
# os.popen3(..) and os.popen4(..) do not correctly pass the modified
# os.environ to the new subprocess (Windows platforms only).
#
# Workaround:
# Redefine some class variable values to force a fallback mode that
# executes the CGI script in the current process.
#
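# Note: this belongs in the body of the request handler subclass and needs
# "import sys". startswith() avoids also matching 'darwin'.
if sys.platform.startswith('win') and sys.version_info >= (2, 4):
    have_fork = have_popen2 = have_popen3 = False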

Markus also suggested the following override (unrelated to the lost environment problem):

# Overridden to avoid calling socket.getfqdn(host), which doesn't work on
# all machines and can then be very slow (several seconds).
# Note: Merely used for logging to the console.
#
def address_string(self):
    return '%s:%s' % self.client_address[:2]

Thanks also to Jeff de Wet.

by : Created on Jan. 11, 2005 : Last modified Feb. 8, 2005 : (permalink)

Learning a Language with Pimsleur

I've just finished the Pimsleur Italian I course. I cannot recommend the Pimsleur approach to language learning highly enough if you are learning on your own. It is expensive, but having completed my first, I think it's money well spent (and I've already bought Italian II).

I thought I'd share some tips I've picked up along the way. I should note that, although you can get through it in 30 days, it took me much longer due to various false starts. Which leads me to my first tip:

Tip #1

If you miss doing it for more than a few days, consider starting again. At least go back five to ten lessons. At first I couldn't bring myself to do it but then I reminded myself that the objective was to learn Italian, not get through the CDs in record time.

Tip #2

If possible, do one full lesson a day (30 minutes, although in the case of Italian I it drops to 25 after Lesson 9 if you postpone the reading as I did). If you don't have 30 minutes a day, try overlapping over two days, e.g. 0:00-20:00 the first day, 10:00-30:00 the second day.

Tip #3

Never pause the CD to give you more time to answer. Much better to get used to thinking on your feet. If that's too hard, see Tip #4.

Tip #4

If you are having trouble with a lesson, go back and repeat the previous one. I found this incredibly useful and it enabled me to get through difficult patches (which I did find came every 5 lessons or so).

by : Created on Jan. 11, 2005 : Last modified Feb. 8, 2005 : (permalink)

Petals Around the Rose

I'd heard about the dice-based brain teaser Petals Around the Rose but didn't read the details until tonight when I followed Bob Congdon's link to this page.

At the outset of reading the latter, I decided I would try to work it out myself and not cheat by Googling the answer. I had a couple of hypotheses but none of them worked past one or two of the sample rolls given.

By the end of the page, I hadn't worked it out. Bob Congdon had suggested "the less you think about it the easier it is to solve" so I stopped thinking about it altogether and started getting ready for bed.

Then it all of a sudden hit me! I rushed back to the computer and tested my hypothesis. I was right every time!

Go read the article for yourself. Even if you don't get it right away, stop thinking and let it come to you. Nothing beats working it out yourself.

by : Created on Jan. 11, 2005 : Last modified Feb. 8, 2005 : (permalink)

Lost CGI Environment Variables in Python 2.4

Did something change between 2.3 and 2.4 that would affect os.environ being populated with CGI variables when using the BaseHTTPServer/CGIHTTPServer?

I received a bug report from someone trying to run Leonardo under Python 2.4 on Windows 2000. I was able to reproduce the problem under Python 2.4 on Windows XP Pro and confirmed that it worked fine under Python 2.3.

Simply printing out os.environ at the point things like PATH_INFO are extracted by Leonardo revealed that os.environ contained no CGI-related variables when run under Python 2.4 but did contain them under 2.3.

I can't see in the code for the http server modules that anything changed in this area. Am I missing something?
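One way to isolate this kind of problem is to test, outside the web server entirely, whether a variable set by the parent process is visible in a child process. The sketch below is written for present-day Python 3 and uses the subprocess module (itself new in Python 2.4) with an explicit env argument; the helper name is hypothetical and this is a diagnostic aid, not a fix for CGIHTTPServer.

```python
import os
import subprocess
import sys


def child_sees(extra_env, name):
    """Run a child Python process and report what it sees for one variable."""
    env = os.environ.copy()
    env.update(extra_env)  # add the CGI-style variables explicitly
    code = "import os, sys; print(os.environ.get(sys.argv[1], '<missing>'))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_sees({"PATH_INFO": "/blog/2005/01"}, "PATH_INFO"))
```

If the child reports the value here but a CGI script launched by the server does not, the server's way of spawning the script is the likely culprit.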

by : Created on Jan. 9, 2005 : Last modified Feb. 8, 2005 : (permalink)

Annotated Word Association Sketch

One of my favourite Monty Python sketches (and perhaps the cleverest) is John Cleese's Word Association.

Below is a transcript of the sketch which I have annotated according to an initial analysis performed by my sister Jenni and me. There is an underlying talk being given and I have marked that up in bold. I have tried to group the phrases resulting from a word association on distinct lines and have repeated in parentheses where a word forms part of two overlapping word associations that aren't part of the main text or where a difference of spelling exists. (To recover the sketch, just ignore the parentheticals.)

Tonight's the night

I shall be talking

a-bout of flu

the subject of word

association football.

This is a techn-

-ique (eke) out a living

much used in the

practice makes perfect

of psychoanaly-

sister and brother

and one that has occu-

pied piper

the

majority rule

of my

attention squad by the right number one two three four (for)

the last five

years (here's) to the memory.

It is quite remark-

able baker charlie

how

much the miller's son

this so-

called while you were out

word associ-

ation (Asian) immigrants' problems

influences the

manner (manna) from heaven

in which we

(Wee), sleekit, cowrin, tim'rous beastie

(beasti)-al

(all-)American

speak (Speke), the famous explorer.

And the

really well that is surprising

partner in crime

is that a

lot (Lot) and his wife

of the

lions' feeding time

we may

be c d e eff-

ectively quite unaware of the

fact or fiction section of the Watford Public Library

that we are even doing

it is a far, far better thing that I do now

(now) then, now then, what's going on

(on)ward Christian

(Christian) Barnard the famous heart

(heart)y part of the lettuce

(let us) now praise famous men

(men)tal homes for loonies like me.

So (sew) on the button,

my con-

tention (tension) causing all the headaches,

is that unless we take into ac-

count of Monte Cristo

in our thin-

king George the Fifth

this phenomen-

on the other hand

we shall not be able satis-

Fact or Fiction section of the Watford Public Library again

-ily to under-

stand to attention when I'm talking to you and stop laughing,

about human nature, man's psychological

make-up some story the wife'll believe

and hence the very meaning of life it-

selfish bastard, I'll kick him in the balls

(Balls) Pond Road.

UPDATE (2005-01-10): Thanks to Tim Bulkeley for his suggestion for "now praise famous mental homes for loonies like me."

UPDATE (2005-01-10): At the suggestion of my other sister, Leonie, I did a Google search for sleekit and came up with a poem by Robert Burns that begins: "Wee, sleekit, cowrin, tim'rous beastie"

UPDATE (2005-01-20): Further refined "beastie all-American"

UPDATE (2005-01-25): Made Lot clearer and corrected "eke out a living". Added "here's to the memory" (thanks to Bill Keller) and "Much the Miller's Son" (thanks to Bill Keller and Jason Hildebrand)

UPDATE (2005-02-22): Corrected the last line to be a reference to Balls Pond Road in London (thanks to Matt Peirse and Graham Douglas)

by : Created on Jan. 9, 2005 : Last modified Feb. 21, 2005 : (permalink)

Bill Gates and the Creative Communists

By now most people in the blogosphere have heard about Bill Gates's statement, in response to a question on whether intellectual property rights need to be reformed, that "There are some new modern-day sort of communists who want to get rid of the incentive for musicians and moviemakers and software makers under various guises. They don't think that those incentives should exist."

A lot of people have been up-in-arms about the characterisation of groups like the Creative Commons folk as communists, but even if Bill was talking about Creative Commons, many of the criticisms I've read seem to miss the main point.

The key point to make, in the context of Creative Commons, is that CC isn't about legal reform—it's about helping creators to convey their licensing intentions within current copyright laws.

Yes I think current copyright terms are stupid, but your works don't have to be subject to them if you don't want. As a creator of the work, you have the control.

That is what was stupid about Michael Moore being okay about illegal copies of Fahrenheit 9/11 before the election. If he was the copyright holder, and he wanted it to be freely distributable for non-commercial purposes, he could have made the film available under a CC-like license. It's ridiculous to reserve all rights (or assign them to an entity that does) and then complain that people should be allowed to copy the work.

If Bill Gates was talking about Creative Commons, then his comment was a straw man. CC is about helping creators to realise the flexibility they have. To give them choices. Even expand the incentives. And there are great market opportunities for publishers, music distributors, etc who want to work with this flexibility too. Artist doesn't like the deal from the label? Go somewhere else like Magnatune. Consumer doesn't like the redistribution terms of the song? Don't buy it.

Some people find incentive in money, others in fame, others in making a lasting contribution. As long as people are free to pursue any or all of those paths, that sounds pretty good to me. If someone truly was wanting to get rid of incentives (of any kind), then I'd have a problem. In as much as Bill was saying that, then I agree with him.

UPDATE (2005-01-10): See Glenn Otis Brown's post on the Creative Commons blog. Notice he characterises CC as a "voluntary, market-based approach to copyright". Just that one phrase pretty much makes the point this entire blog entry was trying to. And it pretty neatly sums up why I'm a fan of CC.

by : Created on Jan. 9, 2005 : Last modified Feb. 8, 2005 : (permalink)

Feedster Interesting Feeds of the Day

I just discovered Feedster's Interesting Feeds of the Day and, to my shock, discovered I was chosen as the Interesting Feed of the Day back on 14th November.

Given some of the other blogs linked to, I can't help but feel (like I did when I was put alongside people like Eric Schmidt and Ray Ozzie by Network World magazine in 2000) that it's just a mistake that will soon be corrected.

Then again, they didn't say, "good", or "enjoyable" or "informative". Just "interesting".

by : Created on Jan. 7, 2005 : Last modified Feb. 8, 2005 : (permalink)

Delicious Trackbacks and Leonardo

Peter Sefton has another great post about his team's intended use of del.icio.us to share bookmarks within the group.

The downsides Peter points out got me thinking about Leonardo acting as a delicious server.

The advantage of del.icio.us (the actual site, not the software or idea) is that it aggregates a lot in one place. But for more specialised categories, running a delicious-like service for your domain of interest isn't a bad idea and that's where Leonardo could come in.

I've previously suggested that trackbacks could be used for annotating resources and that categories could be viewed as resources that entries in that category track back to. Pinging delicious is really just like a trackback but the actor isn't necessarily the source and the target is a category/tag rather than another entry.

So a team could set up a Leonardo server (once the functionality I'm talking about has been implemented) and set up categories for the team. When they come across a resource of interest they use a delicious/trackback-like API to tell that Leonardo server about the resource.

Of course, there's nothing specific to Leonardo there. See, for example, this delicious clone (via Steve Mallett).

Another interesting result is that you've essentially namespaced your tags. So "leonardo" on jtauber.com can mean the software without clashing with other senses the tag might be used for.

by : Created on Jan. 7, 2005 : Last modified Feb. 8, 2005 : (permalink)

Leonardo 0.4.1 Released

Leonardo is the Python software that runs this site, providing a blog and wiki-like content management system for personal websites. Leonardo requires Python 2.3 but no additional software as it uses the filesystem directly for storage.

0.4.1 fixes a bug that prevented editing of the css stylesheet.

Leonardo 0.4.1 is available at: http://jtauber.com/2005/leonardo/leonardo-0.4.1.tgz

by : Created on Jan. 7, 2005 : Last modified Feb. 8, 2005 : (permalink)

Who's Coming to SxSW?

I registered back in September but now that it's getting closer, I thought I'd ask again if there's anyone reading this that is planning on attending. Send me an email. We'll catch up.

The speakers for the Interactive stream have been announced. The Interactive stream is where I know the most people but, like last year, I'm also attending the Music and Film streams as well.

And, with any luck, Alibi Phone Network will be at the film festival!

by : Created on Jan. 6, 2005 : Last modified Feb. 8, 2005 : (permalink)

NetNewsWire and Flagging Items

When I came back in October from my extended trip to the US, I thought I'd try out NetNewsWire and I've used it ever since (previously I was using Bloglines in a browser).

I've done a good job of getting my unread entries to zero each day but that's misleading because what I find myself doing is flagging items and never going back to them.

What I need at a bare minimum is a display of how many flagged items there are. NetNewsWire has a virtual folder that automatically contains all flagged items. It would be nice if it displayed not only the number of unread items but also the overall total.

Without a total count, it's easy to forget you've got stuff in the folder and having an exact number makes it much easier to set goals like "I'll keep my flagged items below 100".

In fact, as a general rule, hierarchical containers of items to read should show both the unread count and the total count. I find it frustrating that most email clients don't show both. But for now I'd just be happy if NetNewsWire did it.

I flag items for a variety of reasons:

it's a long entry that looks interesting but I don't have time right now to read it
I've read it but it was so good I want to keep it
it's an entry I want to reply to
it's an entry that's given me an idea for an entry on my own blog
it contains a link to something of interest

So the second thing that would be nice is multiple flag types. That way I could distinguish entries to keep from ones I need to act on in a timely fashion. I'm thinking just a variety of colours and a virtual folder for each flag type.

Hopefully these two features are on their way in NetNewsWire. Brent?

by : Created on Jan. 6, 2005 : Last modified Feb. 8, 2005 : (permalink)

Translations, Glosses, Tags and Folksonomies

There's been some recent discussion on Slashdot and in the blogosphere on the incremental, bottom-up taxonomies ("folksonomies") created via tags in things like del.icio.us and Flickr.

Besides the fact that I've long been interested in taxonomies, I've been thinking about some of these issues recently because (a) I'll soon be implementing categories in Leonardo and (b) I've just started reading John Lee's A History of New Testament Lexicography (which, for all you New Testament Greek scholars out there, is a must read).

What does New Testament Lexicography have to do with del.icio.us tags? Read on.

When I'm explaining to people some of the challenges with translation and reading translated works (whether the New Testament or any other work), I like to use the following Venn diagram:

Two intersecting circles, one marked A, the other marked B. The intersection is marked '2'; the part of A not intersecting B is marked '1'; the part of B not intersecting A is marked '3'

Consider A to be the word in the original language and the circle on the left to represent the range of possible meanings of that word. A translator chooses to translate A as the word B, with the circle on the right representing the range of possible meanings of that word.

Very few words match up exactly between two languages. There will be senses of A that B doesn't have (marked '1' above) and senses of B that A doesn't have (marked '3').

The first thing that can go wrong is the translator assuming the wrong sense of A. If the original author meant '1' then B will be a bad translation.

But even if the translator gets the sense right there is still the possibility that the reader of the translation will assume the wrong sense of B (marked '3').

This challenge arises not only in translating texts but also in dictionaries and this is where Lee's book is so fascinating. Looking up an individual word in a bilingual dictionary is subject to the same challenge, particularly if the dictionary just provides a gloss rather than a full definition. In just providing a gloss (an equivalent word in the target language) there is a risk that a user of the dictionary will take the wrong sense of the gloss.

Full definitions are generally much better, although, as Lee points out there are cases where a gloss does just fine and is even preferable. χιών is adequately defined by the gloss snow and there is no need to define χιών as "the aqueous vapour of the atmosphere precipitated in partially frozen crystalline form and falling to the earth in white flakes" (which is how one dictionary, cited by Lee, defines "snow").

In the realm of New Testament Lexicography, lexicons such as Louw and Nida's Greek-English Lexicon of the New Testament Based on Semantic Domains do an excellent job of teasing out the different senses of Greek words and making clearer which senses of corresponding English words they map to.

What does all this mean for tags? There is a tremendous practicality in tag-based folksonomies but they do suffer from many of the same problems as glossing. Perhaps the biggest issue is disambiguation. A given tag can have multiple senses.

Say I used the tag "leonardo" for my software. I'd then need to come up with a different tag if I wanted to talk about Leonardo da Vinci. If I'd talked about the latter first and chosen "leonardo" for him, I would have then needed to come up with a different tag for my software.

That doesn't sound that big a deal, but in a common tag set, it's much more difficult to coordinate that kind of disambiguation. Someone might have already started using "leonardo" for one sense and another come along and used it for another sense without realising.

In a way, the problem is that the tags are their own gloss. There's no definition of what their sense or scope is. How might one provide a disambiguated version of a tag, without adding complexity that would drive people away from using them at all? Using URIs instead of tags is, of course, the "right" thing to do (in as much as it would provide a unique identifier for each sense) but it just won't fly with the majority of Flickr or even del.icio.us users.

That's why I previously suggested Wikipedia as the basis for disambiguation. Wikipedia provides an excellent platform for disambiguation, not at the level a lexicographer or translator might expect, but good enough that it would provide enough benefit for the cost in folksonomy tag disambiguation.

Also see Tag the Tags which suggests an easy way to add expressiveness to the tagging approach to classification without adding too much complexity.

by : Created on Jan. 5, 2005 : Last modified Nov. 9, 2005 : (permalink)

Give Elsewhere Too

With all the coverage the Tsunami disaster is getting, it's easy to forget there are other places in the world hurting too.

So here's a suggestion: pick another project or appeal your favourite aid organisation is currently working on and match dollar-for-dollar what you gave for the Tsunami appeal. (I chose the Red Cross's Sudan Appeal.)

Who knows, it might become an addiction :-)

by : Created on Jan. 5, 2005 : Last modified Feb. 8, 2005 : (permalink)

Tag the Tags

After my previous entry on Translations, Glosses, Tags and Folksonomies, I started thinking about some of the other limitations with tags as well, including normalising synonyms and expressing hierarchy or grouping.

One solution could solve both. If you could have a tag that meant the union of certain other tags, you could create a parent tag for all the synonyms. Taxonomists might shudder at the conflation of synonymity with grouping but it seems entirely appropriate for a folksonomy.

Of course, any tag should be allowed to have multiple parents where it fits into multiple larger categories (noting that parents generalise, not specialise).

How might these groupings be expressed? Well, it just dawned on me:

Tag the Tags

If tags themselves could be tagged, a much richer taxonomy could be built. You could have tag groupings, tag types, etc. And none of it would interfere with the existing data.

Of course, it's poor-man's RDF with only one property and tags instead of URIs, but, hey, it just might work for folksonomies.
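
To make the idea concrete, here is a minimal sketch in Python of a store in which tags can themselves be tagged. None of this comes from an existing tagging service: the TagStore class, its method names and the example tags are all hypothetical, and a real system would need persistence and per-user tag sets. It only shows that one mechanism (tagging a tag) covers both synonym normalisation and grouping, with multiple parents allowed.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory store (hypothetical) where tags can be tagged too."""

    def __init__(self):
        self.item_tags = defaultdict(set)    # item -> tags applied to it
        self.tag_parents = defaultdict(set)  # tag -> tags applied to that tag

    def tag_item(self, item, tag):
        self.item_tags[item].add(tag)

    def tag_tag(self, tag, parent):
        # Tagging a tag makes `parent` a generalisation of `tag`.
        self.tag_parents[tag].add(parent)

    def expand(self, tag):
        """Return the tag plus every ancestor reachable by tagged tags."""
        seen = set()
        stack = [tag]
        while stack:
            current = stack.pop()
            if current in seen:  # guards against cycles
                continue
            seen.add(current)
            stack.extend(self.tag_parents.get(current, ()))
        return seen

    def items_for(self, tag):
        """Items tagged with `tag` directly or via any descendant tag."""
        return {
            item
            for item, tags in self.item_tags.items()
            if any(tag in self.expand(t) for t in tags)
        }


store = TagStore()
store.tag_item("photo1", "nyc")
store.tag_item("photo2", "newyork")
store.tag_item("photo3", "boston")

# A parent tag acting as the union of two synonyms...
store.tag_tag("nyc", "new-york-city")
store.tag_tag("newyork", "new-york-city")
# ...and the same mechanism used for grouping.
store.tag_tag("new-york-city", "us-cities")
store.tag_tag("boston", "us-cities")

print(sorted(store.items_for("new-york-city")))
print(sorted(store.items_for("us-cities")))
```

Note that the existing item-to-tag data is never rewritten; the extra structure lives entirely in the tag-to-tag table.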

by : Created on Jan. 5, 2005 : Last modified Feb. 8, 2005 : (permalink)

Film Project Update: DVDs Finally Arrive

I previously recounted the ongoing saga of the DVDs of Alibi Phone Network that Tom sent UPS Next Day Air on 18th December.

Well, they finally arrived, 17 days after they were sent. The wait was worth it, though. They came out really nicely.

by : Created on Jan. 4, 2005 : Last modified Feb. 8, 2005 : (permalink)

Priority Levels

I've talked about the difference between priority and severity and proposed possible severity levels. Here's my current thinking on priority levels in issue tracking systems for software development (and specifically Leonardo's Roundup tracker).

Most systems I've seen simply use numbers for priorities: P1, P2, P3, P4 and P5, for example. To really know which one to use, groups end up assigning particular meaning to them.

Outside of software development, I've found it useful to think of priority in terms of modals like:

must
should
could

and, where appropriate, adding things like:

shouldn't
can't (i.e. blocked)

I think it's useful in software development to think of those modals in the context of releases. e.g. this bug must get fixed for 0.5 or that enhancement should get in 0.6.

This is already better than P1-P5 but there are some complications that need to be addressed.

Firstly, if one assumes that maintenance patches are still taking place on previous releases, it is possible for issues to have priorities relative to multiple releases. For example, a bug might be a "must for 0.5" and a "should for 0.4.1". How would one express that in an issue tracking system?

Secondly, one probably needs finer-grained priorities on occasions when, of the 20 things that must or should get into the next release, there are 5 that must get done in the next week and another 5 that should.

The second is less of an issue for Leonardo in my opinion, but it would still be nice to address.

One way of addressing the former, especially if one assumes that one is only actively maintaining one version prior (reasonable at this stage of Leonardo at least), is to have separate priority fields: one for the upcoming patch and one for the upcoming new version.

So both "patch priority" and "trunk priority" could have values like:

urgent
must this release
should this release
could this release
not this release
N/A
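
As a rough sketch of the two-field idea in Python (this is not Roundup's actual schema; the Priority enum, the Issue class, the worklist helper and the example issue titles are all made up for illustration):

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical encoding of the levels above; lower sorts first."""
    URGENT = 0
    MUST = 1
    SHOULD = 2
    COULD = 3
    NOT_THIS_RELEASE = 4
    NA = 5


@dataclass
class Issue:
    title: str
    patch_priority: Priority = Priority.NA  # relative to the upcoming patch
    trunk_priority: Priority = Priority.NA  # relative to the upcoming version


def worklist(issues, field):
    """Issues in play for one release line, most pressing first."""
    in_play = [i for i in issues
               if getattr(i, field) < Priority.NOT_THIS_RELEASE]
    return sorted(in_play, key=lambda i: getattr(i, field))


issues = [
    # "must for 0.5" and "should for 0.4.1"
    Issue("hypothetical bug A",
          patch_priority=Priority.SHOULD, trunk_priority=Priority.MUST),
    Issue("hypothetical enhancement B",
          trunk_priority=Priority.COULD),
    Issue("hypothetical bug C",
          patch_priority=Priority.URGENT,
          trunk_priority=Priority.NOT_THIS_RELEASE),
]

print([i.title for i in worklist(issues, "patch_priority")])
print([i.title for i in worklist(issues, "trunk_priority")])
```

The "urgent" level is one way of getting a little of the finer granularity mentioned above without adding dates or a third field.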

What priorities have others found useful?

by : Created on Jan. 3, 2005 : Last modified Feb. 8, 2005 : (permalink)

A Number of People Were High

A friend called me last night asking whether "number of casualties" is singular or plural. Apparently the words had been uttered by a reporter and there had been some debate at the friend's house as to whether the reporter had got the agreement between the verb and subject correct.

My intuition was that the following are both grammatical:

a number of casualties were reported.
the number of casualties was reported.

and that a with was and the with were would both be ungrammatical.

Thinking about it some more, I came up with another example pair.

a number of people were high.
the number of people was high.

That made things a little clearer to me (although looking back at the original pair, it's now obvious there too). With the indefinite article a, the heads of the phrases are casualties and people (both plural) whereas with the definite article the, the heads of the phrases are both number (singular).

In "a number of casualties were reported", it's the casualties that were reported; in "the number of casualties was reported", it's the number that was reported.

Even more clearly, in "a number of people were high", it's the people that were high; in "the number of people was high", it's the number that was high.

That would all suggest structures along the lines of:

((a number of) people) were high.
(the number (of people)) was high.

The structure explains the subject-verb agreement and the semantics.

The interesting question in my mind remains: what is it about the article that determines which structure is licensed in each case?

by : Created on Jan. 3, 2005 : Last modified Nov. 2, 2005 : (permalink)

More on Priority and Severity

Previously I talked about wanting to separate priority and severity in Roundup and proposed some severity levels:

security or safety issue
major problem with no known workaround
major problem with known workaround
minor inconvenience
cosmetic

At the time I left open how to handle features and what priority levels could be.

I just did a Google search on 'priority severity'. My previous blog entry was actually the first hit. The second hit was a page on the original c2 wiki espousing the principle DifferentiatePriorityAndSeverity.

Some commentators suggested a distinction was a nice idea in theory but too hard for submitters in practice. What I am suggesting, though, is that the submitter need only worry about severity, the developer fixing the bug about priority and only the people responsible for triage really need to worry about the relationship between the two.

One commentator mentioned Microsoft's severity levels as being:

crash or loss of data
feature doesn't work
aspect of a feature doesn't work
cosmetic

This seems reasonable, although the second and third might get blurry unless it's clear what the granularity of a "feature" is (which is probably spelt out in specs at Microsoft). It also doesn't take into account whether a workaround exists which I think is important.

I do like the calling out of a crash or loss of data. And I did have a slight problem with my own earlier list in distinguishing "major" and "minor" which the Microsoft list doesn't worry about.

With regard to features, I think it is helpful to distinguish new features from enhancements to existing features. This may suffer from the same granularity problem mentioned earlier for the Microsoft severity levels but I think it makes sense in the context of a particular project's plan.

As I mentioned in my previous entry, I think there's also a need to handle code cleanup, refactoring and other internal enhancements. I previously suggested a possible separate class, but I think it can be done with severity. I think it also helps to have a catch-all "general tasks" for when it isn't clear there is (yet) a suitable level to assign the issue to.

So one possible severity level list would be:

B1 security or safety issue
B2 crash or loss of data
B3 major problem with no known workaround
B4 major problem with known workaround
B5 minor inconvenience
B6 cosmetic
general task
E3 code cleanup or refactoring (internal enhancement)
E2 enhancement to existing feature (user-visible enhancement)
E1 new feature

Note that in both the B-series and E-series, the lower number means greater significance.
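
Here is one way the list might be written down in Python, purely as a sketch. The names SEVERITY_LEVELS, is_bug and more_significant are hypothetical, and leaving "general task" without a code simply mirrors the list above.

```python
# Severity levels in the order listed above: (code, description).
# "general task" has no code, so None is used as a placeholder.
SEVERITY_LEVELS = [
    ("B1", "security or safety issue"),
    ("B2", "crash or loss of data"),
    ("B3", "major problem with no known workaround"),
    ("B4", "major problem with known workaround"),
    ("B5", "minor inconvenience"),
    ("B6", "cosmetic"),
    (None, "general task"),
    ("E3", "code cleanup or refactoring (internal enhancement)"),
    ("E2", "enhancement to existing feature (user-visible enhancement)"),
    ("E1", "new feature"),
]


def is_bug(code):
    """True for the B-series, False for the E-series and general tasks."""
    return code is not None and code.startswith("B")


def more_significant(a, b):
    """Compare two codes in the same series; the lower number wins."""
    if a[0] != b[0]:
        raise ValueError("codes from different series are not comparable")
    return int(a[1:]) < int(b[1:])


print(more_significant("B2", "B5"))  # crash outranks minor inconvenience
print(more_significant("E1", "E3"))  # new feature outranks code cleanup
```

The helper deliberately refuses to compare a B-level with an E-level: how a bug ranks against an enhancement is a priority question for triage, not a severity one.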

Next up, some thoughts on priority.

by : Created on Jan. 3, 2005 : Last modified Feb. 8, 2005 : (permalink)

Leonardo 0.4.0 Released

I am pleased to announce the release of Leonardo 0.4.0.

Leonardo 0.4.0 is a complete re-architecture designed to facilitate future implementation of a wide variety of features including trackbacks, comments and latex-generated images.

Leonardo 0.4.0 is available at: http://jtauber.com/2005/leonardo/leonardo-0.4.0.tgz

by : Created on Jan. 2, 2005 : Last modified Feb. 8, 2005 : (permalink)