Saturday, October 11, 2008

Linking words

Often, when I looked up synonyms in a thesaurus, I was surprised at how easy it was to start from some random word and follow a chain of ‘synonyms’ that eventually led to another word whose meaning was completely different from the original one.

Naturally, I started wondering: are all words perhaps connected in one large graph of synonyms? Or are words clustered into several groups with more or less similar meanings? And, if so, how many such groups are there? A few dozen? A few hundred? If you want to try to guess, this is your last chance.

Being the maniac I am, and having some basic knowledge of graph theory, I got my hands on the digital version of a thesaurus and hacked together some quick scripts to find the answers. These are the results.

I used The Oxford American Writer's Thesaurus, which is also the one used by the Dictionary application on Macs. The version I used has 31'673 ‘senses’ (i.e. entries relating a word to many others with the same meaning) and a total of 52'307 different words and short phrases.

Apart from a small set of 193 disconnected senses (more details later), the remaining 31'480 senses are all linked together in a single group of connected words. This means that from pretty much every word in the thesaurus you can get to every other word just by following chains of synonyms.
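As a rough sketch of how such a count could be done (this is not the script used for this post), one can treat every sense as a node, link two senses whenever they share a word, and group them with a union-find structure. The `SENSES` dictionary and the function name `sense_components` are made up for illustration, and the toy entries are not the real thesaurus data.

```python
from collections import defaultdict

# Toy data (an assumption, not the real thesaurus): sense label -> words it lists.
SENSES = {
    "accomplished": {"accomplished", "good", "mean", "expert"},
    "base": {"base", "mean", "bad", "sordid"},
    "luggage": {"luggage", "trunks", "baggage"},
    "swimsuit": {"swimsuit", "bathing suit", "trunks"},
    "isolated": {"isolated"},
}


def sense_components(senses):
    """Group senses into connected components.

    Two senses are considered linked when they share at least one word.
    """
    parent = {sense: sense for sense in senses}

    def find(sense):
        # Walk up to the root, shortening the path along the way.
        while parent[sense] != sense:
            parent[sense] = parent[parent[sense]]
            sense = parent[sense]
        return sense

    first_sense_for_word = {}
    for sense, words in senses.items():
        for word in words:
            if word in first_sense_for_word:
                # This word was seen before: merge the two senses' groups.
                parent[find(sense)] = find(first_sense_for_word[word])
            else:
                first_sense_for_word[word] = sense

    groups = defaultdict(set)
    for sense in senses:
        groups[find(sense)].add(sense)
    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)


components = sense_components(SENSES)
for component in components:
    print(len(component), sorted(component))
```

On real data, the first entry of `components` would be the large connected group, and the remaining entries would be the small disconnected ones.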

Even words with opposite meanings, such as ‘good’ and ‘bad’, are connected, and not very far apart. Just by looking at the entries of two senses, one finds that ‘good’ is listed as a synonym of ‘mean’ (in the sense of accomplished), while ‘mean’ is listed as a synonym of ‘bad’ (in the sense of base). If you are skeptical, here are the two relevant entries, taken directly from the thesaurus:

accomplished: an accomplished bassoonist expert, skilled, skillful, masterly, successful, virtuoso, master, consummate, complete, proficient, talented, gifted, adept, adroit, deft, dexterous, able, good, competent, capable, efficient, experienced, seasoned, trained, practiced, professional, polished, ready, apt; informal great, mean, nifty, crack, ace, wizard; informal crackerjack.

base2: base motives sordid, ignoble, low, low-minded, mean, immoral, improper, unseemly, unscrupulous, unprincipled, dishonest, dishonorable, shameful, bad, wrong, evil, wicked, iniquitous, sinful. antonym noble.


Some more interesting trivia: in this large connected group, getting from one word to any other takes, on average, 3.81 senses. So paths between different words tend to be very short. The center of the thesaurus is the word

set

from which you can get to any other connected word in an average of 2.66 senses.
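For readers who want to play with these two numbers, here is a minimal sketch of one way to compute them, assuming distance means "number of senses followed" and that all words in the data are connected. It runs a breadth-first search from every word, averages the distances, and picks as center the word with the smallest average. The toy `SENSES` data and all function names are hypothetical, not taken from the thesaurus or from the original scripts.

```python
from collections import defaultdict, deque

# Toy data (an assumption): sense label -> words it lists.
SENSES = {
    "a": {"hot", "warm"},
    "b": {"warm", "mild", "gentle"},
    "c": {"gentle", "kind"},
    "d": {"warm", "cosy"},
}


def build_index(senses):
    """Map every word to the set of senses that list it."""
    index = defaultdict(set)
    for sense, words in senses.items():
        for word in words:
            index[word].add(sense)
    return index


def distances_from(start, senses, index):
    """Breadth-first search; following one sense costs one step."""
    dist = {start: 0}
    expanded = set()  # senses already followed; no need to follow them twice
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for sense in index[word]:
            if sense in expanded:
                continue
            expanded.add(sense)
            for other in senses[sense]:
                if other not in dist:
                    dist[other] = dist[word] + 1
                    queue.append(other)
    return dist


index = build_index(SENSES)
mean_distance = {}
total = 0
pairs = 0
for word in list(index):
    dist = distances_from(word, SENSES, index)
    others = [d for w, d in dist.items() if w != word]
    mean_distance[word] = sum(others) / len(others)
    total += sum(others)
    pairs += len(others)

average = total / pairs  # mean distance over all ordered pairs of words
center = min(mean_distance, key=mean_distance.get)
print(f"average distance: {average:.2f}")
print(f"center: {center} (mean distance {mean_distance[center]:.2f})")
```

One search per word is quadratic in the worst case, which is fine for a toy graph and still feasible, if slow, for a graph with tens of thousands of words.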

The most distant pair of words is only 9 senses apart. These are the short phrases ‘swimming trunks’ and ‘in any other way’ which are connected by the following chain of senses:

bathing suit: swimming trunks - bathing suit
swimsuit: bathing suit - trunks
luggage: trunks - luggage
possession: luggage - assets
saving: assets - saving
but: saving - but
but: but - on the other hand
alternatively: on the other hand - otherwise
otherwise: otherwise - in any other way
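A chain like the one above can be recovered by remembering, during each breadth-first search, which word and which sense led to every newly reached word. The sketch below is only an illustration: it uses nothing but the word pairs quoted in the chain itself (not the full thesaurus entries), and the names `CHAIN_SENSES`, `bfs` and `farthest_pair` are invented for this example.

```python
from collections import defaultdict, deque

# Only the word pairs quoted in the chain above, as (sense label, words).
# A list is used because two different senses share the label "but".
CHAIN_SENSES = [
    ("bathing suit", {"swimming trunks", "bathing suit"}),
    ("swimsuit", {"bathing suit", "trunks"}),
    ("luggage", {"trunks", "luggage"}),
    ("possession", {"luggage", "assets"}),
    ("saving", {"assets", "saving"}),
    ("but", {"saving", "but"}),
    ("but", {"but", "on the other hand"}),
    ("alternatively", {"on the other hand", "otherwise"}),
    ("otherwise", {"otherwise", "in any other way"}),
]


def bfs(start, senses, index):
    """Return {word: (distance, previous word, sense id)} for reachable words."""
    info = {start: (0, None, None)}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for sid in index[word]:
            for other in senses[sid][1]:
                if other not in info:
                    info[other] = (info[word][0] + 1, word, sid)
                    queue.append(other)
    return info


def farthest_pair(senses):
    """Find the most distant pair of words and the chain of senses joining them."""
    index = defaultdict(list)  # word -> ids of the senses that list it
    for sid, (_, words) in enumerate(senses):
        for word in words:
            index[word].append(sid)

    best = (0, None, None, None)
    for start in list(index):
        info = bfs(start, senses, index)
        end = max(info, key=lambda w: info[w][0])
        if info[end][0] > best[0]:
            best = (info[end][0], start, end, info)

    distance, start, end, info = best
    chain = []
    word = end
    while word != start:
        # Walk back from the far end using the stored predecessors.
        _, previous, sid = info[word]
        chain.append((senses[sid][0], previous, word))
        word = previous
    chain.reverse()
    return distance, chain


distance, chain = farthest_pair(CHAIN_SENSES)
print(f"{distance} senses apart")
for label, left, right in chain:
    print(f"{label}: {left} - {right}")
```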


The ‘boring’ words, which are disconnected from everything else, are usually words that list only a few synonyms and/or some usage notes in the thesaurus. The biggest group of such words has only 5 related senses. For the curious, this is the full list of disconnected senses.

Also, for those interested, the programming techniques I used to compute this information are very similar to those used by Stephen Dolan in his Six Degrees of Wikipedia. Since my graph is considerably smaller, though, the results were obtained just by leaving a single computer running overnight.

4 comments:

ZGRL said...

Wow, that was exactly what I was going to say: it seems there is a relationship with the "six degrees theory" so in vogue these days.

I find all of this quite interesting (obsessive and manic too, but hey, I'm also a little manic sometimes). Actually, from time to time, when I'm a little bored, I just grab the dictionary and look up some strange words and their meanings, and let me tell you (not that I have to, but anyway), there are some pretty messed up words and interpretations. And then, by the time I realize it, the time has just flown by.

Take care! :D

Juan said...

Hi ZGRL! Yep.. it's pretty amazing how many useless things we can do with our (not always free) time :-D

Rafael Peñaloza said...

I once heard
"the human being is capable of doing anything, as long as it is not what he is supposed to be doing" ;)
(sorry, cannot find the source)

Juan said...

LOL! Yep, that sounds a lot like me!