Andreas Tille wrote:
> The last remaining problem is in the WordNet dict conversion
> program that was kindly provided by Sebastian Hagen
> <[EMAIL PROTECTED]>. It just happens that
>
> $ dictl -d wn test
>
> works fine but
>
> $ dictl -d wn toast
> No definitions found for "toast", perhaps you mean:
> wn: oast boast roast
>
> Even if there is
>
> $ wn toast -over
> Overview of noun toast
>
> The noun toast has 4 senses (first 1 from tagged texts)
> ...
>
> So this problem was just solved by Sebastian for WordNet 2.1 and
> I just hope that he could review his code to make things work also
> under 3.0.

It turns out that I never really solved this problem for WordNet 2.1.
The bug I did find (s/_/ /) was relevant and reduced the occurrences of
this problem, which moved it out of your set of test cases; but it
didn't fix the fundamental disagreement between wordnet_structures and
dictd about a detail of the dict-db format. For instance, looking up
'toad' doesn't work in the WordNet 2.1 databases either. I only figured
this out now that you found another instance in the WordNet 3.0 output.

Running dictd with '-d search' finally gave me an understanding of why
this is happening. When looking up words, dictd does a binary search
over a section of the index file to find any matching entry. In the
process, it seemed to consider 'to-do' to be bigger than 'toast'.
A comment in dictd's index.c finally cleared this up:

    The comparison must be the same as "sort -df" phone directory
    order: ignore all characters except letters, digits, and blanks;
    fold upper case into the equivalent lowercase.

Naturally, the 'DATABASE FORMAT' section in dictd(1) mentions none of
this; in fact, it doesn't even mention that the order of the entries in
that file has any relevance whatsoever.

Some reading of the dictd code reveals that adding the
'00-database-allchars' headword to the index file changes the ordering
expected by dictd's binary search to ASCII, in addition to its
documented meaning of allowing lookup of headwords that contain
characters other than letters, digits, and blanks. Both of these are
appropriate for the WordNet database.

I've adjusted wordnet_structures.py to insert that header, as well as
to put the 00-* headwords into normal headword order. AFAICT this fixes
the issue for dictd.

I don't know what the situation is like for other dict servers, but I'd
hope they're not making even more assumptions about the format of these
files. If there is another dict server that requires a different
ordering of headwords with this set of 00-* headers, its dict-db format
dialect is almost certainly incompatible with that of 1.10.2.

I'm attaching the new version of wordnet_structures.py to this mail.

Regards,
Sebastian Hagen

P.S.: I'm not sending this to [EMAIL PROTECTED], as I have no idea if
it would be appropriate. Feel free to forward.
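
For illustration, here is a minimal sketch of the ordering mismatch
described above. It is neither dictd's code nor the code in
wordnet_structures.py: the names dict_order_key, ascii_key and lookup
and the tiny headword list are made up for this example, and
dict_order_key only approximates the "sort -df" rule quoted from
index.c. The point is just that a binary search fails as soon as the
index is sorted with a different comparison than the one the search
uses.

```python
from bisect import bisect_left


def dict_order_key(word):
    """Approximate 'sort -df' order (assumption, not dictd's code):
    keep only letters, digits and blanks, and fold case."""
    return "".join(c for c in word.lower() if c.isalnum() or c == " ")


def ascii_key(word):
    """Plain character-by-character order, as with 00-database-allchars."""
    return word


def lookup(index, word, key):
    """Binary-search the list `index` for `word`, comparing via `key`.

    The result is only reliable if `index` is sorted by the same `key`.
    """
    keys = [key(entry) for entry in index]
    target = key(word)
    pos = bisect_left(keys, target)
    return pos < len(keys) and keys[pos] == target


headwords = ["tin", "to", "to-do", "toad", "toast"]

# Index sorted in plain ASCII order: 'to-do' comes before 'toad'.
ascii_sorted = sorted(headwords)
# Index sorted in dictionary order: 'to-do' compares as 'todo'.
dict_sorted = sorted(headwords, key=dict_order_key)

# ASCII-sorted index searched with dictionary-order comparison: miss.
print(lookup(ascii_sorted, "toast", dict_order_key))
# Index and search agree on dictionary order: hit.
print(lookup(dict_sorted, "toast", dict_order_key))
# Index and search agree on ASCII order: hit.
print(lookup(ascii_sorted, "toast", ascii_key))
```

In the first call the search probes 'to-do', which compares as 'todo'
and therefore as bigger than 'toast', so it looks in the wrong half of
the index. Either sorting the index in dictionary order or switching
the comparison to ASCII order makes both sides consistent again.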
wordnet_structures.py.gz
Description: application/gzip