Andreas Tille wrote:
> The last remaining problem is in the WordNet dict conversion
> program that was kindly provided by Sebastian Hagen
> <[EMAIL PROTECTED]>.  It just happens that
> 
>    $ dictl -d wn test
> 
> works fine but
> 
>    $ dictl -d wn toast
>    No definitions found for "toast", perhaps you mean:
>    wn:  oast  boast  roast
> 
> Even if there is
> 
>    $ wn toast -over
>    Overview of noun toast
> 
>    The noun toast has 4 senses (first 1 from tagged texts)
>    ...
> 
> So this problem was just solved by Sebastian for WordNet 2.1 and
> I just hope that he could review his code to make things work also
> under 3.0.
It turns out that I never really solved this problem for WordNet 2.1.
The bug I did find (s/_/ /) was relevant, and decreased the occurrences of
this problem, which shifted it out of your set of test cases; but it didn't
fix the fundamental disagreement about a dict-db format detail between
wordnet_structures and dictd. For instance, looking up 'toad' doesn't work
in the WordNet 2.1 databases, either.
I only figured this out now that you found another instance in the WordNet
3.0 output.
Running dictd with '-d search' finally left me with an understanding of why
this is happening. When looking up words, dictd does a binary search over a
section of the index file to find any matching entry.
In the process, it seemed to consider 'to-do' to be bigger than 'toast'.

A comment in dictd's index.c finally cleared this up:

   The comparison must be the same as "sort -df" phone directory order:
   ignore all characters except letters, digits, and blanks; fold upper
   case into the equivalent lowercase.

Naturally, the 'DATABASE FORMAT' section in dictd(1) mentions none of this;
in fact, it doesn't even mention that the order of the entries in the index
file has any relevance whatsoever.
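
To illustrate the rule from that comment, here is a minimal Python sketch of
the two orderings. The function names are made up for this mail and this is
not dictd's actual code; it only mimics the comparison as described in the
comment, for ASCII headwords, and treats the space as the only blank.

```python
import string

# Characters that survive under "sort -df" style dictionary order
# (assumption: ASCII letters, digits and the space character only).
_KEPT = set(string.ascii_letters + string.digits + " ")


def dict_order_key(headword):
    """Hypothetical sort key: drop everything except letters, digits and
    blanks, then fold upper case into lower case."""
    return "".join(c for c in headword if c in _KEPT).lower()


def ascii_order_key(headword):
    """Sort key for plain byte-wise (ASCII) ordering."""
    return headword.encode("ascii")


words = ["toast", "to-do", "toad"]

# '-' (0x2d) sorts before every letter, so 'to-do' comes first in ASCII order.
print(sorted(words, key=ascii_order_key))

# With the '-' ignored, 'to-do' is compared as 'todo' and ends up last.
print(sorted(words, key=dict_order_key))
```

So an index sorted byte-wise puts 'to-do' before 'toast', while a comparison
that ignores the hyphen considers it bigger, which matches what the
'-d search' output suggested.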

Some dictd-code reading reveals that adding the '00-database-allchars'
headword to the index file changes the ordering expected by dictd's binary
search to plain ASCII order, in addition to its documented meaning of
allowing lookup of headwords that contain characters other than letters,
digits and blanks. Both of these are appropriate for the WordNet database.
I've adjusted wordnet_structures.py to insert that headword, as well as to
put the 00-* headwords into normal headword order.
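
As a rough sketch of why the lookup failed, and why agreeing on one ordering
fixes it: a binary search only finds an entry if the list is sorted by the
same comparison the search uses. The code below is illustrative only; the
names and the tiny word list are invented, and it is taken neither from
dictd nor from wordnet_structures.py.

```python
import string

_KEPT = set(string.ascii_letters + string.digits + " ")


def dict_order_key(headword):
    """Hypothetical key for "sort -df" style order (ASCII headwords only)."""
    return "".join(c for c in headword if c in _KEPT).lower()


def ascii_order_key(headword):
    """Key for plain byte-wise (ASCII) order."""
    return headword.encode("ascii")


def binary_search(entries, target, key):
    """Return True if target is found, assuming entries are sorted by key."""
    lo, hi = 0, len(entries)
    wanted = key(target)
    while lo < hi:
        mid = (lo + hi) // 2
        current = key(entries[mid])
        if current == wanted:
            return True
        if current < wanted:
            lo = mid + 1
        else:
            hi = mid
    return False


# Headwords in byte-wise order, as the converter wrote them.
index = sorted(["tin", "to-do", "toast"], key=ascii_order_key)

# Search and file disagree: 'to-do' is compared as 'todo', which is bigger
# than 'toast', so the search wrongly moves to the lower half and misses.
print(binary_search(index, "toast", dict_order_key))

# Search and file agree on ASCII order: the entry is found.
print(binary_search(index, "toast", ascii_order_key))
```

The first call models the broken case (ASCII-sorted file, dictionary-order
search); the second models the situation after the search is switched to
ASCII order.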

AFAICT this fixes the issue for dictd. I don't know what the situation is
like for other dict servers, but I'd hope they're not making even more
assumptions about the format of these files.
If there is another dict server that requires a different ordering of
headwords with this set of 00-* headers, its dict-db format dialect is
almost certainly incompatible with that of 1.10.2.
I'm attaching the new version of wordnet_structures.py to this mail.


Regards,
Sebastian Hagen

P.S.: I'm not sending this to [EMAIL PROTECTED], as I have no
idea if it would be appropriate. Feel free to forward.


Attachment: wordnet_structures.py.gz
Description: application/gzip

Attachment: signature.asc
Description: OpenPGP digital signature
