Consolidating replies to two mails:

Andreas Tille wrote:
> I have no idea how strict we should be about the length of lines
> in the output.  I just adopted this check from the previous maintainer.
> Moreover the problem is just in the starting comment:
> 
> $ grep "^.\{73,\}" wn.dict
> the following copyright notice and statements, including the disclaimer, 
> WordNet 2.1 Copyright 2005 by Princeton University.  All rights reserved.
> 
> that is contained in all input files.  What is your opinion about this?
> I think I will change the check to
> 
>         if [ ${LINESWRONG} -gt 2 ] ; then
> 
> which leaves us alone with this stuff and if wordnet_structures.py
> would really produce several to long lines we would notice this anyway. 
Sounds good to me. The lines are still well below 80 columns, even with the
extra space added by standard dict(1) output. I don't think rewrapping the
license text to try to stick to a possibly arbitrary 72 column limit is a
good idea.
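
For reference, the relaxed check can be sketched in a few lines of Python.
This is only an illustration and not the packaging script: the function name,
the MAX_ALLOWED constant and the sample data are made up. The only things it
takes from the above are the limit from the grep pattern (73 or more
characters counts as too long) and the "-gt 2" threshold.

```python
def count_long_lines(lines, limit=72):
    """Count lines longer than `limit` characters (trailing newline ignored)."""
    return sum(1 for line in lines if len(line.rstrip("\n")) > limit)


# Hypothetical threshold mirroring the proposed "-gt 2" test:
# up to two over-long lines (the copyright header) are tolerated.
MAX_ALLOWED = 2

# Made-up sample input: one line of 73 characters, one of exactly 72.
sample = ["short line\n", "x" * 73 + "\n", "y" * 72 + "\n"]
wrong = count_long_lines(sample)
print(wrong, "long line(s);", "FAIL" if wrong > MAX_ALLOWED else "ok")
```

With real data one would pass the open wn.dict file object instead of the
sample list.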

> I just implemented the "allow two long lines sloppyness test" and
> putted the result into my private package repository that now is
> able to build the dict-wn package again.  I installed it and you
> are right that
> 
>         dict -d wn test
> 
> works now perfectly, but ...
> 
> $ wordnet toast -over
> 
> Overview of noun toast
> 
> The noun toast has 4 senses (first 1 from tagged texts)
> 
> 1. (5) toast -- (slices of bread that have been toasted)
> 2. toast -- (a celebrity who receives much acclaim and attention; "he
> was the toast of the town")
> 3. goner, toast -- (a person in desperate straits; someone doomed; "I'm
> a goner if this plan doesn't work"; "one mistake and you're toast")
> 4. pledge, toast -- (a drink in honor of or to the health of a person or
> event)
> 
> Overview of verb toast
> 
> The verb toast has 2 senses (first 2 from tagged texts)
> 
> 1. (5) crispen, toast, crisp -- (make brown and crisp by heating; "toast
> bread"; "crisp potatoes")
> 2. (1) toast, drink, pledge, salute, wassail -- (propose a toast to;
> "Let us toast the birthday girl!"; "Let's drink to the New Year")
> 
> 
> $ dict -d wn toast
> No definitions found for "toast", perhaps you mean:
> wn:  oast  boast  coast  roast
> 
> 
> So we now have a slightly better dict-wn but it took me only three
> tests to find a missing word.  Any chance to find the problem in the
> conversion code?
Funny thing, this one.
For one thing, not all dict servers get conniptions about it; serpento deals
with the produced index and data files ok, and can look 'toast' up.
Since dictd(8) doesn't produce much in the way of debug output even when
using --verbose, it took me a while to see the problem.

dictd(8) can't deal with headwords that contain underscores. In fact, it
doesn't just fail to look them up, it completely loses track and fails to
look up words close behind them in the data file, too.

Wordnet, OTOH, can't deal with headwords that contain spaces, and replaces
those with underscores in its data files, presumably under the assumption
that none of its words actually contain underscores; that's fine with me
here, since dictd(8) couldn't deal correctly with those anyway.
I didn't know either of those peculiarities, though I could have guessed at
wordnet not keeping spaces as spaces; they use \x20 for field separation, so
their format really doesn't allow it.

Actually fixing it was then a matter of adding a few bytes of code to
wordnet_structures.py to do string replacements in two places. Looking up
'toast' works for me now.
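
To illustrate the kind of replacement involved, here is a minimal sketch. It
is not the code from the attached wordnet_structures.py: the function name
and the sample index line are made up, and it assumes only what is described
above, namely that fields are separated by \x20 and that spaces inside a
headword are stored as underscores.

```python
def headword_from_index_line(line):
    """Return the headword of a WordNet-style index line with spaces restored.

    Hypothetical helper: assumes the headword is the first field and that
    fields are separated by a single space (\x20).
    """
    raw = line.split(" ", 1)[0]
    # WordNet writes spaces in headwords as underscores; turn them back into
    # spaces before the headword is handed to the dict index.
    return raw.replace("_", " ")


# Made-up sample line, for illustration only.
print(headword_from_index_line("french_toast n 1 0"))  # prints "french toast"
```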

I'm attaching the newest version of wordnet_structures; aside from the
bugfix this one produces output that is even closer to what wnfilter did
(purely indentation changes), and doesn't cache whole data files in memory.
Peak RAM usage is down to 111 MB on x86 and 233 MB on x86_64.

Regards,
Sebastian Hagen

Attachment: wordnet_structures.py.gz
Description: application/gzip
