Bug#454772: wcatalan: Lots of repeated words in the list

Jordi Mallach Sun, 09 Dec 2007 10:43:44 -0800

On Sat, Dec 08, 2007 at 05:10:19PM +0100, Agustin Martin wrote:
> On Fri, Dec 07, 2007 at 08:03:31PM +0100, Marc Coll wrote:
> > The wordlist file is surprisingly big compared to the same file in
> > English or Spanish (7.5 MB compared to less than 1 MB). The cause
> > seems to be the fact that there are a lot of repeated words. A few
> > examples are: abacallani, embalsameu, embali...
> > 
> > I'm currently working on a little program which should be able to
> > find and remove all duplicated occurrences. I'll send the corrected
> > version of the file to the package maintainer as soon as I get it to
> > work.
> I do not have the sources here, but a combination of sort and uniq
> during the build process should do the trick.


From the build process in debian/rules:


        #       This generates the wcatalan wordlist.
        debian/strip_mwl | ispell -d $(CURDIR)/catala.debian -e | \
                tr -s ' ' '\n' | uniq > catala.words.debian

Weird. The output is run through uniq; however, Marc is right about the
examples he gave, such as embalsameu.

Running:

  uniq /usr/share/dict/catala > /tmp/catala.uniq

results in identical files.
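
One possible explanation, offered as a sketch rather than a confirmed
diagnosis of the package: uniq only collapses identical lines that are
adjacent, so if the ispell output is not sorted, repeats separated by
other words survive. The snippet below illustrates this on a tiny sample
list. The words come from Marc's report; the temporary file and the
variable names are made up for the illustration.

```shell
# Hypothetical sample: "embalsameu" appears twice, but not on adjacent lines.
sample=$(mktemp)
printf 'embali\nembalsameu\nabacallani\nembalsameu\n' > "$sample"

# uniq alone leaves the list unchanged, because no duplicates are adjacent.
uniq_count=$(uniq "$sample" | wc -l | tr -d ' ')

# Sorting first makes duplicates adjacent; sort -u then drops them.
sorted_count=$(sort -u "$sample" | wc -l | tr -d ' ')

# uniq -d on sorted input lists each word that occurs more than once.
dupes=$(sort "$sample" | uniq -d)

echo "uniq alone: $uniq_count lines; sort -u: $sorted_count lines"
echo "duplicated words: $dupes"
rm -f "$sample"
```

If this is indeed the cause, replacing uniq with sort -u at the end of
the pipeline in debian/rules, along the lines Agustin suggested, should
be enough.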

-- 
Jordi Mallach Pérez  --  Debian developer     http://www.debian.org/
[EMAIL PROTECTED]     [EMAIL PROTECTED]     http://www.sindominio.net/
GnuPG public key information available at http://oskuro.net/
