Bug#454772: wcatalan: Lots of repeated words in the list

Agustin Martin Sun, 09 Dec 2007 12:43:09 -0800

On Sun, Dec 09, 2007 at 07:32:46PM +0100, Jordi Mallach wrote:
> On Sat, Dec 08, 2007 at 05:10:19PM +0100, Agustin Martin wrote:
> > On Fri, Dec 07, 2007 at 08:03:31PM +0100, Marc Coll wrote:
> > > The wordlist file is surprisingly big compared to the same file in
> > > English or Spanish (7.5 MB compared to less than 1 MB). The cause
> > > seems to be the fact that there are a lot of repeated words. A few
> > > examples are: abacallani, embalsameu, embali...
> > > 
> > > I'm currently working on a little program which should be able to
> > > find and remove all duplicated occurrences. I'll send the corrected
> > > version of the file to the package maintainer as soon as I get it to
> > > work.
> > I do not have the sources here, but a combination of sort and uniq
> > during the build process should do the trick.
> 
> From the build process in debian/rules:
> 
> 
>         #       This generates the wcatalan wordlist.
>         debian/strip_mwl | ispell -d $(CURDIR)/catala.debian -e | \
>                 tr -s ' ' '\n' | uniq > catala.words.debian


What about

         #       This generates the wcatalan wordlist.
         debian/strip_mwl | ispell -d $(CURDIR)/catala.debian -e | \
                 tr -s ' ' '\n' | sort -u > catala.words.debian

using sort with the --unique (-u) option instead of uniq?
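
For reference, uniq only collapses repeated lines when they are adjacent,
so on unsorted input it lets separated duplicates through. A minimal sketch
of the difference follows. The input is made up (the words come from Marc's
report, their order is an assumption for illustration), and LC_ALL=C is
only there to keep the sort order predictable:

```shell
# Hypothetical input: a duplicate separated by another word.
words='embali
embalsameu
embali'

# uniq compares neighbouring lines only, so both copies of "embali" survive.
uniq_out=$(printf '%s\n' "$words" | uniq)

# sort -u sorts first, so duplicates become adjacent and are dropped.
sort_out=$(printf '%s\n' "$words" | LC_ALL=C sort -u)

printf 'uniq:\n%s\n\nsort -u:\n%s\n' "$uniq_out" "$sort_out"
```

A side effect is that the resulting wordlist ends up sorted, which should
be harmless for a dictionary file.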

> Weird. It is run through uniq, however Marc is right regarding the
> examples he gave, like embalsameu.
> 
> Running:
> 
>   uniq /usr/share/dict/catala > /tmp/catala.uniq
> results in identical files.

You can test with 

$ sort -u /usr/share/dict/catala > catala.tmp

Resulting sizes, in bytes:

6519080 catala.tmp
7450965 /usr/share/dict/catala

$ grep -n embalsameu catala.tmp
221517:embalsameu

$ grep -n embalsameu  /usr/share/dict/catala
264507:embalsameu
264520:embalsameu
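
To see which words are repeated, rather than only comparing file sizes,
something along these lines should work. This is only a sketch: the sample
file is made up so that it runs without the package installed, and in
practice the input would be /usr/share/dict/catala:

```shell
# Hypothetical sample standing in for /usr/share/dict/catala.
tmp=$(mktemp)
printf '%s\n' embalsameu abacallani embali embalsameu abacallani > "$tmp"

# uniq -d prints one copy of each line that occurs more than once;
# it needs sorted input for the same adjacency reason as plain uniq.
dups=$(LC_ALL=C sort "$tmp" | uniq -d)
printf '%s\n' "$dups"

rm -f "$tmp"
```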

-- 
Agustin


