On Sun, Dec 09, 2012 at 02:46:39AM -0200, Nelson A. de Oliveira wrote:
> Package: wbrazilian
> Version: 3.0~beta4-15
> Severity: minor
> 
> Hi!
> 
> /usr/share/dict/brazilian has some duplicated words:
> 
> $ wc -l /usr/share/dict/brazilian
> 275502 /usr/share/dict/brazilian
> 
> $ sort -u /usr/share/dict/brazilian | wc -l
> 275170

Thanks for the info. This seems to be related to the encoding. On the
one hand, I cannot reproduce it in my ISO-8859-1 locale,

$ wc -l  /usr/share/dict/brazilian
275502 /usr/share/dict/brazilian
$ sort -u   /usr/share/dict/brazilian | wc -l
275502

but I get the same result as you when using a UTF-8 locale:

$ LC_ALL=es_ES.UTF-8 sort -u   /usr/share/dict/brazilian | wc -l
275170

No words are duplicated (apart from upper/lowercase versions of some of them).
The words are simply being sorted under a locale whose encoding differs from
the file's own, and this causes some mess. This is probably a bug in sort.
Checking the differences after

$ LC_ALL=es_ES.UTF-8 sort -u /usr/share/dict/brazilian |
    LC_ALL=es_ES.ISO-8859-1 sort > test.txt

shows that some words are missing, but I do not see a pattern in the
missing words. I have seen a recent bug report about 'sort -u' data loss
(http://bugs.debian.org/685238), but it should already be fixed.
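
In case it helps to narrow this down, the dropped lines can be listed
directly by comparing against a bytewise reference taken in the C locale,
where only byte-identical lines are merged. This is only a rough sketch: the
function name missing_lines is made up for illustration, and it assumes the
locale passed as the second argument is installed on the system.

```shell
# Hypothetical helper: print the lines that a locale-aware 'sort -u' drops
# compared with a bytewise 'sort -u'.
#   $1 = wordlist file
#   $2 = locale to test (for example es_ES.UTF-8)
missing_lines() {
    ref=$(mktemp) && out=$(mktemp) || return 1
    # Bytewise reference: in the C locale only identical byte strings merge.
    LC_ALL=C sort -u "$1" > "$ref"
    # Result under the tested locale, re-sorted bytewise so comm accepts it.
    LC_ALL="$2" sort -u "$1" | LC_ALL=C sort > "$out"
    # Lines present only in the reference are the ones the locale dropped.
    LC_ALL=C comm -23 "$ref" "$out"
    rm -f "$ref" "$out"
}
```

Something like "missing_lines /usr/share/dict/brazilian es_ES.UTF-8" should
then print the affected words, which may make a pattern easier to spot.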

By the way, this reminds me that I should recode the wordlist to UTF-8.
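
A minimal sketch of that recoding step, assuming the current wordlist is
plain ISO-8859-1 and that iconv is available. The function name
recode_wordlist is hypothetical, and how this would fit into the package
build is not decided here.

```shell
# Hypothetical helper: recode a wordlist from ISO-8859-1 to UTF-8.
#   $1 = input file (assumed to be ISO-8859-1)
#   $2 = output file (written as UTF-8)
recode_wordlist() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}
```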

Regards,

-- 
Agustin

