On Sun, Dec 09, 2012 at 02:46:39AM -0200, Nelson A. de Oliveira wrote:
> Package: wbrazilian
> Version: 3.0~beta4-15
> Severity: minor
>
> Hi!
>
> /usr/share/dict/brazilian has some duplicated words:
>
> $ wc -l /usr/share/dict/brazilian
> 275502 /usr/share/dict/brazilian
>
> $ sort -u /usr/share/dict/brazilian | wc -l
> 275170

Thanks for the info. This seems to be related to the encoding. On the one
hand, I cannot reproduce it in my ISO-8859-1 locale,

$ wc -l /usr/share/dict/brazilian
275502 /usr/share/dict/brazilian
$ sort -u /usr/share/dict/brazilian | wc -l
275502

but I get the same result as you when using a UTF-8 locale,

$ LC_ALL=es_ES.UTF-8 sort -u /usr/share/dict/brazilian | wc -l
275170

No words are duplicated (apart from upper/lowercase versions of some of
them); the words are just being sorted under a locale different from the
file's own, and this causes the mess.

This is probably a bug in sort. Checking for differences after

$ LC_ALL=es_ES.UTF-8 sort -u /usr/share/dict/brazilian | LC_ALL=es_ES.ISO-8859-1 sort > test.txt

shows that some words are missing, but I do not see a pattern in the
missing words. I have seen a recent bug report about 'sort -u' data loss
(http://bugs.debian.org/685238), but that should already be fixed.

By the way, this reminds me that I should recode the wordlist to UTF-8.

Regards,

-- 
Agustin
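
To list exactly which words disappear, something along the following lines
might help. It is only a sketch: lost_lines is a made-up helper name, not
part of any package, and it assumes mktemp and comm from coreutils are
available. It takes a byte-wise "sort -u" (LC_ALL=C) as the reference, since
that merges only identical lines, and prints every line that "sort -u" under
the given locale dropped.

```shell
#!/bin/sh
# lost_lines FILE LOCALE  (hypothetical helper, for illustration only)
# Print the lines that survive a byte-wise `sort -u` but are missing
# from the output of `sort -u` run under LOCALE.
lost_lines() {
    file=$1
    locale=$2
    tmp_ref=$(mktemp) || return 1
    tmp_loc=$(mktemp) || { rm -f "$tmp_ref"; return 1; }

    # Reference: byte-wise comparison, so only identical lines are merged.
    LC_ALL=C sort -u "$file" > "$tmp_ref"

    # Candidate: unique under LOCALE, then re-sorted byte-wise so that
    # comm sees both inputs in the same order.
    LC_ALL=$locale sort -u "$file" | LC_ALL=C sort > "$tmp_loc"

    # Column 1 of comm: lines present only in the reference.
    LC_ALL=C comm -23 "$tmp_ref" "$tmp_loc"

    rm -f "$tmp_ref" "$tmp_loc"
}
```

For example, "lost_lines /usr/share/dict/brazilian es_ES.UTF-8" should print
the words that account for the difference in the counts above, provided that
locale is installed on the system.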