On Mon, 21 Apr 2008 08:33:47 +0200, Hunter wrote:

> I've narrowed the problem down to a simple test program. Check this out:
>
> ---
>
> # -*- coding: utf-8 -*-
>
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñú" # this line will work
> acceptable = "abcdefghijklmnopqrstuvwxyzóíñúá" # this line won't
> #wtf?
>
> word = "¡A"
> word_key = ''.join([c for c in word.lower() if c in acceptable])
> print "word_key = " + word_key
>
> ---
>
> Any ideas? I'm really stumped!

You are not working with unicode strings but with UTF-8 encoded byte
strings. Those are bytes, not letters/characters. Your `word`, for
example, contains three bytes and not the two characters you think it
contains:

    In [43]: word = "¡A"

    In [44]: len(word)
    Out[44]: 3

    In [45]: for c in word:
       ....:     print repr(c)
       ....:
    '\xc2'
    '\xa1'
    'A'

So you are *not* testing whether ¡ is in `acceptable`, but whether each
of the two byte values that make up the UTF-8 representation of that
character is. Adding á (UTF-8 '\xc3\xa1') puts the byte '\xa1' into
`acceptable`, so the second byte of ¡ ('\xc2\xa1') now passes the test
and ends up in `word_key` as a stray, invalid byte.

Ciao,
	Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
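
For reference, the byte-versus-character difference can be reproduced in a small sketch. It is written for Python 3, where bytes and text are separate types, so encoding to UTF-8 stands in for what the Python 2 byte-string literals in the test program held. The helper names `filter_bytes` and `filter_chars` are made up for this illustration and are not from the original program. In Python 2 the corresponding fix would most likely be to use unicode literals (`u"..."`) for both `acceptable` and `word`.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch (Python 3). filter_bytes and filter_chars are
# hypothetical helpers, not part of the original test program.

def filter_bytes(word, acceptable):
    """Filter byte by byte, as the original program effectively did."""
    word_b = word.lower().encode("utf-8")
    acceptable_b = acceptable.encode("utf-8")
    # Iterating over bytes yields ints; `in` on bytes accepts ints.
    return bytes(b for b in word_b if b in acceptable_b)

def filter_chars(word, acceptable):
    """Filter character by character on text (unicode) strings."""
    return "".join(c for c in word.lower() if c in acceptable)

short = "abcdefghijklmnopqrstuvwxyzóíñú"
long_ = short + "á"   # á encodes to b'\xc3\xa1' in UTF-8
word = "¡A"           # ¡ encodes to b'\xc2\xa1' in UTF-8

print(repr(filter_bytes(word, short)))  # b'a'
print(repr(filter_bytes(word, long_)))  # b'\xa1a' (stray byte from ¡)
print(repr(filter_chars(word, long_)))  # 'a'
```

The second result is not valid UTF-8, which would explain why printing it goes wrong, while the character-based version gives the same answer whether or not á is in the list.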