On Mon, Sep 1, 2008 at 9:11 PM, Max <[EMAIL PROTECTED]> wrote:

>
> Karen,
> thanks for your answer,
> here are some more details on what I am trying to do. I over-simplified,
> I'm sorry:
>
> 1 - I would like to implement a search on name fields encoded in utf8 so
> that a search with the key Remi entered by the user returns all entries
> that are equivalent to their canonical form (Rémi would be returned in
> our example)

So you have a model with a CharField 'name'? And you'd like to be able to
search on the 'name' field and, if the user enters 'Remi', get both 'Remi'
and 'Rémi' matches? You do not need to write code to do that yourself;
specifying a non-binary collation in MySQL should do it automatically for
you. From my own db, where I do not have any 'Rémi's but I do have some
'Chichén's, which match your example pretty closely since both have
e-with-acute:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from crossword.models import Clues
>>> cl = Clues.objects.filter(Clue__icontains='Chichen')
>>> cl
[<Clues: MAYA: Chichén Itzá builder>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: MAYA: Chichén Itzá builders>,
<Clues: ITZA: Chichén ____: Mayan ruins>,
<Clues: MAYANS: Chichen Itza denizens>,
<Clues: RUINS: Chichen Itza attraction>,
<Clues: ITZA: Chichén ____ (Mayan city)>,
<Clues: MAYA: Chichén Itzá resident>,
<Clues: ITZA: Chichén ____ (Mayan city)>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: MAYAN: Chichen Itza builder>,
<Clues: RUINS: Chichén Itzá attraction>,
<Clues: ITZA: Chichen ____: Mayan ruins>]
>>> cl2 = Clues.objects.filter(Clue__icontains='Chichén')
>>> cl2
[<Clues: MAYA: Chichén Itzá builder>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: MAYA: Chichén Itzá builders>,
<Clues: ITZA: Chichén ____: Mayan ruins>,
<Clues: MAYANS: Chichen Itza denizens>,
<Clues: RUINS: Chichen Itza attraction>,
<Clues: ITZA: Chichén ____ (Mayan city)>,
<Clues: MAYA: Chichén Itzá resident>,
<Clues: ITZA: Chichén ____ (Mayan city)>,
<Clues: ITZA: Chichén ____ (Mayan ruins)>,
<Clues: MAYAN: Chichen Itza builder>,
<Clues: RUINS: Chichén Itzá attraction>,
<Clues: ITZA: Chichen ____: Mayan ruins>]
>>>

I get results that match both 'Chichén' and 'Chichen' regardless of which
one I specify. My DB happens to be encoded in latin1, with default
collation latin1_swedish_ci, but I'd expect a non-binary utf8 collation to
behave the same way. You do need to use a case-insensitive lookup
(icontains, iexact, istartswith, etc.) to ensure the database default
collation is used and a binary match is not attempted.

>
> For this I pull the data from DB and use unicodedata 's normalize
> function on foth unicode_data
>
> unic = string.decode('utf-8')
> normalized = unicodedata.normalize('NFKD', unic)
> return normalized.encode('ASCII', 'ignore')
>
> This one is throwing the error on the first decode
> Your suggestion that it is already unicode is really interesting. But
> I tried without the first decode and it still crashes.
>

Unless I've misunderstood what you're looking for and you have determined
that relying on the database collation to do this for you won't work, I
really don't think you need this code at all. However, if we were to
pursue debugging it, more specifics than "it still crashes" would be
helpful. I can't tell from that statement whether you get the exact same
error or something a little different.

Also, it might still be worth figuring out why your model data seems to be
coming back as unicode on one machine and string on the other. This really
shouldn't be happening. Printing out type(your_model.your_fieldname)
somewhere after you've retrieved the data from the database would show
what you're getting. If it's different on the two machines you probably
want to figure out why, or else you're likely to keep hitting problems as
you move code from development to production.
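
If it does turn out that you need the accent-stripping code, here is a minimal sketch of how it could cope with either type coming back from the database. It is written for Python 3, where str is the unicode type and bytes is the byte string (that is my assumption; on 2.4/2.5 the same check would be against unicode and str). The helper name strip_accents and its encoding parameter are made up for illustration.

```python
import unicodedata


def strip_accents(value, encoding="utf-8"):
    """Return an ASCII-only approximation of value (hypothetical helper)."""
    # Decode only when we were handed bytes; text is used as-is, which
    # avoids the "decode something that is already unicode" error.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # NFKD splits e-with-acute into a plain 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", value)
    # Encoding to ASCII with 'ignore' drops the combining accents.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Escapes are used so the result does not depend on the source encoding.
print(strip_accents("R\u00e9mi"))                  # text input
print(strip_accents("R\u00e9mi".encode("utf-8")))  # byte-string input
```

Both calls should give the same plain-ASCII name. Note that this only helps when the bytes really are in the encoding you pass in; it does nothing about bytes that were produced in some other encoding.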

> 2 - I am trying to pass some string to javascript so I would like to
> transform all ' by \'
> ########################################################
> # stringEditApostrophe
> # Find all apostrophes and slashes and delimit with a slash
> ########################################################
> def stringEditApostrophe( vsText):
>     lsMethod = className + '.stringEditApostrophe'
>
>     try:
>         lsFinal = vsText.replace("\\", "\\\\")
>         lsFinal = lsFinal.replace("'", "\\'")
>         lsFinal = lsFinal.replace('%', '%%')
>
>     except Exception, e:
>         #Log what happened
>         logging.debug(lsMethod + ".Error: " + str(e))
>
>     return lsFinal
>
> It seems that this method is not recognizing any of the ' char.
>
>

This seems completely unrelated to the above problem? You probably want to
post it in its own thread because I don't speak javascript and it's quite
possible people who do have tuned out this thread because they don't speak
unicode encoding errors.

> But ABOVE ALL please look at this strange thing: I tried to simply
> recreate your example to understand how it works and I got an error
> even on a basic decode. Am I missing something? (note the 2.4.4
> version)
>

What I showed should work on 2.4 as well as 2.5; I have both and for me
they do not behave differently for this.

>
> Python 2.4.4 (#2, Apr 5 2007, 20:11:18)
> [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> s = 'Rémi'
> >>> type(s)
> <type 'str'>
> >>> s.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "encodings/utf_8.py", line 16, in decode
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3:
> invalid data
>
>

This looks like your terminal encoding is actually something other than
utf-8. I can replicate this by manually setting (Terminal -> Set Character
Encoding) my terminal encoding to Western (WINDOWS-1252) instead of the
default UTF-8.
Then when I enter the string literal 'Rémi' it is actually encoded in
cp1252 and trying to decode it as utf-8 fails with the same error as you
show. So, check your terminal encoding. If it is UTF-8 I'm puzzled by that
error.

Karen
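
To see the terminal-encoding effect without changing any terminal settings, here is a small sketch that builds the bytes for 'Rémi' in both encodings and tries to decode each one as UTF-8. It uses Python 3 syntax, which is my assumption rather than what the sessions in this thread ran, and the helper name decodes_as is made up for illustration.

```python
# 'Rémi' written with an escape so the source file encoding does not matter.
text = "R\u00e9mi"

cp1252_bytes = text.encode("cp1252")  # bytes a cp1252 terminal would produce
utf8_bytes = text.encode("utf-8")     # bytes a UTF-8 terminal would produce


def decodes_as(raw, encoding):
    """Return True if raw can be decoded with encoding (hypothetical helper)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as(utf8_bytes, "utf-8"))     # UTF-8 bytes read as UTF-8
print(decodes_as(cp1252_bytes, "utf-8"))   # cp1252 bytes read as UTF-8
print(decodes_as(cp1252_bytes, "cp1252"))  # cp1252 bytes read as cp1252
```

Only the middle case fails: the single cp1252 byte for e-with-acute is not a valid UTF-8 sequence on its own, which matches the kind of UnicodeDecodeError shown in the 2.4.4 session.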