Re: utf8 works on my dev server but not on my production server

Karen Tracey Mon, 01 Sep 2008 22:43:38 -0700

On Mon, Sep 1, 2008 at 9:11 PM, Max <[EMAIL PROTECTED]> wrote:

>
> Karen,
> thanks for your answer,
> here are some more details on what I am trying to do. I over-simplified,
> I'm sorry:
>
> 1 - I would like to implement a search on name fields encoded in utf8 so
> that a search with the key Remi entered by the user returns all entries
> that are equivalent to their canonical form (Rémi would be returned in our
> example)



So you have a model with a CharField 'name'?  And you'd like to be able to
search on the 'name' field and if the user enters 'Remi' get both 'Remi'
matches and 'Rémi'?  You do not need to write code to do that yourself,
specifying a non-binary collation in MySQL should do it automatically for
you.  From my own db, where I do not have any 'Rémi's but I do have some
'Chichén's which match your example pretty closely since they both have
e-with-acute:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from crossword.models import Clues
>>> cl = Clues.objects.filter(Clue__icontains='Chichen')
>>> cl
[<Clues: MAYA: Chichén Itzá builder>, <Clues: ITZA: Chichén ____ (Mayan
ruins)>, <Clues: ITZA: Chichén ____ (Mayan ruins)>, <Clues: MAYA: Chichén
Itzá builders>, <Clues: ITZA: Chichén ____: Mayan ruins>, <Clues: MAYANS:
Chichen Itza denizens>, <Clues: RUINS: Chichen Itza attraction>, <Clues:
ITZA: Chichén ____ (Mayan city)>, <Clues: MAYA: Chichén Itzá resident>,
<Clues: ITZA: Chichén ____ (Mayan city)>, <Clues: ITZA: Chichén ____ (Mayan
ruins)>, <Clues: MAYAN: Chichen Itza builder>, <Clues: RUINS: Chichén Itzá
attraction>, <Clues: ITZA: Chichen ____: Mayan ruins>]
>>> cl2 = Clues.objects.filter(Clue__icontains='Chichén')
>>> cl2
[<Clues: MAYA: Chichén Itzá builder>, <Clues: ITZA: Chichén ____ (Mayan
ruins)>, <Clues: ITZA: Chichén ____ (Mayan ruins)>, <Clues: MAYA: Chichén
Itzá builders>, <Clues: ITZA: Chichén ____: Mayan ruins>, <Clues: MAYANS:
Chichen Itza denizens>, <Clues: RUINS: Chichen Itza attraction>, <Clues:
ITZA: Chichén ____ (Mayan city)>, <Clues: MAYA: Chichén Itzá resident>,
<Clues: ITZA: Chichén ____ (Mayan city)>, <Clues: ITZA: Chichén ____ (Mayan
ruins)>, <Clues: MAYAN: Chichen Itza builder>, <Clues: RUINS: Chichén Itzá
attraction>, <Clues: ITZA: Chichen ____: Mayan ruins>]
>>>

I get results that match both 'Chichén' and 'Chichen' regardless of which
one I specify.  My DB happens to be encoded in latin1,
with default collation latin1_swedish_ci, but I'd expect a non-binary utf8
collation to behave the same way.

You do need to use a case-insensitive lookup (icontains, iexact, istartswith,
etc.) to ensure the database default collation is used and a binary match is
not attempted.


>
> For this I pull the data from DB and use unicodedata's normalize
> function on both unicode_data
>
>      unic = string.decode('utf-8')
>      normalized = unicodedata.normalize('NFKD', unic)
>      return normalized.encode('ASCII', 'ignore')
>
> This one is throwing the error on the first decode
> Your suggestion that it is already unicode is really interesting. But
> I tried without the first decode and it still crashes.
>

Unless I've misunderstood what you're looking for and you have determined
that relying on the database collation to do this for you won't work, I
really don't think you need this code at all.  However, if we were to pursue
debugging it, more specifics than "it still crashes" would be helpful.  I
don't know from that statement if you get the exact same error or something
a little different.
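
If the normalization approach were kept anyway, a minimal sketch along the
following lines would tolerate both kinds of input, which is the situation
described here (unicode on one machine, byte string on the other). The
function name fold_accents and the text_type/bytes_type aliases are
hypothetical, and the utf-8 default for byte input is an assumption; none of
this is code from the thread.

```python
import unicodedata

# Aliases so the sketch runs on both Python 2 and Python 3 (assumption:
# these names are illustrative, not part of any library).
text_type = type(u'')                    # unicode on Python 2, str on Python 3
bytes_type = type(u''.encode('ascii'))   # str on Python 2, bytes on Python 3


def fold_accents(value, encoding='utf-8'):
    """Return value as ASCII-only text with accents stripped.

    Byte strings are decoded first (assumed to be utf-8 unless told
    otherwise); values that are already text are used as they are.
    """
    if isinstance(value, bytes_type):
        value = value.decode(encoding)
    # NFKD splits an accented letter into base letter + combining mark,
    # and encoding to ASCII with 'ignore' then drops the combining mark.
    decomposed = unicodedata.normalize('NFKD', value)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


folded_from_text = fold_accents(u'R\xe9mi')                   # u'Rémi'
folded_from_bytes = fold_accents(u'R\xe9mi'.encode('utf-8'))  # utf-8 bytes
print(folded_from_text, folded_from_bytes)
```

Checking the type before decoding is what avoids calling decode on a value
that has already been decoded.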

Also, it might still be worth figuring out why your model data seems to be
coming back as unicode on one machine and string on the other.
This really shouldn't be happening.  Printing out
type(your_model.your_fieldname) somewhere after you've retrieved the data
from the database would indicate what you're getting.  If it's different on
the two machines you probably want to figure out why, or else you're likely
going to be continuing to hit problems as you move code from development to
production.
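
As a rough sketch of that check, something like the following could be run on
both machines and the output compared. The helper name describe_value is
hypothetical, and the sample values stand in for a real model field, which is
not shown in this thread.

```python
# Aliases so the sketch runs on both Python 2 and Python 3.
text_type = type(u'')                    # unicode on Python 2, str on Python 3
bytes_type = type(u''.encode('ascii'))   # str on Python 2, bytes on Python 3


def describe_value(value):
    """Say whether value is decoded text or a raw byte string."""
    if isinstance(value, text_type):
        return 'text'
    if isinstance(value, bytes_type):
        return 'bytes'
    return type(value).__name__


# Stand-ins for your_model.your_fieldname as retrieved on each machine.
as_text = u'R\xe9mi'
as_bytes = u'R\xe9mi'.encode('utf-8')

print(describe_value(as_text), type(as_text))
print(describe_value(as_bytes), type(as_bytes))
```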



> 2 - I am trying to pass some string to javascript so I would like to
> transform all ' by \'
> ########################################################
> # stringEditApostrophe
> # Find all apostrophes and slashes and delimit with a slash
> ########################################################
> def stringEditApostrophe( vsText):
>    lsMethod = className +  '.stringEditApostrophe'
>
>    try:
>        lsFinal = vsText.replace("\\", "\\\\")
>        lsFinal = lsFinal.replace("'", "\\'")
>        lsFinal = lsFinal.replace('%', '%%')
>
>    except Exception, e:
>        #Log what happened
>        logging.debug(lsMethod + ".Error: " + str(e))
>
>    return lsFinal
>
> It seems that this method is not recognizing any of the ' char.
>
>
This seems completely unrelated to the above problem?  You probably want to
post it in its own thread because I don't speak javascript and it's quite
possible people who do have tuned out this thread because they don't speak
unicode encoding errors.


> But ABOVE ALL please look at this strange thing: I tried to simply
> recreate your example to understand how it works and I got an error
> even on a basic decode. Am I missing something? (note the 2.4.4
> version)
>

What I showed should work on 2.4 as well as 2.5; I have both and for me they
do not behave differently for this.


>
> Python 2.4.4 (#2, Apr  5 2007, 20:11:18)
> [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> s = 'Rémi'
> >>> type(s)
> <type 'str'>
> >>> s.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "encodings/utf_8.py", line 16, in decode
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3:
> invalid data
>
>
This looks like your terminal encoding is actually something other than
utf-8.  I can replicate this by manually setting (Terminal -> Set Character
Encoding) my terminal encoding to Western (WINDOWS-1252) instead of the
default UTF-8.  Then when I enter the string literal 'Rémi' it is actually
encoded in cp1252 and trying to decode it as utf-8 fails with the same error
as you show.  So, check your terminal encoding.  If it is UTF-8 I'm puzzled
by that error.
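
The mismatch can be reproduced without touching the terminal settings. This
is a small sketch, assuming the terminal is sending cp1252; the variable names
are illustrative only. It encodes the same name both ways and tries to decode
each result as utf-8.

```python
import sys

# u'Rémi', written with an escape so the encoding of this file and of the
# terminal does not affect the example.
name = u'R\xe9mi'

utf8_bytes = name.encode('utf-8')      # what a UTF-8 terminal would send
cp1252_bytes = name.encode('cp1252')   # what a WINDOWS-1252 terminal would send

# Bytes that really are utf-8 decode back to the original name.
decoded_ok = utf8_bytes.decode('utf-8') == name

# The cp1252 bytes are not valid utf-8, so decoding them raises.
try:
    cp1252_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decoded_ok, decode_failed)

# The encoding Python believes the terminal is using (may be None when
# input is not coming from a terminal).
print(getattr(sys.stdin, 'encoding', None))
```

In cp1252 the e-with-acute is the single byte 0xE9, which utf-8 treats as the
start of a multi-byte sequence; the bytes that follow do not continue that
sequence, hence the decode error.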

Karen

