"Marco Bizzarri" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
On Mon, Sep 1, 2008 at 3:25 PM,  <[EMAIL PROTECTED]> wrote:


When I do ${urllib.unquote(c.user.firstName)} without encoding to
latin-1 I get different chars than I expect: not Łukasz but Å ukasz
--
http://mail.python.org/mailman/listinfo/python-list

That's crazy. "string".encode('latin1') gives you a latin1 encoded
string; latin1 is a single byte encoding, therefore taking the first
byte should be no problem.

Have you tried:

urllib.unquote(c.user.firstName)[0].encode('latin1') or

urllib.unquote(c.user.firstName)[0].encode('utf8')

I'm assuming here that urllib.unquote(c.user.firstName) returns an
encodable string (of which I'm not at all sure), but if it does, this
should take the first 'character'.

The OP stated that the original string was "encoded in UTF-8 and urllib.quote()", so after urllib.unquote the result is a UTF-8 encoded byte string. This must be decoded into a Unicode string before taking the first character:

   urllib.unquote(c.user.firstName).decode('utf-8')[0]
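
To see why the decode has to come before the indexing, here is a rough sketch. It assumes Python 3, where the unquoting functions live in urllib.parse rather than urllib, and it uses '%C5%81ukasz' (the quoted UTF-8 form of 'Łukasz') as a stand-in for c.user.firstName. Slicing the undecoded bytes yields only half of the two-byte character:

```python
from urllib.parse import unquote_to_bytes

# Stand-in for c.user.firstName (an assumption): quoted UTF-8 form of 'Łukasz'.
quoted = '%C5%81ukasz'

# Undo the percent-quoting only; the result is still UTF-8 encoded bytes.
raw = unquote_to_bytes(quoted)

# The first byte alone is only half of the two-byte sequence for 'Ł'.
first_byte = raw[:1]
try:
    first_byte.decode('utf-8')
    first_byte_is_valid = True
except UnicodeDecodeError:
    first_byte_is_valid = False

# Decoding first and indexing second yields the whole character.
first_char = raw.decode('utf-8')[0]
```

In Python 2 the same thing happens with urllib.unquote(...)[0], which returns the single byte '\xc5'.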

The next problem is that the character 'Ł' in the OP's example string is not present in the latin-1 encoding, so encoding it to latin-1 fails. Encoding to utf-8 instead shows that the full two-byte UTF-8 encoded character is returned:

   >>> import urllib
   >>> name = urllib.quote(u'Łukasz'.encode('utf-8'))
   >>> name
   '%C5%81ukasz'
   >>> urllib.unquote(name).decode('utf-8')[0].encode('utf-8')
   '\xc5\x81'
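
For comparison, a rough Python 3 version of the same round trip (assuming urllib.parse, which replaced the Python 2 urllib functions used above). It also checks the latin-1 point and shows what the bytes look like if they are misread as latin-1. That misreading resembles the 'Å ukasz' the OP reported, though how the template actually rendered the bytes is a guess on my part:

```python
from urllib.parse import quote, unquote_to_bytes

# Quote the UTF-8 bytes of 'Łukasz' ('\u0141' is 'Ł').
name = quote('\u0141ukasz'.encode('utf-8'))

# Unquote back to bytes, decode to text, then take the first character.
raw = unquote_to_bytes(name)
first = raw.decode('utf-8')[0]

# 'Ł' has no latin-1 code point, so encoding it to latin-1 raises.
try:
    first.encode('latin-1')
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

# Misreading the UTF-8 bytes as latin-1 gives 'Å' plus a control character.
misread = raw.decode('latin-1')
```

Here name is '%C5%81ukasz' as in the session above, and first.encode('utf-8') gives the two bytes 0xC5 0x81.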

-Mark
