Michael Ströder wrote:
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)

unicode() becomes *slower* when you try "UTF-8" in uppercase, or an entirely different codec, say "cp1252":

  >>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(1000000)
  2.5777881145477295
  >>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(1000000)
  1.8430399894714355
  >>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(1000000)
  2.3622498512268066
  >>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(1000000)
  1.7812771797180176

The reason seems to be that unicode() has a built-in shortcut that bypasses codecs.lookup() entirely when the encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". The shortcut apparently uses an exact string comparison, which is why the uppercase spelling "UTF-8" misses it. OTOH, str.decode() always goes through codecs.lookup().
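
You can see the exact-match behaviour for yourself: even dropping the dash matters, because "utf8" is a perfectly valid registry alias but is not literally "utf-8". A minimal sketch (assuming a Python version where the shortcut is a plain string comparison; timings will vary by machine):

  # -*- coding: utf-8 -*-
  import timeit

  # "utf-8" hits the built-in shortcut; "utf8" and "UTF-8" are valid
  # codec registry aliases, but miss the exact string comparison
  for enc in "utf-8", "utf8", "UTF-8":
      t = timeit.Timer("unicode('äöüÄÖÜß', %r)" % enc).timeit(1000000)
      print "%-6s %.2f seconds" % (enc, t)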

If speed is your primary concern, this will give you even better performance than unicode(), because the codec lookup happens only once instead of on every call:

  import codecs

  # look up the codec once and bind its decode function
  decoder = codecs.lookup("utf-8").decode
  for i in xrange(1000000):
      decoder("äöüÄÖÜß")[0]  # decode returns a (unicode, length) tuple


However, there's also a functional difference between unicode() and str.decode():

unicode() always raises a TypeError when you pass it a unicode object to decode. str.decode(), on the other hand, will first *encode* a unicode object using the default encoding (usually "ascii") and then decode the result, which may or may not work, depending on the characters involved.
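
A minimal sketch of that difference (Python 2, default encoding left at "ascii"; save as UTF-8 because of the source literals):

  # -*- coding: utf-8 -*-
  u = u'äöüÄÖÜß'

  try:
      unicode(u, 'utf-8')
  except TypeError, e:
      print "unicode():", e        # decoding Unicode is not supported

  try:
      u.decode('utf-8')
  except UnicodeEncodeError, e:
      print "str.decode():", e     # the implicit ASCII encode step fails

  print u'abc'.decode('utf-8')     # pure-ASCII unicode happens to work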

Kind Regards,
M.F.
