Michael Ströder wrote:
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')

Which proves that benchmark results can be misleading sometimes. :-)

unicode() becomes *slower* when you try "UTF-8" in uppercase, or an entirely different codec, say "cp1252":

  >>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(1000000)
  2.5777881145477295
  >>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(1000000)
  1.8430399894714355
  >>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(1000000)
  2.3622498512268066
  >>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(1000000)
  1.7812771797180176

The reason seems to be that unicode() has a built-in shortcut that bypasses codecs.lookup() entirely when the encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". The shortcut apparently uses an exact string comparison, which is why the uppercase spelling "UTF-8" misses it. OTOH, str.decode() always goes through codecs.lookup().
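
You can see the exact-match behaviour for yourself: even dropping the dash matters, because "utf8" is a perfectly valid registry alias but is not literally "utf-8". A minimal sketch (assuming a Python version where the shortcut is a plain string comparison; timings will vary by machine):

  # -*- coding: utf-8 -*-
  import timeit

  # "utf-8" hits the built-in shortcut; "utf8" and "UTF-8" are valid
  # codec registry aliases, but miss the exact string comparison
  for enc in "utf-8", "utf8", "UTF-8":
      t = timeit.Timer("unicode('äöüÄÖÜß', %r)" % enc).timeit(1000000)
      print "%-6s %.2f seconds" % (enc, t)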

If speed is your primary concern, this will give you even better performance than unicode(), because the codec lookup happens only once instead of on every call:

  import codecs

  # look up the codec once and bind its decode function
  decoder = codecs.lookup("utf-8").decode
  for i in xrange(1000000):
      decoder("äöüÄÖÜß")[0]  # decode returns a (unicode, length) tuple


However, there's also a functional difference between unicode() and str.decode():

unicode() always raises a TypeError when you pass it a unicode object to decode. str.decode(), on the other hand, will first *encode* a unicode object using the default encoding (usually "ascii") and then decode the result, which may or may not work, depending on the characters involved.
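
A minimal sketch of that difference (Python 2, default encoding left at "ascii"; save as UTF-8 because of the source literals):

  # -*- coding: utf-8 -*-
  u = u'äöüÄÖÜß'

  try:
      unicode(u, 'utf-8')
  except TypeError, e:
      print "unicode():", e        # decoding Unicode is not supported

  try:
      u.decode('utf-8')
  except UnicodeEncodeError, e:
      print "str.decode():", e     # the implicit ASCII encode step fails

  print u'abc'.decode('utf-8')     # pure-ASCII unicode happens to work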

Kind Regards,
M.F.
