Michael Ströder wrote:
> >>> timeit.Timer("unicode('äöüÄÖÜß','utf-8')").timeit(10000000)
> 17.23644495010376
> >>> timeit.Timer("'äöüÄÖÜß'.decode('utf8')").timeit(10000000)
> 72.087096929550171
>
> That is significant! So the winner is:
>
> unicode('äöüÄÖÜß','utf-8')
Which proves that benchmark results can sometimes be misleading. :-)
unicode() becomes *slower* as soon as you spell the encoding "UTF-8" in
uppercase, or use an entirely different codec, say "cp1252" (note that
the runs below use 1,000,000 iterations, a tenth of the runs quoted
above):
>>> timeit.Timer("unicode('äöüÄÖÜß','UTF-8')").timeit(1000000)
2.5777881145477295
>>> timeit.Timer("'äöüÄÖÜß'.decode('UTF-8')").timeit(1000000)
1.8430399894714355
>>> timeit.Timer("unicode('äöüÄÖÜß','cp1252')").timeit(1000000)
2.3622498512268066
>>> timeit.Timer("'äöüÄÖÜß'.decode('cp1252')").timeit(1000000)
1.7812771797180176
The reason seems to be that unicode() bypasses codecs.lookup() if the
encoding is one of "utf-8", "latin-1", "mbcs", or "ascii". OTOH,
str.decode() always calls codecs.lookup().
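You can get a rough idea of what the lookup itself costs (a quick
sketch; absolute numbers depend on the machine, of course):

import timeit
# Time just the codec registry lookup, nothing else:
print timeit.Timer("codecs.lookup('UTF-8')", "import codecs").timeit(1000000)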
If speed is your primary concern, looking the codec up once and reusing
the bound decode function will give you even better performance than
unicode():

import codecs

# lookup() returns a CodecInfo (Python 2.5+); its .decode returns a
# (unicode_object, bytes_consumed) tuple, hence the [0].
decoder = codecs.lookup("utf-8").decode
for i in xrange(1000000):
    decoder("äöüÄÖÜß")[0]
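To time it the same way as the runs above, do the lookup in the setup
statement so it is not measured itself (a sketch; numbers will differ
per machine):

import timeit
# The setup runs once; only the decoder call is timed:
setup = "import codecs; decoder = codecs.lookup('utf-8').decode"
print timeit.Timer("decoder('äöüÄÖÜß')[0]", setup).timeit(1000000)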
However, there's also a functional difference between unicode() and
str.decode():
unicode() always raises an exception when you try to decode a unicode
object. str.decode() will first try to encode a unicode object using the
default encoding (usually "ascii"), which might or might not work.
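A quick interactive sketch of that difference (the exact error messages
may vary slightly between Python versions):

>>> unicode(u'abc', 'utf-8')
Traceback (most recent call last):
  ...
TypeError: decoding Unicode is not supported
>>> u'äöü'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> u'abc'.decode('utf-8')   # pure ASCII, so the implicit encode succeeds
u'abc'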
Kind Regards,
M.F.