On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:

> On Saturday, 7 June 2014 04:20:22 UTC+2, Tim Chase wrote:
>> On 2014-06-06 09:59, Travis Griggs wrote:
>>> On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
>>>> If you use UTF-8 for everything
>>>
>>> It seems to me that, increasingly, other libraries (C, etc.) use
>>> UTF-8 as the preferred string interchange format.
>>
>> I definitely advocate UTF-8 for any streaming scenario, as you're
>> iterating unidirectionally over the data anyway, so why use/transmit
>> more bytes than needed. The only failing of UTF-8 that I've found in
>> the real world(*) is when you have the requirement of constant-time
>> indexing into strings.
>>
>> -tkc
>
> And once again, just an illustration:
>
>>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
> [0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
>>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
> [2.5541921791045183, 2.52434366066052, 2.5337417948967413]
>>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
> [0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
>>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
> [0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
>
>>>> sys.getsizeof('abc'*1000 + '\u0fce')
> 6040
>>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
> 3020
>
> But you know, that's not the problem.
>
> When I see a core developer discussing benchmarking, when the same
> application using non-ASCII chars becomes 2, 5, 10, 20 times (if not
> more) slower compared to pure ASCII, I'm wondering if there is not a
> serious problem somewhere.
>
> (And it is also becoming slower than Py3.2.)
>
> BTW, very easy to explain.
>
> I do not understand why the "free, open, what-you-wish-here, ..."
> software is so often pushing to the adoption of serious corporate
> products.
>
> jmf
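For context, the size and timing differences quoted above are consistent with CPython's flexible string representation (PEP 393, Python 3.3+): a str is stored with 1, 2, or 4 bytes per code point depending on the widest code point it contains, so appending '\u0fce' to a pure-ASCII string forces the whole result into 2 bytes per code point. A minimal sketch of how to observe this (the `per_char` helper is my own illustration, not from the thread, and the exact `sys.getsizeof` numbers are CPython-specific):

```python
import sys

def per_char(ch, n=1000):
    """Approximate per-code-point storage by differencing two lengths,
    which cancels out the fixed object header overhead."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

print(per_char('a'))           # ASCII-only str: 1.0 byte per code point
print(per_char('\u0fce'))      # BMP code point > U+00FF: 2.0 bytes
print(per_char('\U0001f600'))  # astral code point: 4.0 bytes
```

This is why `'abc'*1000 + '\u0fce'` reports roughly twice the size of its UTF-8 encoding in the quoted session: the single Tibetan character widens all 3001 code points to 2 bytes each, while UTF-8 only spends extra bytes on the non-ASCII character itself.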
Your error reports always seem to revolve around benchmarks, despite speed not being one of Python's prime objectives.

Computers store data using bytes. ASCII characters can be stored using a single byte each. Unicode code points cannot all be stored in a single byte, so a Unicode implementation will always be inherently slower than a pure-ASCII one. Implementation details mean that some Unicode characters may be handled more efficiently than others; why is this wrong? Why should all Unicode operations be equally slow?

--
There isn't any problem
--
https://mail.python.org/mailman/listinfo/python-list