[Python-Dev] _PyBytesWriter/_PyUnicodeWriter could be faster

2020-10-25 Thread Ma Lin
Some code needs to maintain an output buffer whose final size is unpredictable, 
for example the bz2/lzma/zlib modules and _PyBytesWriter/_PyUnicodeWriter.

In the current code, each time the output buffer grows, the resize can cause an 
unnecessary memcpy().

issue41486 uses memory blocks to represent the output buffer in the 
bz2/lzma/zlib modules, which can eliminate the overhead of resizing.

There are benchmark charts in issue41486: https://bugs.python.org/issue41486
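
As a rough illustration of the difference between the two strategies described 
above, here is a minimal Python model. It is only a sketch: the real code is C, 
and the class names ResizingBuffer and BlocksBuffer and the bytes_copied 
counter are hypothetical names made up for this example, not part of 
issue41486.

```python
class ResizingBuffer:
    """One contiguous buffer; every growth copies the data written so far."""

    def __init__(self, initial=32):
        self._buf = bytearray(initial)
        self._used = 0
        self.bytes_copied = 0  # bytes moved by resizes (models memcpy)

    def write(self, data):
        need = self._used + len(data)
        if need > len(self._buf):
            # Grow: allocate a larger buffer and copy the old contents over.
            new_buf = bytearray(max(need, 2 * len(self._buf)))
            new_buf[:self._used] = self._buf[:self._used]
            self.bytes_copied += self._used
            self._buf = new_buf
        self._buf[self._used:need] = data
        self._used = need

    def finish(self):
        return bytes(self._buf[:self._used])


class BlocksBuffer:
    """Keeps each chunk as its own block; data is copied once, at the end."""

    def __init__(self):
        self._blocks = []

    def write(self, data):
        self._blocks.append(bytes(data))

    def finish(self):
        # A single concatenation into the final object.
        return b"".join(self._blocks)
```

Both classes produce the same result; the model only shows that the resizing 
variant moves already-written data again on each growth, while the blocks 
variant defers all copying to finish().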


_PyBytesWriter/_PyUnicodeWriter could use the same approach.

If someone writes a general "blocks output buffer", it could be shared by 
_PyBytesWriter/bz2/lzma/zlib. (The issue41486 implementation is not very 
general: it uses a bytes object to represent each memory block.)

If a new _PyUnicodeWriter were written this way, it would have a chance to 
eliminate the overhead of switching PyUnicode_Kind, by recording the position 
where the switch happens:

'a' * 100_000_000 + '\uABCD'
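
To make the kind switch concrete, the following sketch estimates how many bytes 
CPython uses per character for strings of different kinds. It assumes a CPython 
build with the flexible string representation (PEP 393); bytes_per_char is a 
hypothetical helper written for this example.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made only of `ch`.

    Subtracting two sizes cancels the fixed object header, leaving only
    the cost of the extra n characters. CPython-specific.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


ucs1 = bytes_per_char('a')           # Latin-1 range
ucs2 = bytes_per_char('\uABCD')      # BMP, above U+00FF
ucs4 = bytes_per_char('\U00012345')  # outside the BMP

# One wide character at the end widens the whole string.
narrow = 'a' * 1000
widened = narrow + '\uABCD'
```

On such a build the three values come out as 1, 2 and 4, and widened takes 
roughly twice the memory of narrow, which is why appending a single '\uABCD' 
forces all the earlier 'a' characters to be rewritten at the wider size.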

If anyone has the time and is willing to try this, that would be very welcome. 
Otherwise I might do it sometime in the future.
___
Python-Dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/UMB52BEZCX424K5K2ZNPWV7ZTQAGYL53/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: _PyBytesWriter/_PyUnicodeWriter could be faster

2020-10-28 Thread Ma Lin
Thanks for your very informative reply.

I replied to you in issue41486. Memory blocks may not bring a performance 
improvement to _PyBytesWriter/_PyUnicodeWriter, which is a bit frustrating.

> For a+b, Python first computes "a", then "b", and finally "a+b". I don't see 
> how your API could optimize such code.

I mean this situation:

s = 'a' * 100_000_000 + '\uABCD'
b = s.encode('utf-8')
b.decode('utf-8')  # <- this situation

I realize I was wrong: the UCS1->UCS2 transformation is only done once, so this 
only saves one memcpy().

Even in this case it would only save two memcpy() calls:

s = 'a' * 100_000_000 + '\uABCD' * 100_000_000 + '\U00012345'
b = s.encode('utf-8')
b.decode('utf-8')
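
The count of kind switches in examples like the ones above can be checked with 
a small model. This is only a sketch, assuming a writer that starts at the 
1-byte kind and widens whenever it meets a character that does not fit; 
char_width and kind_switch_positions are hypothetical helpers, not CPython 
APIs.

```python
def char_width(ch):
    """Bytes needed to store one code point: 1 (UCS1), 2 (UCS2) or 4 (UCS4)."""
    cp = ord(ch)
    if cp < 0x100:
        return 1
    if cp < 0x10000:
        return 2
    return 4


def kind_switch_positions(s):
    """Return (index, new_width) for each point where the writer must widen."""
    positions = []
    current = 1  # assumption: the writer starts with the narrowest kind
    for i, ch in enumerate(s):
        width = char_width(ch)
        if width > current:
            positions.append((i, width))
            current = width
    return positions


# Scaled-down versions of the two strings discussed above.
one_switch = kind_switch_positions('a' * 5 + '\uABCD')
two_switches = kind_switch_positions('a' * 5 + '\uABCD' * 5 + '\U00012345')
```

In this model the first string widens once and the second twice, no matter how 
long the runs are, which matches the observation that recording the switching 
positions would save at most one or two copies here.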
___
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/KFSOMXABV3OHLL3MW3MULYONVIP6O2WT/