Mark Bell <mark00b...@googlemail.com> added the comment:

I have taken a look at the original patch and updated it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (the opposite of the `prune` flag).

I had to make a few extra changes, since functions such as PyUnicode_Split now contain shortcuts: if len(self) < len(sep), they can simply return [self]. That shortcut now needs an extra test, because it can only be used if len(self) > 0 (an empty string must give [] rather than [''] when empty items are dropped).

You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty

However, in exploring this, I'm not sure that the patch interacts correctly with maxsplit. For example, '  x y z'.split(maxsplit=1, keepempty=True) results in ['', '', 'x', 'y z'], since the first two empty strings are "free" and don't count towards maxsplit. I think the length of the returned list must be <= maxsplit + 1. Is this right?

I'm about to rework the logic to avoid this, but before I go too far, could someone double-check my test cases to make sure that I have the correct idea of how this is supposed to work? Only the 8 lines marked "New case" show new behaviour; all the others come from how str.split works currently. Of course, the same patterns should apply to bytestrings and bytearrays.

''.split() == []
''.split(' ') == ['']
''.split(' ', keepempty=False) == []    # New case

'  '.split(' ') == ['', '', '']
'  '.split(' ', maxsplit=1) == ['', ' ']
'  '.split(' ', maxsplit=1, keepempty=False) == [' ']    # New case

'  a b c  '.split() == ['a', 'b', 'c']
'  a b c  '.split(maxsplit=0) == ['a b c  ']
'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

'  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
'  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
'  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
'  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
'  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
'  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
'  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

'  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']    # New case
'  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']    # New case
'  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']    # New case
'  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']    # New case
'  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']    # New case
'  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']    # New case

----------
nosy: +Mark.Bell

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue28937>
_______________________________________
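
For anyone checking the cases above, here is a minimal pure-Python sketch of one possible reading of them. The helper `split_model` is hypothetical: it is not the C code in the patch, it handles only an explicit separator (not `sep=None`), and it assumes that dropped empty items do not consume `maxsplit` while the result is still capped at `maxsplit + 1` items. It has only been checked against the empty-string case and the '  a b c  ' cases (two leading and two trailing spaces).

```python
def split_model(s, sep, maxsplit=-1, keepempty=True):
    """Hypothetical model of str.split(sep, maxsplit) with a keepempty flag.

    Assumption: only items that are actually emitted count towards
    maxsplit, so len(result) <= maxsplit + 1 whenever maxsplit >= 0.
    """
    if not sep:
        raise ValueError("empty separator")
    parts = []
    start = 0   # index where the next item begins
    splits = 0  # number of items emitted so far inside the loop
    while maxsplit < 0 or splits < maxsplit:
        idx = s.find(sep, start)
        if idx == -1:
            break
        piece = s[start:idx]
        start = idx + len(sep)
        if piece or keepempty:
            parts.append(piece)
            splits += 1
    # Whatever is left after the last counted split is a single final item.
    rest = s[start:]
    if rest or keepempty:
        parts.append(rest)
    return parts


# With keepempty=True the model should agree with the current str.split.
print(split_model('  a b c  ', ' ', maxsplit=1))  # ['', ' a b c  ']
# With keepempty=False the leading empty items are dropped "for free".
print(split_model('  a b c  ', ' ', maxsplit=1, keepempty=False))  # ['a', 'b c  ']
```

With `keepempty=True` the sketch is meant to match today's `str.split` for an explicit separator, which makes it easy to compare against the real method before trusting it on the new cases.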