[issue28937] str.split(): allow removing empty strings (when sep is not None)

Mark Bell Tue, 18 May 2021 06:14:42 -0700

Mark Bell <mark00b...@googlemail.com> added the comment:

I have taken a look at the original patch and updated it so that it is 
compatible with the current release. I have also flipped the logic in the 
wrapping functions so that they take a `keepempty` flag (the opposite of the 
`prune` flag).
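
For the case without maxsplit, the effect that `keepempty=False` is meant to 
have can be sketched with the filtering idiom it would replace. The helper 
name `split_drop_empty` is made up for illustration and is not part of the 
patch; how maxsplit should interact with it is deliberately left out here.

```python
def split_drop_empty(s, sep):
    """Hypothetical helper: roughly s.split(sep, keepempty=False), no maxsplit."""
    # Split as usual, then discard every empty item.
    return [part for part in s.split(sep) if part]


print(split_drop_empty('', ' '))
print(split_drop_empty('  a b c  ', ' '))
```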


I had to make a few extra changes since there are now some shortcuts in 
things like PyUnicode_Split, which spot that if len(self) < len(sep) then they 
can just return [self]. That now needs an extra test, since when empty strings 
are being removed the shortcut can only be used if len(self) > 0. You can find 
the code here: 
https://github.com/markcbell/cpython/tree/split-keepempty
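
The extra test on that shortcut can be modelled in Python as follows. This is 
only a sketch of the logic, not the C code from the branch: the function name 
`split_shortcut` is hypothetical, and it assumes the shortcut fires when the 
string is shorter than the separator.

```python
def split_shortcut(self, sep, keepempty=True):
    """Hypothetical Python model of the early-return check (not the C code)."""
    if len(self) < len(sep):
        # sep cannot occur in self, so the only candidate item is self itself.
        if keepempty or len(self) > 0:
            return [self]
        # self is '' and empty items are being dropped.
        return []
    # Shortcut does not apply; the real splitting loop would run instead.
    return None
```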

However in exploring this, I'm not sure that this patch interacts correctly 
with maxsplit. For example, 
    '   x y z'.split(maxsplit=1, keepempty=True)
results in
    ['', '', 'x', 'y z']
since the first two empty string items are "free" and don't count towards 
maxsplit. I think the length of the result returned must be <= maxsplit + 1; 
is this right?
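
For reference, the unpatched str.split already stays within that bound. A 
small sketch that checks it on a few inputs (the helper name 
`check_maxsplit_bound` and the choice of inputs are only for illustration):

```python
def check_maxsplit_bound(s, sep, limit=8):
    """Return True if len(s.split(sep, n)) <= n + 1 for every n in range(limit)."""
    for maxsplit in range(limit):
        parts = s.split(sep, maxsplit)
        if len(parts) > maxsplit + 1:
            return False
    return True


# sep=None means "split on runs of whitespace", as in a bare .split().
cases = [('  a b c  ', ' '), ('  a b c  ', None), ('', ' '), ('  ', ' ')]
print(all(check_maxsplit_bound(s, sep) for s, sep in cases))
```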

I'm about to rework the logic to avoid this, but before I go too far could 
someone please double check my test cases to make sure that I have the correct 
idea about how this is supposed to work? Only the 8 lines marked "New case" 
show new behaviour; all the others come from how str.split works currently. 
Of course the same patterns should apply to bytestrings and bytearrays.

    ''.split() == []
    ''.split(' ') == ['']
    ''.split(' ', keepempty=False) == []    # New case

    '  '.split(' ') == ['', '', '']
    '  '.split(' ', maxsplit=1) == ['', ' ']
    '  '.split(' ', maxsplit=1, keepempty=False) == [' ']    # New case

    '  a b c  '.split() == ['a', 'b', 'c']
    '  a b c  '.split(maxsplit=0) == ['a b c  ']
    '  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

    '  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
    '  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
    '  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
    '  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
    '  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
    '  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
    '  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
    '  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

    '  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']    # New case
    '  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']    # New case
    '  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']    # New case
    '  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']    # New case
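
The lines above that are not marked "New case" can be checked against an 
unpatched interpreter. The sketch below does only that; it does not exercise 
`keepempty`, which exists only in the patched build, and the variable names 
are just for illustration.

```python
s = '  a b c  '

# (actual result on an unpatched interpreter, expected result from the list above)
current_cases = [
    (''.split(), []),
    (''.split(' '), ['']),
    ('  '.split(' '), ['', '', '']),
    ('  '.split(' ', maxsplit=1), ['', ' ']),
    (s.split(), ['a', 'b', 'c']),
    (s.split(maxsplit=0), ['a b c  ']),
    (s.split(maxsplit=1), ['a', 'b c  ']),
    (s.split(' '), ['', '', 'a', 'b', 'c', '', '']),
    (s.split(' ', maxsplit=0), ['  a b c  ']),
    (s.split(' ', maxsplit=1), ['', ' a b c  ']),
    (s.split(' ', maxsplit=2), ['', '', 'a b c  ']),
    (s.split(' ', maxsplit=3), ['', '', 'a', 'b c  ']),
    (s.split(' ', maxsplit=4), ['', '', 'a', 'b', 'c  ']),
    (s.split(' ', maxsplit=5), ['', '', 'a', 'b', 'c', ' ']),
    (s.split(' ', maxsplit=6), ['', '', 'a', 'b', 'c', '', '']),
]

# Collect any case where the interpreter disagrees with the list.
mismatches = [(got, want) for got, want in current_cases if got != want]
print(mismatches)
```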

----------
nosy: +Mark.Bell

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue28937>
_______________________________________