Re: Behavior of re.split on empty strings is unexpected

MRAB Mon, 02 Aug 2010 11:05:40 -0700

John Nagle wrote:

The regular expression "split" behaves slightly differently from the string split:


 >>> import re

 >>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)

 >>> kresplit2.split("   HELLO    THERE   ")
['', 'HELLO', 'THERE', '']

 >>> kresplit2.split("VERISIGN INC.")
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but
it will.

The regular string split operation doesn't yield empty strings:

 >>> "   HELLO   THERE ".split()
['HELLO', 'THERE']

Yes it does, when you give it an explicit separator:

>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

If I try to get the functionality of string split with re:

 >>> s2 = "   HELLO   THERE  "
 >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
 >>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.
The documentation just describes re.split as "Split string by the occurrences of pattern", which is not too helpful.

It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:

>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:

>>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)
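
If what you actually want is the str.split() behaviour with a regex as the separator, two common workarounds are to drop the empty strings after splitting, or to match the words themselves with re.findall instead of matching the gaps. A minimal sketch follows. The helper names split_words and find_words are made up for illustration and are not part of the re module. It assumes Python 3, where str patterns are already Unicode-aware; on Python 2 you would pass re.UNICODE as in the session above.

```python
import re

def split_words(text, pattern=r'\W+'):
    """Split text on a regex, discarding the empty strings that
    re.split() produces for leading and trailing separators."""
    return [piece for piece in re.split(pattern, text) if piece]

def find_words(text, pattern=r'\w+'):
    """Match the words directly instead of splitting on the gaps."""
    return re.findall(pattern, text)

print(split_words("   HELLO   THERE  "))         # ['HELLO', 'THERE']
print(split_words("VERISIGN INC."))              # ['VERISIGN', 'INC']
print(split_words("AT&T CORP.", r'[^\w&]+'))     # ['AT&T', 'CORP']
print(find_words("   HELLO   THERE  "))          # ['HELLO', 'THERE']
```

The filter keeps re.split's one-field-per-separator rule and only throws away the empty fields, so it should not be used where an empty field carries meaning (CSV-style data, for example).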
--
http://mail.python.org/mailman/listinfo/python-list
