John Nagle wrote:
The regular expression "split" behaves slightly differently than string
split:
>>> import re
>>> kresplit2 = re.compile(r'[^\w\&]+',re.UNICODE)
>>> kresplit2.split(" HELLO THERE ")
['', 'HELLO', 'THERE', '']
>>> kresplit2.split("VERISIGN INC.")
['VERISIGN', 'INC', '']
I'd thought that "split" would never produce an empty string, but
it will.
The regular string split operation doesn't yield empty strings:
>>> " HELLO THERE ".split()
['HELLO', 'THERE']
Yes it does, if you pass an explicit separator:
>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']
If I try to get the functionality of string split with re:
>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']
I still get empty strings.
The documentation just describes re.split as "Split string by the
occurrences of pattern", which is not too helpful.
It's the plain str.split() which is unusual in that:
1. it splits on sequences of whitespace instead of one per occurrence;
2. it discards leading and trailing sequences of whitespace.
Compare:
>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']
with:
>>> "  A  B  ".split()
['A', 'B']
It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)
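
If you do want str.split()-style results from a regular expression, here is a minimal sketch of two ways that should work. The helper name split_nonempty is made up for illustration and is not part of the re module. The first approach filters the empty strings out of the re.split result, and the second matches the tokens themselves with re.findall instead of matching the separators.

```python
import re

def split_nonempty(pattern, text):
    """Split text on a compiled pattern and drop empty strings.

    Hypothetical helper for illustration only; not part of the re module.
    """
    return [piece for piece in pattern.split(text) if piece]

nonword = re.compile(r'\W+', re.UNICODE)
s2 = " HELLO THERE "

# Approach 1: split on the separators, then discard the empty strings
# that appear when the text starts or ends with a separator.
filtered = split_nonempty(nonword, s2)

# Approach 2: match runs of word characters directly, so there are
# no separators at the edges to produce empty strings.
found = re.findall(r'\w+', s2, re.UNICODE)

print(filtered)
print(found)
```

Both approaches leave out the empty strings at the ends, which is the part of str.split() behaviour the original question was after.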