On Apr 8, 1:49 pm, gry <georgeryo...@gmail.com> wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$',
> ...          '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
> group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.
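On the "why" part: the regex is legal. A capture group that sits inside a repetition only remembers the last thing it matched, so each of the four groups reports its final match, not a list of all of them. Group 1 is the outer alternation and group 3 is the digits branch, and both last matched '1234'; 'tHe' was overwritten in group 2 by the later 'in'. A small sketch of this, where the variable names are just for illustration:

```python
import re

s = '555tHe-rain.in#=1234'

# The original pattern: the outer group is repeated by '+', so every
# group is overwritten each time it takes part in a new iteration.
repeated = re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', s).groups()

# The same alternation without the repetition matches only the first
# piece. Group 1 (outer) and group 3 (digits) both hold it, and the
# branches that did not take part are None.
single = re.match('(([A-Za-z]+)|([0-9]+)|([-.#=]))', s).groups()

print(repeated)
print(single)
```

The second result shows where the duplicate comes from: the outer group and whichever inner branch matched always hold the same text.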
IMO, for most purposes, for people who don't want to become re
experts, the easiest, fastest, best, most predictable way to use re is
re.split. You can either call re.split directly, or, if you are going
to be splitting on the same pattern over and over, compile the pattern
and grab its split method. Use a *single* capture group in the
pattern, that covers the *whole* pattern. In the case of your example
data:

>>> import re
>>> splitter=re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
>>> s='555tHe-rain.in#=1234'
>>> [x for x in splitter(s) if x]
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The reason for the list comprehension is that re.split will always
return a non-matching string between matches. Sometimes this is
useful even when it is a null string (see recent discussion in the
group about splitting digits out of a string), but if you don't care
to see null (empty) strings, this comprehension will remove them.

The reason for a single capture group that covers the whole pattern is
that it is much easier to reason about the output. The split will
give you all your data, in order, e.g.

>>> ''.join(splitter(s)) == s
True

HTH,
Pat
--
http://mail.python.org/mailman/listinfo/python-list
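P.S. The desired output in the original question shows 555 and 1234 as ints, while the split gives strings. If that matters, one possible follow-up step is sketched below. The helper name `tokens` is hypothetical, and it assumes re.fullmatch is available (Python 3.4 or later, so not the 3.1.1 mentioned in the question).

```python
import re

# Same pattern as in the reply: a single capture group around the whole
# alternation, so re.split returns the matched pieces as well.
splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split

def tokens(s):
    """Hypothetical helper: split s into pieces, with digit runs as ints."""
    pieces = [x for x in splitter(s) if x]
    # Convert only pieces made up entirely of ASCII digits; anything
    # else (letters, punctuation, unmatched text) stays a string.
    return [int(x) if re.fullmatch('[0-9]+', x) else x for x in pieces]

result = tokens('555tHe-rain.in#=1234')
print(result)
```

Note that after the conversion the ''.join round trip no longer works directly, since the list now mixes ints and strings.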