On 8 Apr, 19:49, gry <georgeryo...@gmail.com> wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$',
> '555tHe-rain.in#=1234').groups()
>
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
> group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.
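On the regex itself: it is legal. A group that sits inside a repetition only keeps the text from its last iteration, so .groups() reports the final match of each group rather than every piece. Group 1 is the outer alternation (last matched '1234'), group 2 is the letters ('in'), group 3 is the digits ('1234') and group 4 is the punctuation ('='). That is why '1234' shows up twice and 'tHe' never appears. A minimal sketch that collects every piece with finditer instead; the name tokenize is made up for illustration, and it assumes "other characters" means any single character that is not an ASCII letter or digit:

```python
import re

# One named alternative per kind of piece. The final '.' only gets a
# chance when the letter and digit alternatives fail, so it picks up
# exactly one "other" character (assumption: anything non-alphanumeric).
TOKEN = re.compile(
    r'(?P<letters>[A-Za-z]+)|(?P<digits>[0-9]+)|(?P<other>.)',
    re.DOTALL,
)

def tokenize(text):
    """Hypothetical helper: split text into pieces, digit runs become ints."""
    return [
        int(m.group()) if m.lastgroup == 'digits' else m.group()
        for m in TOKEN.finditer(text)
    ]

print(tokenize('555tHe-rain.in#=1234'))
# [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
```

Each finditer match is one piece, and m.lastgroup says which alternative matched, so nothing is overwritten by a later iteration.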
Avoiding re's (for a bit of fun): (no good for unicode obviously)

import string
from itertools import groupby, chain, repeat, count, izip

s = """555tHe-rain.in#=1234"""

unique_group = count()
lookup = dict(
    chain(
        izip(string.ascii_letters, repeat('L')),
        izip(string.digits, repeat('D')),
        izip(string.punctuation, unique_group)
    )
)
parse = dict(D=int, L=str.capitalize)

print [ parse.get(key, lambda L: L)(''.join(items))
        for key, items in groupby(s, lambda L: lookup[L]) ]

[555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]

Jon.
-- 
http://mail.python.org/mailman/listinfo/python-list
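The snippet above is Python 2 (izip, print statement), while the question mentions python3.1.1. A rough Python 3 port, as a sketch only: the built-in zip stands in for izip, and the name result is introduced here for illustration.

```python
import string
from itertools import groupby, chain, repeat, count

s = "555tHe-rain.in#=1234"

# Letters share key 'L' and digits share key 'D'. Each punctuation
# character gets its own integer key, so different neighbouring
# punctuation characters (like '#' and '=') are not merged.
lookup = dict(chain(
    zip(string.ascii_letters, repeat('L')),
    zip(string.digits, repeat('D')),
    zip(string.punctuation, count()),
))

# Converters per key; keys without an entry are passed through unchanged.
parse = {'D': int, 'L': str.capitalize}

result = [
    parse.get(key, lambda text: text)(''.join(items))
    for key, items in groupby(s, lambda ch: lookup[ch])
]
print(result)
# [555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]
```

Swapping str.capitalize for str in parse keeps the original case ('tHe' rather than 'The'). A repeated identical punctuation character such as '--' shares one key and so comes out as a single piece.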