On Thu, Feb 19, 2009 at 12:55 PM, Ron Garret <rnospa...@flownet.com> wrote:
> I'm trying to split a CamelCase string into its constituent components.
> This kind of works:
>
> >>> re.split('[a-z][A-Z]', 'fooBarBaz')
> ['fo', 'a', 'az']
>
> but it consumes the boundary characters.  To fix this I tried using
> lookahead and lookbehind patterns instead, but it doesn't work:
>
> >>> re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
> ['fooBarBaz']
>
> However, it does seem to work with findall:
>
> >>> re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
> ['', '']
>
> So the regular expression seems to be doing the Right Thing.  Is this a
> bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length boundaries.
Like str.split, it needs something to split on. Is this a bug?
Possibly. The docs for re.split say:

    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.

I can work around the problem thusly:

    re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly. I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.

Kurt
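
For what it's worth, here is a rough sketch of both workarounds as small
functions. The function names and the NUL marker are made up for
illustration, and the findall pattern assumes the input contains only
ASCII letters; digits or other characters would need a different
pattern. An alternation handles the beginning and end of the string, so
no special-casing is needed there.

```python
import re

def split_camel_sub(s, marker='\x00'):
    # Hypothetical helper: insert a marker at every lower-to-upper
    # boundary, then split on it with str.split.
    # Assumption: the marker character never occurs in s.
    marked = re.sub(r'(?<=[a-z])(?=[A-Z])', lambda m: marker, s)
    return marked.split(marker)

def split_camel_findall(s):
    # Hypothetical helper: match the components, not the boundaries.
    # A component is any run of uppercase letters followed by at least
    # one lowercase letter, or a run of uppercase letters alone.
    # Assumption: s consists of ASCII letters only.
    return re.findall(r'[A-Z]*[a-z]+|[A-Z]+', s)

print(split_camel_sub('fooBarBaz'))      # ['foo', 'Bar', 'Baz']
print(split_camel_findall('fooBarBaz'))  # ['foo', 'Bar', 'Baz']
```

Neither version relies on re.split handling empty matches. As far as I
know, Python 3.7 and later do split on empty matches, so the lookaround
pattern can be passed to re.split directly there.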