On 05/13/12 16:14, Massi wrote:
> Hi everyone,
> I know this question has been asked thousands of times, but in my case
> I have an additional requirement to be satisfied. I need to handle
> substrings in the form 'string with spaces':'another string with
> spaces' as a single token; I mean, if I have this string:
>
> s ="This is a 'simple test':'string which' shows 'exactly my'
> problem"
>
> I need to split it as follow (the single quotes must be mantained in
> the splitted list):

The "quotes must be maintained" bit is what makes this different from
most common use-cases. Without that condition, using shlex.split()
from the standard library does everything else that you need.
Alternatively, one might try hacking csv.reader() to do the splitting
for you, though I had less luck with that than with shlex.

> Up to know I have written some ugly code which uses regular
> expression:
>
> splitter = re.compile("(?=\s|^)('[^']+') | ('[^']+')(?=\s|$)")

You might try

    r = re.compile(r"""(?:'[^']*'|"[^"]*"|[^'" ]+)+""")
    print(r.findall(s))

which seems to match your desired output. It doesn't currently handle
tabs, but by breaking it out, it's easy to modify (and may help in
understanding what it's doing):

    >>> single_quoted = "'[^']*'"
    >>> double_quoted = '"[^"]*"'
    >>> other = """[^'" \t]+"""  # added a "\t" tab here
    >>> matches = '|'.join((single_quoted, double_quoted, other))
    >>> regex = r'(?:%s)+' % matches
    >>> r = re.compile(regex)
    >>> r.findall(s)
    ['This', 'is', 'a', "'simple test':'string which'", 'shows', "'exactly my'", 'problem']

Hope this helps,

-tkc
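
For comparison, the two approaches discussed above can be put side by side in one small script. This is only a sketch: the names TOKEN and split_keep_quotes are made up for illustration, and the "other" piece uses \s (any whitespace) where the version above lists space and tab explicitly, which is an assumption about what is wanted rather than part of the original pattern.

```python
import re
import shlex

s = "This is a 'simple test':'string which' shows 'exactly my' problem"

# shlex.split() groups the quoted runs correctly, but it strips the
# quote characters, which is why it does not meet the requirement.
print(shlex.split(s))

# Same building blocks as in the regex above. Assumption: \s replaces
# the explicit space/tab so that newlines etc. also separate tokens.
SINGLE_QUOTED = r"'[^']*'"
DOUBLE_QUOTED = r'"[^"]*"'
OTHER = r"""[^'"\s]+"""

# A token is one or more adjacent pieces with no whitespace between them.
TOKEN = re.compile("(?:%s)+" % "|".join((SINGLE_QUOTED, DOUBLE_QUOTED, OTHER)))


def split_keep_quotes(text):
    """Hypothetical helper: split on whitespace, keeping quoted runs
    (quotes included) inside a single token."""
    return TOKEN.findall(text)


print(split_keep_quotes(s))
```

The pattern contains only non-capturing groups, so findall() returns each whole match. Note that it makes no attempt to report unbalanced quotes: a quote character without a partner is skipped rather than raising an error.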