It's me wrote:
I am never very good with regular expressions. My head always hurts whenever I need to use them.

Well, they are a pain to more than just you, and the conventional advice is "even when you are convinced you need to use REs, try and find another way".

I need to read a data file and parse each data record. Each item on the data record begins with either a string or a list of strings. I searched around and didn't see any existing Python package that does that. scanf.py, for instance, can handle standard items but doesn't know about lists. So I figure I might have to write a lex engine for it, and of course I have to deal with REs again.

Well, you haven't yet convinced me that you *have* to. Personally, I think you just like trouble :-)
But I ran into a problem right from the start. To recognize a list, I need an RE for a string that:
1) begins with [" (a left bracket followed by a double quote, with zero or more spaces in between)
2) is followed by any characters up to a ], but only if that right bracket is not preceded by the escape character \.
So the pattern is
1. If the line begins with a "[" it should end with a "]"
2. Otherwise, it shouldn't?
I'm trying to gently point out that the syntax you want to accept isn't actually very clear. If the format is "Python strings and lists of strings" then you might want to use the Python lexer to parse them, but that's quite an advanced topic. [too advanced for me :-]
The problem is matching "up to a right bracket not preceded by a backslash". This seems to require what's technically referred to as a "negative lookbehind assertion": a pattern that consumes no characters itself, but succeeds only when the text just before the current position does not match a given sub-pattern.
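As a minimal sketch of that idea in Python's re module (the variable names here are illustrative, not part of the original question), the assertion (?<!\\) can be placed directly in front of the bracket so that only an unescaped "]" matches:

```python
import re

# (?<!\\)\] matches a "]" only when the character before it is not a backslash.
unescaped_close = re.compile(r'(?<!\\)\]')

text = r'["This line\] works"]'
m = unescaped_close.search(text)

# The escaped "\]" in the middle is skipped; only the final bracket matches.
print(m.start(), len(text) - 1)
```

The lookbehind itself takes up no characters in the match, so the match still consists of the single "]".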
So, I tried:
^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
and tested with:
["This line\] works"]
but it fails with:
["This line fails"]
I would have thought that:
(\\\])*
should work because it's zero or more occurrences of the pattern \]
Any help is greatly appreciated.
Sorry for being OT. I posted this question at the lex group and didn't get any response. I figured maybe somebody around here would know.
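For what it's worth, the pattern above can be checked against both test lines with Python's re module. This is only a quick sketch, and it assumes the pattern is used unchanged as a Python raw string. Under that assumption both lines appear to match, so the reported failure may be specific to the tool the pattern was originally tried with rather than to the (\\\])* part:

```python
import re

# The pattern from the question, unchanged, written as a raw string.
pattern = re.compile(r'^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]')

results = {}
for line in (r'["This line\] works"]', r'["This line fails"]'):
    results[line] = pattern.match(line) is not None
    print(line, results[line])
```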
I'd start with baby steps. First of all, make sure that you can match the individual strings. Then use that pattern, parenthesized to turn it into a group, as a component in a more complex pattern.
Do you want to treat "this is also \" a string" as an allowable string? In that case you need a pattern that matches "up to the first quotation mark not preceded by a backslash" as well!
Let's try matching a single string first:
>>> import re
>>> s = re.compile(r'(".*?(?<!\\)")')
>>> s.match('"s1", "s2"').groups()
('"s1"',)
Note that I followed the "*" with a "?" to stop it being greedy, and matching as many characters as it could. OK, does that work when we have escaped quotation marks?
>>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""',)
Apparently so. The negative lookbehind assertion stops a quote from matching when it's preceded by a backslash. Can we match a comma-separated list of such strings?
>>> slpat = r'(".*?(?<!\\)")(?:, (".*?(?<!\\)"))*'
>>> s = re.compile(slpat)
This is a bit trickier: here the second grouping beginning with "(?:" is intended to ensure that only the strings that get matched are included in the groups, not the separators, even though they must be grouped together. The list *must* be separated by ", ", but you could alter the pattern to allow zero or more whitespace characters.
>>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""', '"s2"')
Well, that seems to work. Note that these patterns all ignore bracket characters, so all you need to do now is to surround them with patterns to match the opening and closing brackets, and you're done (I hope).
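One way the remaining step might look is sketched below. The names STRING, LIST and parse_string_list are made up for illustration, and the whitespace handling (\s* around commas and inside the brackets) is an assumption about the data format rather than something stated in the question. Note also that a repeated group keeps only its last match, so groups() on a list pattern like the one above reports at most two strings; the sketch therefore validates the whole list first and then collects every string with re.findall:

```python
import re

# One double-quoted string; the closing quote must not follow a backslash.
STRING = r'".*?(?<!\\)"'

# Hypothetical list pattern: "[", strings separated by commas, "]",
# with optional whitespace allowed around each separator.
LIST = re.compile(r'\[\s*' + STRING + r'(?:\s*,\s*' + STRING + r')*\s*\]')

def parse_string_list(text):
    """Return the quoted strings of a bracketed list at the start of text, or None."""
    m = LIST.match(text)
    if m is None:
        return None
    # findall returns every string, not just the last repetition of a group.
    return re.findall(STRING, m.group(0))

print(parse_string_list(r'["s1\"\"", "s2",  "s3"]'))
print(parse_string_list(r'["This line\] works"]'))
```

Because brackets inside the quotes are consumed by the string pattern, an escaped "\]" inside a string does not end the list early. This is a sketch only: malformed input, such as a missing comma between two strings, is not reliably rejected.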
Anyway, it'll give you a few ideas to work with.
regards
 Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119