Thomas 'PointedEars' Lahn wrote:

> Jason Bailey wrote:
>> shared-network My-Network-MOHE {
>> […] {
>>
>> I compile my regex:
>> m = re.compile(r"^(shared\-network (" + re.escape(shared_network) + r")
>> \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
>
> This code does not run as posted.  Applying Occam’s Razor, I think you
> meant to post
>
>   m = re.compile(r"^(shared\-network ("
>     + re.escape(shared_network)
>     + r") \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
>
> […]
> You get no matches because you have escaped the HYPHEN-MINUSes (“-”).  You
> never need to escape those characters, in fact you must not do that here
> because r'\-' is not an (unnecessarily) escaped HYPHEN-MINUS, it is a
> literal backslash followed by a HYPHEN-MINUS, a character sequence that
> does not occur in your string.  Outside of a character class you do not
> need to do that, and in a character class you can put it as first or last
> character instead (“[-…]” or “[…-]”).
>
> You have escaped the first HYPHEN-MINUS; re.escape() has escaped the other
> two for you:
>
> | >>> re.escape('-')
> | '\\-'
>
> I presume this behavior is because of character classes, and the idea that
> the return value should work at any position in a character class.

It would appear that while my answer is not entirely wrong, the first
sentence of that section is.  You may escape the HYPHEN-MINUS there, and you
may use re.escape(); neither has any effect on what the expression matches,
because of what I said following that sentence.  One must consider that the
string is first parsed by Python’s string parser and then by Python’s re
parser: r'\-' reaches the re parser as a backslash followed by a
HYPHEN-MINUS, which the re parser reads as an escaped, i.e. literal, “-”.

So I presently have no specific idea why you get no matches.  However,
r'\{((\n|.|\r\n)*?)(^\})' is not a proper way to match matching braces and
everything in-between.

To begin with, the proper expression to match any newline is r'(\r?\n|\r)'
because the first matching alternative in an alternation, not the longest
one, wins.  But if you specify re.DOTALL, you can simply use “.” for any
character (including any newline combination).

> […]
> You should refrain from parsing non-regular languages with a *single*
> regular expression (multiple expressions or expressions with alternation
> in a loop are usually fine; this can be used for building efficient
> parsers), even though Python’s regular expressions, which are not an
> exception there, are not exactly “regular” in the theoretical computer
> science sense.  See the Chomsky hierarchy and Jeffrey E. F. Friedl’s
> insightful textbook “Mastering Regular Expressions”.

And for matching matching braces (sic!) with regular expressions, you need a
recursive one (which is another extension of regular expressions as they are
discussed in CS).  Or a parser in the first place.  Otherwise you match too
much with greedy matching

  { { } } { { } }
  ^-------------^

or too little with non-greedy matching

  { { } } { { } }
  ^---^

CS regular expressions can be used to describe *regular* languages (Chomsky
type 3).  Bracket languages are, in general, not regular (see “pumping lemma
for regular languages”), so for them you need a PDA¹-like extension of CS
regular expressions (the aforementioned recursive ones), or a PDA
implementation in the first place.
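
For anyone who wants to check these claims, here is a minimal sketch.  The sample text is only modelled on the snippet in the thread (the subnet and option lines are made up), and match_braces() is a hypothetical helper, not anything from the re module.  It uses a plain depth counter as a stand-in for the stack of a PDA, which suffices for a single kind of bracket but does not handle braces inside strings or comments.

```python
import re

# Sample input modelled on the thread; the inner lines are invented.
shared_network = "My-Network-MOHE"
text = (
    "shared-network My-Network-MOHE {\n"
    "  subnet 10.0.0.0 netmask 255.255.255.0 {\n"
    "    option routers 10.0.0.1;\n"
    "  }\n"
    "}\n"
)

# 1. An escaped HYPHEN-MINUS (written by hand or produced by re.escape())
#    still matches a literal "-".
pattern = r"^shared\-network " + re.escape(shared_network) + r" \{"
escaped_ok = re.search(pattern, text, re.MULTILINE) is not None  # True

# 2. The first matching alternative wins, not the longest one.
first_alt = re.match(r"\n|.|\r\n", "\r\n").group()  # '\r' (the "." branch)
fixed_alt = re.match(r"\r?\n|\r", "\r\n").group()   # '\r\n'

# 3. Greedy matches too much, non-greedy too little.
braces = "{ { } } { { } }"
greedy = re.search(r"\{.*\}", braces).group()  # '{ { } } { { } }'
lazy = re.search(r"\{.*?\}", braces).group()   # '{ { }'


def match_braces(s, start):
    """Return the index just past the "}" that closes the "{" at s[start].

    Hypothetical helper: a depth counter in place of a PDA's stack.
    """
    if s[start] != "{":
        raise ValueError("no opening brace at index %d" % start)
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")


# 4. Use the regex only to find the header, then count braces for the body.
balanced = braces[:match_braces(braces, 0)]  # '{ { } }'
start = text.index("{")
block = text[start:match_braces(text, start)]
```

Here the regular expression only locates the “shared-network … {” header; finding the matching closing brace is left to the counter, so nested blocks are neither cut short nor overrun.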

Such a PDA implementation is part of a parser.

____
¹ <https://en.wikipedia.org/wiki/Pushdown_automaton>
-- 
PointedEars