tag 22606 notabug
thanks

On 02/08/2016 06:28 PM, Chris Calabro wrote:
Thanks for the report. However, the behavior you see is intentional.

It helps to read POSIX on how grep uses Basic Regular Expressions:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

In particular:

"This section uses the term "invalid" for certain constructs or conditions. Invalid REs shall cause the utility or function using the RE to generate an error condition. When invalid is not used, violations of the specified syntax or semantics for REs produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as literal characters to be matched."

> (1) grep '?' # matches literal ?
> # (i would expect a parse error, but whatever)

Well-defined behavior per POSIX. '?' is an ordinary character in BRE, which means it MUST match a literal '?'. No parse error is possible. (But if this were an ERE, where ? is a metacharacter but appears at the start of the expression, this would be undefined behavior.)

> (2) grep '\?' # also matches literal ?

Undefined, but not invalid, behavior per POSIX, because POSIX says \ in BRE is only well-defined when immediately before a metacharacter, but ? is not a metacharacter in BRE. So we can make it do whatever we want. We defined BRE "backslash-question" to mean "behave like ERE question" - but ERE question is undefined at the start of the expression, so we have yet another choice: report it as a syntax error, or match a literal "?" instead. As you can see, we match the literal "?" instead.

> (3) grep ' \?' # matches everything, but given that the last
> # expression matches a literal ?, i would expect this
> # to match a space followed by a literal ?

Undefined, but not invalid, behavior per POSIX.
As in (2), we make it behave like ERE "question", which means this regular expression now matches 0 or 1 instance of space, and since 0 spaces can be matched on anything, the overall expression matches everything (well, insofar as there are no encoding errors to mess up the definition of "everything").

>
> notice that * does not have the same behavior:
> (4) grep '*' # matches literal *
> # (i would expect a parse error, but whatever)

Well-defined per POSIX, where it MUST match a literal "*" when used as the first character of the BRE. (But if this were an ERE, it would be undefined behavior to start the expression with *.)

> (5) grep '\*' # matches literal *

Well-defined per POSIX, where it MUST match a literal "*" (since the backslash says to treat the * as an ordinary character instead of its usual metacharacter meaning).

> (6) grep ' \*' # matches space followed by literal *

Well-defined per POSIX, where it MUST match the two-character sequence space then star.

>
> cases (3) and (6) behave differently.

Well, yeah, because ? and * are not identical in BRE - one is a literal character unless you use backslash to escape it into our extension of behaving like a metacharacter; the other is a metacharacter unless you use backslash to escape it into an ordinary character.

> imo (6) looks reasonable, but (3)
> does not. could someone comment on whether this is working as intended,
> and if so, what is the rationale?

The rationale is history. POSIX standardized Basic Regular Expressions based on existing practice, and the original implementation of grep did NOT support ? as a special character at all. Later on, other programs invented Extended Regular Expressions and gave ? a special meaning. Even later, people realized that ? was useful in BRE as well, but since BRE was already baked in with '?' matching literally, we had to invent '\?' as the extension to use it as a metacharacter.
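The six cases above can be replayed with a small script. This is only a sketch: the sample lines and the count helper are made up for illustration and are not part of the original report. Only the counts shown for (1), (4), (5), and (6) follow from POSIX; what (2) and (3) print depends on the grep in use, and the comments on those two lines assume GNU grep behaves as described in this message.

```shell
#!/bin/sh
# Hypothetical sample input: five lines, chosen so each case is visible.
sample() {
    printf '%s\n' 'plain' 'what?' ' ?' 'a*b' ' *'
}

# Hypothetical helper: print how many sample lines match the BRE in $1.
# Diagnostics are discarded because some grep versions may warn about a
# repetition operator at the start of an expression, and "|| true" keeps
# the script going when grep exits non-zero on zero matches.
count() {
    sample | grep -c -e "$1" 2>/dev/null || true
}

count '?'      # (1) ordinary character in BRE: prints 2 ('what?' and ' ?')
count '\?'     # (2) undefined per POSIX; GNU grep matches a literal ? here
count ' \?'    # (3) undefined per POSIX; GNU grep: optional space, so every line
count '*'      # (4) leading * is literal in BRE: prints 2 ('a*b' and ' *')
count '\*'     # (5) escaped * is literal: prints 2
count ' \*'    # (6) space followed by literal *: prints 1 (' *')
```

Running the same patterns with grep -E instead would move (1) and (4) into undefined territory, since ? and * are then metacharacters at the start of the expression.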
If you could build a time machine and go back 40 years to invent regular expressions from scratch, please have the decency to invent just ONE syntax, not 20 disparate flavors (of which the two most popular became POSIX BRE and ERE, with weird rules on what is valid where).

But since this behavior is intentional and either required or allowed by POSIX, I'm closing this as not a bug. Feel free to reply to the thread with further questions, though.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org