OK. I've been trying to get my mind around this, and think about ways to improve the documentation (more about that below). I'm pretty sure that I can see the general concept now, and am almost convinced that it really does work as described. I guess I really don't like the whole RE inheriting it's preference from the first preference it encounters. It seems like it would be better to use a switch preceeding the whole expression, as is done with (?i) for case-insensitivity. Designating (?g) for greedy and (?G) for non-greedy seems like it would be less inviting of human error. But that's mostly a quibble, as long as it works consistently as described!

I've still got a few problems, though. Per 9.6.3.5, "An RE consisting of two or more branches connected by the | operator prefers longest match."

Apply that to this example:

   SELECT substring('abc' FROM '.*?|.*?');

which returns a greedy 'abc'. In this case, the whole RE is then supposed to be greedy (because of the "|"). So I suppose that _might_ be said to work as described, even though the | in this case overrides the expressed preference of both of its components (in which case I'm just griping about not liking the way this was implemented. :) )

But what about these two queries:

   SELECT substring('a' FROM 'a?|a?');

This returns a greedy 'a', similar to the example above.  But then why does

   SELECT substring('ba' FROM 'a?|a?');

return a non-greedy empty string? Using the logic from the previous examples, this should do a greedy match and return 'a' as well. If this isn't a bug, please explain why!

With regard to the documentation, after re-reading it many times I'd have to say the information is all there, but it's hard to absorb. I think the main problem is that the term "preference" is used to discuss greedy/non-greediness, as well as the words greedy & non-greedy. This makes it easier for humans to fail to connect the dots. People are more likely to be familiar with greedy & non-greedy (especially those who've used other regex's before), whereas preference isn't as clear. Perhaps the term "greediness preference" could be used in place of "preference", or "preference" could be dropped altogether.

In general, I think "prefers shortest match" could be replaced with "is non-greedy", and "prefers longest match" could be replaced with "is greedy". A little bit more context might be helpful as well, as in example c) below.

As an example, here's a couple of different possibilities for the second sentence of the section:

a) If the RE could match more than one substring starting at that point, its choice is determined by its greediness /preference/: either the longest substring (greedy), or the shortest (non-greedy).

b) If the RE could match more than one substring starting at that point, the match can be either greedy (matching the longest substring) or non-greedy (matching the shortest substring). Whether an RE is greedy or not is determined by the following rules...

c) Like individual components of an RE, the entire RE can be either greedy (matching the longest substring) or non-greedy (matching the shortest substring).

Do you think an edit along these lines would be helpful? If so, I'd be willing to take a shot at re-writing that section. Let me know. Thanks.

Ken Tanzer




Tom Lane wrote:

Ken Tanzer <[EMAIL PROTECTED]> writes:


Thanks for the quick responses yesterday. At a minimum, it seems like this behavior does not match what is described in the Postgres documentation (more detail below).



After looking at this more, I think that it is actually behaving as Spencer designed it to. The key point is this bit from the fine print in section 9.6.3.5:

        A branch has the same preference as the first quantified atom in it
        which has a preference.

("branch" being any regexp with no outer-level | operator)

What this apparently means is that if the RE begins with a non-greedy
quantifier, then the matching will be done in such a way that the whole
RE matches the shortest possible string --- that is, the whole RE is
non-greedy.  It's still possible for individual items within the RE to
be greedy or non-greedy, but that only affects how much of the shortest
possible total match they are allowed to eat relative to each other.
All the examples I've looked at seem to work "properly" when seen in
this light.

I can see that this behavior could have some usefulness, and if need be
you can always override it by writing (...){1,1} around the whole RE.
So at this point I'm disinclined to vary from the Tcl semantics.

This does leave us with a documentation problem though, because this
behavior is surely not obvious from what it says in 9.6.3.5.  If you've
got any thoughts about a better explanation, I'm all ears.



Here's the actual regex we're working on--any help reformulating this would be great!





select substring('Searching for log 5376, referenced in this text'
FROM
'(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />| )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)');



I don't see that you need either the leading (?:.*?) or the trailing (.*?) here, and if you dropped them then the first quantifier would be the "s?" which is greedy so the curious case goes away. I suppose the idea of adding (?:.*?) was to ensure that "log" will be matched to the first possible place where it could match --- but that is true anyway, per the first sentence of 9.6.3.5.

regards, tom lane




---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Reply via email to