On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote: > "Joel Jacobson" <j...@compiler.org> writes: > > [ 0005-regexp-positions.patch ] > > I took a quick look at this patch. I am on board now with the general > idea of returning match positions, but I think there are still > definitional issues to be discussed. > > 1. The main idea we might perhaps want to adopt from the Oracle-ish > regexp functions is a search-start-position argument. I'm not sure > that this is exciting --- if we intend this function to be a close > analog of regexp_matches(), it'd be better to leave that off. But > it deserves explicit consideration.
I intentionally made it a close analog of regexp_matches(), to make it easy for existing users of regexp_matches() to understand how regexp_positions() works. > 2. It looks like you modeled this on regexp_matches() to the extent > of returning a set and using the 'g' flag to control whether the > set can actually contain more than one row. That's pretty ancient > precedent ... but we have revisited that design more than once > because regexp_matches() is just too much of a pain to use. I think > if we're going to do this, we should learn from history, and provide an > analog to regexp_match() as well as regexp_matches() right off the bat. Yes, I modeled it on regexp_matches(). OK, so what you are suggesting is we should add both a regexp_position() function, that would work like regexp_match() but return the position instead, in addition to the already suggested regexp_positions() function. That sounds like a good idea to me. > 3. The API convention you chose (separate start and length arrays) > is perhaps still confusing. When I first looked at the test case > > +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$); > + regexp_positions > +------------------- > + ("{4,7}","{3,5}") > +(1 row) > > I thought it was all wrong because it seemed to be identifying > the substrings 'barbequ' and 'obarb'. If there'd been a different > number of matches than capture groups, maybe I'd not have been > confused, but still... I wonder if we'd be better advised to make > N capture groups produce N two-element arrays, or else mash it all > into one array of N*2 elements. But this probably depends on which > way is the easiest to work with in SQL. The drawbacks of two-element arrays have already been discussed up thread. Personally, I prefer the version suggested in the latest patch, and suggest we stick to it. > 4. Not sure about the handling of sub-matches. > There are various plausible definitions we could use: > > * We return the position/length of the overall match, never mind > about any parenthesized subexpressions. This is simple but I think > it loses significant functionality. As an example, you might have > a pattern like 'ab*(c*)d+' where what you actually want to know > is where the 'c's are, but they have to be in the context described > by the rest of the regexp. Without subexpression-match capability > that's painful to do. > > * If there's a capturing subexpression, return the position/length > of the first such subexpression, else return the overall match. > This matches the behavior of substring(). > > * If there are capturing subexpression(s), return the > positions/lengths of those, otherwise return the overall match. > This agrees with the behavior of regexp_match(es), so I'd tend > to lean to this option, but perhaps it's the hardest to use. > > * Return the position/length of the overall match *and* those of > each capturing subexpression. This is the most flexible choice, > but I give it low marks since it matches no existing behaviors. The existing behaviour of regexp_match(es) is perhaps a bit surprising to new users, but I think one quickly learn the semantics by trying out a few examples, and once understood it's at least not something that bothers me personally. I think it's best to let regexp_position(s) work the same way as regexp_match(es), since otherwise users would have to learn and remember two different behaviours. > As for comments on the patch itself: > > * The documentation includes an extraneous entry for regexp_replace, > and it fails to add the promised paragraph to functions-posix-regexp. > > * This bit is evidently copied from regexp_matches: > > + /* be sure to copy the input string into the multi-call ctx */ > + matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern, > > regexp_matches needs to save the input string so that > build_regexp_match_result can copy parts of that. But > regexp_positions has no such need AFAICS, so I think we > don't need a long-lived copy of the string. > > * I wouldn't use these names in build_regexp_positions_result: > > + ArrayType *so_ary; > + ArrayType *eo_ary; > > "so_ary" isn't awful, but it invites confusion with regex's "so" > field, which hasn't got the same semantics (off by one). > "eo_ary" is pretty bad because it isn't an "end offset" at all, but > a length. I'd go for "start_ary" and "length_ary" or some such. > > * Test cases seem a bit thin, notably there's no coverage of the > null-subexpression code path. Thanks for the comments on the patch itself. I will work on fixing these, but perhaps we can first see if it's possible to reach a consensus on the API convention and behaviour. /Joel