Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Joel Jacobson Sat, 25 Dec 2021 03:03:18 -0800

On Tue, Sep 21, 2021, at 19:55, Tom Lane wrote:
> "Joel Jacobson" <j...@compiler.org> writes:
> > [ 0005-regexp-positions.patch ]
>
> I took a quick look at this patch.  I am on board now with the general
> idea of returning match positions, but I think there are still
> definitional issues to be discussed.
>
> 1. The main idea we might perhaps want to adopt from the Oracle-ish
> regexp functions is a search-start-position argument.  I'm not sure
> that this is exciting --- if we intend this function to be a close
> analog of regexp_matches(), it'd be better to leave that off.  But
> it deserves explicit consideration.


I intentionally made it a close analog of regexp_matches(),
to make it easy for existing users of regexp_matches() to
understand how regexp_positions() works.

> 2. It looks like you modeled this on regexp_matches() to the extent
> of returning a set and using the 'g' flag to control whether the
> set can actually contain more than one row.  That's pretty ancient
> precedent ... but we have revisited that design more than once
> because regexp_matches() is just too much of a pain to use.  I think
> if we're going to do this, we should learn from history, and provide an
> analog to regexp_match() as well as regexp_matches() right off the bat.

Yes, I modeled it on regexp_matches().
OK, so what you are suggesting is we should add both a regexp_position() 
function,
that would work like regexp_match() but return the position instead,
in addition to the already suggested regexp_positions() function.

That sounds like a good idea to me.

> 3. The API convention you chose (separate start and length arrays)
> is perhaps still confusing.  When I first looked at the test case
>
> +SELECT regexp_positions('foobarbequebaz', $re$(bar)(beque)$re$);
> + regexp_positions  
> +-------------------
> + ("{4,7}","{3,5}")
> +(1 row)
>
> I thought it was all wrong because it seemed to be identifying
> the substrings 'barbequ' and 'obarb'.  If there'd been a different
> number of matches than capture groups, maybe I'd not have been
> confused, but still...  I wonder if we'd be better advised to make
> N capture groups produce N two-element arrays, or else mash it all
> into one array of N*2 elements.  But this probably depends on which
> way is the easiest to work with in SQL.

The drawbacks of two-element arrays have already been discussed up thread.
Personally, I prefer the version suggested in the latest patch, and suggest we 
stick to it.

> 4. Not sure about the handling of sub-matches.
> There are various plausible definitions we could use:
>
> * We return the position/length of the overall match, never mind
> about any parenthesized subexpressions.  This is simple but I think
> it loses significant functionality.  As an example, you might have
> a pattern like 'ab*(c*)d+' where what you actually want to know
> is where the 'c's are, but they have to be in the context described
> by the rest of the regexp.  Without subexpression-match capability
> that's painful to do.
>
> * If there's a capturing subexpression, return the position/length
> of the first such subexpression, else return the overall match.
> This matches the behavior of substring().
>
> * If there are capturing subexpression(s), return the
> positions/lengths of those, otherwise return the overall match.
> This agrees with the behavior of regexp_match(es), so I'd tend
> to lean to this option, but perhaps it's the hardest to use.
>
> * Return the position/length of the overall match *and* those of
> each capturing subexpression.  This is the most flexible choice,
> but I give it low marks since it matches no existing behaviors.

The existing behaviour of regexp_match(es) is perhaps a bit surprising to new 
users,
but I think one quickly learn the semantics by trying out a few examples,
and once understood it's at least not something that bothers me personally.

I think it's best to let regexp_position(s) work the same way as 
regexp_match(es),
since otherwise users would have to learn and remember two different behaviours.

> As for comments on the patch itself:
>
> * The documentation includes an extraneous entry for regexp_replace,
> and it fails to add the promised paragraph to functions-posix-regexp.
>
> * This bit is evidently copied from regexp_matches:
>
> +        /* be sure to copy the input string into the multi-call ctx */
> +        matchctx = setup_regexp_matches(PG_GETARG_TEXT_P_COPY(0), pattern,
>
> regexp_matches needs to save the input string so that
> build_regexp_match_result can copy parts of that.  But
> regexp_positions has no such need AFAICS, so I think we
> don't need a long-lived copy of the string.
>
> * I wouldn't use these names in build_regexp_positions_result:
>
> +    ArrayType  *so_ary;
> +    ArrayType  *eo_ary;
>
> "so_ary" isn't awful, but it invites confusion with regex's "so"
> field, which hasn't got the same semantics (off by one).
> "eo_ary" is pretty bad because it isn't an "end offset" at all, but
> a length.  I'd go for "start_ary" and "length_ary" or some such.
>
> * Test cases seem a bit thin, notably there's no coverage of the
> null-subexpression code path.

Thanks for the comments on the patch itself.
I will work on fixing these, but perhaps we can first see if it's possible to 
reach a consensus on the API convention and behaviour.

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Reply via email to