Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Chapman Flack Thu, 04 Mar 2021 08:41:50 -0800

On 03/04/21 10:40, Tom Lane wrote:
> Also, did you see
> 
> https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net
> 
> Seems like there may be some overlap in these proposals.


Not only that, the functions in that other proposal are very similar
to the standard's own functions that are specified to use XML Query
regular expression syntax (sample implementations in [1]).

These differently-named (which is good) functions seem to be a de facto
standard where the regexp syntax and semantics are those native to the
DBMS, the correspondence being

    de facto                      ISO XQuery-based
  --------------                 ------------------
  regexp_like                    like_regex
  regexp_count                   occurrences_regex
  regexp_instr                   position_regex
  regexp_substr                  substring_regex
  regexp_replace                 translate_regex


The regexp_positions proposal highlights an interesting apparent gap in
both the de facto and the ISO specs: the provided functions allow you
to specify which occurrence you're talking about, and get the corresponding
positions or the corresponding substring, but neither set of functions
includes one to just give you all the matching positions at once as
a SETOF something.

What the proposed regexp_positions() returns is pretty much exactly
the notional "list of match vectors" that appears internally throughout
the specs of the ISO functions, but is never directly exposed.

In the LOMV as described in the standard, the position/length arrays
are indexed from zero, and the start and length at index 0 are those
for the overall match as a whole.

Right now, if you have a query that involves, say,

  substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also
  substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2),

a naïve implementation like [1] will of course compile and evaluate
the regexp twice and return one group each time. It makes me wonder
whether the standards committee was picturing a clever parse analyzer
and planner that would say "aha! you want group 1 and group 2 from
a single evaluation of this regex!", and that might even explain the
curious rule in the standard that the regex must be an actual literal,
not any other expression. (Still, that strikes me as an awkward way to
have to write it, spelling the regex out as a literal, twice.)

It has also made my idly wonder how close we could get to behaving
that way, perhaps with planner support functions and other available
parse analysis/planning hooks. Would any of those mechanisms get a
sufficiently global view of the query to do that kind of rewriting?

Regards,
-Chap


[1]
https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Reply via email to