On 03/04/21 10:40, Tom Lane wrote: > Also, did you see > > https://www.postgresql.org/message-id/fc160ee0-c843-b024-29bb-97b5da61971f%40darold.net > > Seems like there may be some overlap in these proposals.
Not only that, the functions in that other proposal are very similar to the standard's own functions that are specified to use XML Query regular expression syntax (sample implementations in [1]). These differently-named (which is good) functions seem to be a de facto standard where the regexp syntax and semantics are those native to the DBMS, the correspondence being de facto ISO XQuery-based -------------- ------------------ regexp_like like_regex regexp_count occurrences_regex regexp_instr position_regex regexp_substr substring_regex regexp_replace translate_regex The regexp_positions proposal highlights an interesting apparent gap in both the de facto and the ISO specs: the provided functions allow you to specify which occurrence you're talking about, and get the corresponding positions or the corresponding substring, but neither set of functions includes one to just give you all the matching positions at once as a SETOF something. What the proposed regexp_positions() returns is pretty much exactly the notional "list of match vectors" that appears internally throughout the specs of the ISO functions, but is never directly exposed. In the LOMV as described in the standard, the position/length arrays are indexed from zero, and the start and length at index 0 are those for the overall match as a whole. Right now, if you have a query that involves, say, substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 1) and also substring_regex('(b[^b]+)(b[^b]+)' IN str GROUP 2), a naïve implementation like [1] will of course compile and evaluate the regexp twice and return one group each time. It makes me wonder whether the standards committee was picturing a clever parse analyzer and planner that would say "aha! you want group 1 and group 2 from a single evaluation of this regex!", and that might even explain the curious rule in the standard that the regex must be an actual literal, not any other expression. (Still, that strikes me as an awkward way to have to write it, spelling the regex out as a literal, twice.) It has also made my idly wonder how close we could get to behaving that way, perhaps with planner support functions and other available parse analysis/planning hooks. Would any of those mechanisms get a sufficiently global view of the query to do that kind of rewriting? Regards, -Chap [1] https://tada.github.io/pljava/pljava-examples/apidocs/org/postgresql/pljava/example/saxon/S9.html#method.summary