Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Joel Jacobson Tue, 09 Mar 2021 11:31:30 -0800

On Tue, Mar 9, 2021, at 17:42, Tom Lane wrote:
> "Joel Jacobson" <j...@compiler.org> writes:
> > Tom - can you please give details on your unpleasant experiences with 
> > parallel arrays?
> 
> The problems I can recall running into were basically down to not having
> an easy way to iterate through parallel arrays.  There are ways to do
> that in SQL, certainly, but they all constrain how you write the query,
> and usually force ugly stuff like splitting it into sub-selects.


I see now what you mean, many thanks for explaining.

> 
> As an example, presuming that regexp_positions is defined along the
> lines of
> 
> regexp_positions(str text, pat text, out starts int[], out lengths int[])
> returns setof record

+1

I think this is the most feasible best option so far.

Attached is a patch implementing it this way.

I changed the start to begin at 1, since this is how position ( substring text 
IN string text ) → integer works.

SELECT * FROM regexp_positions('foobarbequebaz', '^', 'g');
starts | lengths
--------+---------
{1}    | {0}
(1 row)

SELECT * FROM regexp_positions('foobarbequebaz', 'ba.', 'g');
starts | lengths
--------+---------
{4}    | {3}
{12}   | {3}
(2 rows)

Mark's examples:

SELECT * FROM regexp_positions('foObARbEqUEbAz', $re$(?=beque)$re$, 'i');
starts | lengths
--------+---------
{7}    | {0}
(1 row)


SELECT * FROM regexp_positions('foobarbequebaz', '(?<=z)', 'g');
starts | lengths
--------+---------
{15}   | {0}
(1 row)

I've also tested your template queries:

> 
> then to actually get the identified substrings you'd have to do something
> like
> 
> select
>   substring([input string] from starts[i] for lengths[i])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
>     generate_series(1, array_length(starts, 1)) i;

select
  substring('foobarbequebaz' from starts[i] for lengths[i])
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
    generate_series(1, array_length(starts, 1)) i;

substring
-----------
bar
baz
(2 rows)

> I think the last time I confronted this, we didn't have multi-array
> UNNEST.  Now that we do, we can get rid of the generate_series(),
> but it's still not beautiful:
> 
> select
>   substring([input string] from s for l)
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
>     unnest(starts, lengths) u(s,l);

select
  substring('foobarbequebaz' from s for l)
from
  regexp_positions('foobarbequebaz', 'ba.', 'g') r,
  lateral
    unnest(starts, lengths) u(s,l);

substring
-----------
bar
baz
(2 rows)

> Having said that, the other alternative with a 2-D array:
> 
> regexp_positions(str text, pat text) returns setof int[]
> 
> seems to still need UNNEST, though now it's not the magic multi-array
> UNNEST but this slicing version:
> 
> select
>   substring([input string] from u[1] for u[2])
> from
>   regexp_positions([input string], [pattern]) r,
>   lateral
>     unnest_slice(r, 1) u;

Unable to test this one since there is no unnest_slice() (yet)

> 
> Anyway, I'd counsel trying to write out SQL implementations
> of regexp_matches() and other useful things based on any
> particular regexp_positions() API you might be thinking about.
> Can we do anything useful without a LATERAL UNNEST thingie?
> Are some of them more legible than others?

Hmm, I cannot think of a way.


/Joel

0004-regexp-positions.patch
Description: Binary data

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Reply via email to