Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Joel Jacobson Mon, 08 Mar 2021 09:05:58 -0800

On Mon, Mar 8, 2021, at 17:20, Mark Dilger wrote:
> > On Mar 5, 2021, at 11:46 AM, Joel Jacobson <j...@compiler.org> wrote:
> > <range.sql><0003-regexp-positions.patch>
> 
> I did a bit more testing:
> 
> +SELECT regexp_positions('foobarbequebaz', 'b', 'g');
> + regexp_positions 
> +------------------
> + {"[3,5)"}
> + {"[6,8)"}
> + {"[11,13)"}
> +(3 rows)
> +
> 
> I understand that these ranges are intended to be read as one character long 
> matches starting at positions 3, 6, and 11, but they look like they match 
> either two or three characters, depending on how you read them, and users 
> will likely be confused by that.
> 
> +SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g');
> + regexp_positions 
> +------------------
> + {"[6,7)"}
> +(1 row)
> +
> 
> This is a zero length match.  As above, it might be confusing that a zero 
> length match reads this way.
> 
> +SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g');
> + regexp_positions 
> +------------------
> + {"[14,15)"}
> +(1 row)
> +
> 
> Same here, except this time position 15 is referenced, which is beyond the 
> end of the string.
> 
> I think a zero length match at the end of this string should read as 
> {"[14,14)"}, and you have been forced to add one to avoid that collapsing 
> down to "empty", but I'd rather you found a different datatype rather than 
> abuse the definition of int4range.
> 
> It seems that you may have reached a similar conclusion down-thread?


This is due to the, in my opinion, unfortunate decision of using 
inclusive/exclusive as the canonical form for discrete types.
Probably not much we can do about that, but that's what we have, so I think 
it's fine.

[6,7) is exactly the same thing as [6,6] for discrete types, it simply means 
the startpos and endpos both are 6.

I prefer to think of a match as two points. If the points are at the same 
position, it's a zero length match.

In the example, the startpos and endpos are both at 6, so it's a zero length 
match.,

This was very confusing to me at first. I wrongly thought I needed an empty 
int4range and had the perverse idea of hacking the range type to allow setting 
lower and upper even though it was empty. This was a really really bad idea 
which I feel stupid of even considering. It was before I understood a zero 
length match should actually *not* be represented as an empty int4range, but as 
an int4range covering exactly one single integer, since startpos=endpos. This 
was due to my off-by-one error. With that fixed, the only problem is the (in my 
opinion) unnatural canonical form for discrete types, since in this context 
it's just silly to talk about inclusive/exclusive. I think inclusive/inclusive 
would have been much more SQL idiomatic, since that's the semantics for BETWEEN 
in SQL, it's inclusive/inclusive. So is most other programming environments 
I've seen.

However, not much we can do about that for int4range/int8range,
but maybe multirange could change the canonical form going forward.

Even if not changed, I think int4range works just fine. It just requires a bit 
more mental effort to understand what the values mean. Probably an annoyance 
for users at first, but I think they easily will understand they should just do 
"-1" on the upper() value (but only if upper_inc() is FALSE, but you know that 
for sure for int4ranges, so it is really necessary, one might wonder).

If a N+1 dimension array could easily be unnested to a N dimension array,
I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not 
controversial.

Since there are currently zero composite type returning catalog functions, I 
can see why the idea of returning a "range" with two "start" and "stop" fields 
is controversial. There are probably good reasons that I fail to see why there 
are no composite type returning functions in the catalogs. Ideas on why this is 
the case, anyone?

/Joel

Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]

Reply via email to