On Mon, Mar 8, 2021, at 17:20, Mark Dilger wrote: > > On Mar 5, 2021, at 11:46 AM, Joel Jacobson <j...@compiler.org> wrote: > > <range.sql><0003-regexp-positions.patch> > > I did a bit more testing: > > +SELECT regexp_positions('foobarbequebaz', 'b', 'g'); > + regexp_positions > +------------------ > + {"[3,5)"} > + {"[6,8)"} > + {"[11,13)"} > +(3 rows) > + > > I understand that these ranges are intended to be read as one character long > matches starting at positions 3, 6, and 11, but they look like they match > either two or three characters, depending on how you read them, and users > will likely be confused by that. > > +SELECT regexp_positions('foobarbequebaz', '(?=beque)', 'g'); > + regexp_positions > +------------------ > + {"[6,7)"} > +(1 row) > + > > This is a zero length match. As above, it might be confusing that a zero > length match reads this way. > > +SELECT regexp_positions('foobarbequebaz', '(?<=z)', 'g'); > + regexp_positions > +------------------ > + {"[14,15)"} > +(1 row) > + > > Same here, except this time position 15 is referenced, which is beyond the > end of the string. > > I think a zero length match at the end of this string should read as > {"[14,14)"}, and you have been forced to add one to avoid that collapsing > down to "empty", but I'd rather you found a different datatype rather than > abuse the definition of int4range. > > It seems that you may have reached a similar conclusion down-thread?
This is due to the, in my opinion, unfortunate decision of using inclusive/exclusive as the canonical form for discrete types. Probably not much we can do about that, but that's what we have, so I think it's fine. [6,7) is exactly the same thing as [6,6] for discrete types, it simply means the startpos and endpos both are 6. I prefer to think of a match as two points. If the points are at the same position, it's a zero length match. In the example, the startpos and endpos are both at 6, so it's a zero length match., This was very confusing to me at first. I wrongly thought I needed an empty int4range and had the perverse idea of hacking the range type to allow setting lower and upper even though it was empty. This was a really really bad idea which I feel stupid of even considering. It was before I understood a zero length match should actually *not* be represented as an empty int4range, but as an int4range covering exactly one single integer, since startpos=endpos. This was due to my off-by-one error. With that fixed, the only problem is the (in my opinion) unnatural canonical form for discrete types, since in this context it's just silly to talk about inclusive/exclusive. I think inclusive/inclusive would have been much more SQL idiomatic, since that's the semantics for BETWEEN in SQL, it's inclusive/inclusive. So is most other programming environments I've seen. However, not much we can do about that for int4range/int8range, but maybe multirange could change the canonical form going forward. Even if not changed, I think int4range works just fine. It just requires a bit more mental effort to understand what the values mean. Probably an annoyance for users at first, but I think they easily will understand they should just do "-1" on the upper() value (but only if upper_inc() is FALSE, but you know that for sure for int4ranges, so it is really necessary, one might wonder). If a N+1 dimension array could easily be unnested to a N dimension array, I would prefer Tom's idea of a 2-D regexp_positions(), since it simple and not controversial. Since there are currently zero composite type returning catalog functions, I can see why the idea of returning a "range" with two "start" and "stop" fields is controversial. There are probably good reasons that I fail to see why there are no composite type returning functions in the catalogs. Ideas on why this is the case, anyone? /Joel