Re: substring start position behavior
On Tue, Feb 27, 2024 at 05:20:23PM +, PG Doc comments form wrote: > The following documentation comment has been logged on the website: > > Page: https://www.postgresql.org/docs/16/functions-string.html > Description: > > Hey, > > I was confused by substring behavior today, when giving 0 as a start > position. I understand now that string indices are 1-based, have a certain > flexibility about where to start (allowing negative start positions), and > that this is defined in the standard SQL spec. > > I'm comfy with all this, but I think it'd be nice to have a hint in the pg > substring docs for nonpositive start positions, so that users don't have to > have paid for the standard SQL spec to get past this. To me, substring seems > like a relatively common function with relatively surprising behavior. I dug into this and quickly became as confused as you were. The best explanation I found of the current behavior is here (with diagram): https://www.mssqltips.com/sqlservertutorial/9374/sql-substring-function/ SELECT SUBSTRING('Hello world',-2,5) as msg The last Postgres community discussion of this behavior I could find was from 2007: https://www.postgresql.org/message-id/flat/12803.1168804636%40sss.pgh.pa.us#8316fb2298c9e49f77867a1ae2ead447 This web page explains the feature: https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value but also asks: now the only question that remains is, "why would anyone need it to behave this way?" and the answer given is: @mao47 Well, it depends. I am not an author of implementation of SUBSTR but I guess with negative index it behaves like LEFT(string, LEN(string) - 1 - index). It works the same way in PostgreSQL so maybe it is SQL standard. Informix has substring() which matches the SQL standard, and substr() which uses negative start from the end of the string: https://www.ibm.com/docs/en/informix-servers/14.10?topic=smf-substr-function Oracle doesn't have substring(), just substr(), and matches Informix behavior, I think. I have developed the attached doc patch to document this. The only question is whether this substring behavior is so odd that we should not document it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e5fa82c161..45553ab824 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -2783,6 +2783,13 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in specified. Provide at least one of start and count. + +Non-positive start values specify a +start position before the first character, and therefore the +count must be two greater than the absolute +value of start to begin returning characters +from the start of the string. + substring('Thomas' from 2 for 3) hom @@ -2794,6 +2801,10 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in substring('Thomas' for 2) Th + + +substring('Thomas' from -4 for 8) +Tho
Re: substring start position behavior
On Tue, Mar 5, 2024 at 09:47:54PM -0500, dansonlineprese...@gmail.com wrote: > Thanks for all that context and diff! The odd behavior is indeed in > the SQL spec. At least I was convinced of that on postgres IRC by > someone who seemed to have quoted from the spec. I don't think the > feature ought to be hidden because it's odd. > > If I may suggest some specificity to the wording here, I think "start > values less than 1" would avoid confusion about whether 0 is > nonpositive or not, and bring attention to the function being > 1-indexed rather than 0-indexed. Sure, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e5fa82c161..f981793f00 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -2783,6 +2783,13 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in specified. Provide at least one of start and count. + +start values less then one specify +a start position before the first character, and therefore +count must be two greater than the absolute +value of start to begin returning characters +from the start of the string. + substring('Thomas' from 2 for 3) hom @@ -2794,6 +2801,10 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in substring('Thomas' for 2) Th + + +substring('Thomas' from -4 for 8) +Tho
Re: substring start position behavior
Bruce Momjian writes: > This web page explains the feature: > > https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value > but also asks: > now the only question that remains is, "why would anyone need it > to behave this way?" Yeah. I believe our implementation adheres to the SQL spec, which says this for (in SQL:2021 6.3.2): a) If the character encoding form of is UTF8, UTF16, or UTF32, then, in the remainder of this General Rule, the term “character” shall be taken to mean “unit specified by ”. b) Let C be the value of the , let LC be the length in characters of C, and let S be the value of the . c) If is specified, then let L be the value of and let E be S+L. Otherwise, let E be the larger of LC+1 and S. d) If at least one of C, S, and L is the null value, then the result of the is the null value. e) If E is less than S, then an exception condition is raised: data exception — substring error (22011). [tgl note: given c), this happens if and only if a negative is provided.] f) Case: i) If S is greater than LC or if E is less than 1 (one), then the result of the is the zero-length character string. ii) Otherwise, 1) Let S1 be the larger of S and 1 (one). Let E1 be the smaller of E and LC+1. Let L1 be E1–S1. 2) The result of the is a character string containing the L1 characters of C starting at character number S1 in the same order that the characters appear in C. That's a pretty sterling example of standards-ese that is both unreadable and devoid of any justification. But if you trace through the possible effects of a negative S value, it looks like (1) if L >= 0 is specified and S+L (E) is less than one, the result is an empty string per rule f)i). (2) if L >= 0 is specified and S+L (E) is at least one but less than LC+1, then E is the substring end+1 position. (3) otherwise, a negative S is disregarded and replaced by 1 so far as the substring end calculation is concerned. (4) in any case, a negative S is disregarded and replaced by 1 so far as the substring start calculation is concerned. I'm kind of inclined to not document this weirdness. I especially don't think it's worth giving an example that neither explains the "disregarded" bit nor highlights the dependency on L being given. regards, tom lane