On Wed, Feb 24, 2021, at 16:23, I wrote: >I will hopefully be able to provide a similar analysis of \D soon, >but wanted to send this in the meantime.
CREATE TABLE "\D" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS NULL AND pattern LIKE '%\\D%'; SELECT 67558 CREATE TABLE "\D diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 'n') AS captured_pg_0001 FROM "\D" WHERE captured_pg IS DISTINCT FROM regexp_match(subject, '('||pattern||')', 'n'); SELECT 12 SELECT COUNT(*), pattern FROM "\D diff" GROUP BY 2 ORDER BY 1 DESC; count | pattern -------+---------- 11 | \D 1 | [\D|\d]* (2 rows) Pattern 1: \D ============ This pattern is used to find the first decimal separator, normally dot (.): SELECT subject FROM regex_tests WHERE pattern = '\D' ORDER BY RANDOM() LIMIT 10; subject --------------------------- 1.11.00.24975645674952163 1.11.30.6944442955860683 1.12.40.38502468714280424 3.5.10.9407443094500285 1.12.40.34334381021879845 2.0.20.5175496920692813 1.8.30.09144561055484002 3.4.10.6083619758942858 3.5.10.15406771889459425 2.0.00.6309370335082272 (10 rows) We can see how this works in almost all cases: SELECT captured_pg, captured_v8, count(*) from regex_tests where pattern = '\D' GROUP BY 1,2 ORDER BY 3 DESC LIMIT 3; captured_pg | captured_v8 | count -------------+-------------+------- {.} | {.} | 66797 | | 103 {-} | {-} | 64 (10 rows) If we take a look at the diffs found, all such cases have a subjects that starts with newlines: SELECT COUNT(*), subject ~ '^\n' AS starts_with_newline FROM "\D diff" WHERE pattern = '\D' GROUP BY 2; count | starts_with_newline -------+--------------------- 11 | t (1 row) Naturally, if newlines are not included, then something else will match instead. Now, if in these cases, ignoring the newline(s) and instead proceeding to match the first non-digit non-newline, maybe we wound find a dot (.) like in the normal case? No, that is not the case. Instead, we will hit some arbitrary blank space or tab: SELECT convert_to(captured_pg[1],'utf8') AS "0001+0002", convert_to(captured_pg_0001[1],'utf8') AS "0001", COUNT(*) FROM "\D diff" WHERE pattern = '\D' GROUP BY 1,2; 0001+0002 | 0001 | count -----------+------+------- \x0a | \x09 | 3 \x0a | \x20 | 7 \x0a | | 1 (3 rows) The last example where nothing at all matched, was due to the string only contained a single newline, which couldn't be matched. None of these outliners contain any decimal-looking-digit-sequences at all, it's all just white space, one "€ EUR" text and some text that looks like it's coming from some web shop's title: SELECT ROW_NUMBER() OVER (), subject FROM "\D diff" WHERE pattern = '\D'; row_number | subject ------------+---------------------------------------------------------------- 1 | + | + | + | 2 | + | 3 | + | 4 | + | 5 | + | € EUR + | 6 | + | 7 | + | 8 | + | 9 | + | 10 | + | Dunjackor, duntäcken och dunkuddar | Joutsen Dunspecialist+ | + | + | + | – Joutsen Sweden + | + | 11 | + | (11 rows) My conclusion is all of these are nonsensical subjects when applied to the \D regex. Out of the subjects with actual digit-sequences, none of them starts with newlines, so including newlines in \D wouldn't cause any effect. I see no benefit, but also no harm, in including newlines. Pattern 2: [\D|\d]* =============== This looks similar to [\w\W], the author has probably not understood pipe ("|") is not needed in between bracket expression parts. The author's intention is probably to match everything in the string, like .*, but including newlines. Patch 0002 therefore gets +1 due to this example. ===END OF PATTERNS=== My final conclusion is we should always include newlines in \D. /Joel