Re: Bizarre behavior of \w in a regular expression bracket construct

Joel Jacobson Wed, 24 Feb 2021 08:04:33 -0800

On Wed, Feb 24, 2021, at 16:23, I wrote:
>I will hopefully be able to provide a similar analysis of \D soon,
>but wanted to send this in the meantime.


CREATE TABLE "\D" AS SELECT * FROM regex_tests WHERE processed AND error_pg IS 
NULL AND pattern LIKE '%\\D%';
SELECT 67558

CREATE TABLE "\D diff" AS SELECT *, regexp_match(subject, '('||pattern||')', 
'n') AS captured_pg_0001 FROM "\D" WHERE captured_pg IS DISTINCT FROM 
regexp_match(subject, '('||pattern||')', 'n');
SELECT 12

SELECT COUNT(*), pattern FROM "\D diff" GROUP BY 2 ORDER BY 1 DESC;
count | pattern
-------+----------
    11 | \D
     1 | [\D|\d]*
(2 rows)

Pattern 1: \D
============

This pattern is used to find the first decimal separator, normally dot (.):

SELECT subject FROM regex_tests WHERE pattern = '\D' ORDER BY RANDOM() LIMIT 10;
          subject
---------------------------
1.11.00.24975645674952163
1.11.30.6944442955860683
1.12.40.38502468714280424
3.5.10.9407443094500285
1.12.40.34334381021879845
2.0.20.5175496920692813
1.8.30.09144561055484002
3.4.10.6083619758942858
3.5.10.15406771889459425
2.0.00.6309370335082272
(10 rows)

We can see how this works in almost all cases:

SELECT captured_pg, captured_v8, count(*) from regex_tests where pattern = '\D' 
GROUP BY 1,2 ORDER BY 3 DESC LIMIT 3;
captured_pg | captured_v8 | count
-------------+-------------+-------
{.}         | {.}         | 66797
             |             |   103
{-}         | {-}         |    64
(10 rows)

If we take a look at the diffs found,
all such cases have a subjects that starts with newlines:

SELECT COUNT(*), subject ~ '^\n' AS starts_with_newline FROM "\D diff" WHERE 
pattern = '\D' GROUP BY 2;
count | starts_with_newline
-------+---------------------
    11 | t
(1 row)

Naturally, if newlines are not included, then something else will match instead.

Now, if in these cases, ignoring the newline(s) and instead proceeding
to match the first non-digit non-newline, maybe we wound find a dot (.)
like in the normal case? No, that is not the case. Instead, we will hit
some arbitrary blank space or tab:

SELECT convert_to(captured_pg[1],'utf8') AS "0001+0002", 
convert_to(captured_pg_0001[1],'utf8') AS "0001", COUNT(*) FROM "\D diff" WHERE 
pattern = '\D' GROUP BY 1,2;
0001+0002 | 0001 | count
-----------+------+-------
\x0a      | \x09 |     3
\x0a      | \x20 |     7
\x0a      |      |     1
(3 rows)

The last example where nothing at all matched, was due to the string only 
contained a single newline,
which couldn't be matched.

None of these outliners contain any decimal-looking-digit-sequences at all,
it's all just white space, one "€ EUR" text and some text that looks like
it's coming from some web shop's title:

SELECT ROW_NUMBER() OVER (), subject FROM "\D diff" WHERE pattern = '\D';
row_number |                            subject
------------+----------------------------------------------------------------
          1 |                                                               +
            |                                                               +
            |                                                               +
            |
          2 |                                                               +
            |
          3 |                                                               +
            |
          4 |                                                               +
            |
          5 |                                                               +
            |         € EUR                                                 +
            |
          6 |                                                               +
            |
          7 |                                                               +
            |
          8 |                                                               +
            |
          9 |                                                               +
            |
         10 |                                                               +
            |     Dunjackor, duntäcken och dunkuddar | Joutsen Dunspecialist+
            |                                                               +
            |                                                               +
            |                                                               +
            |       – Joutsen Sweden                                        +
            |                                                               +
            |
         11 |                                                               +
            |
(11 rows)

My conclusion is all of these are nonsensical subjects when applied to the \D 
regex.

Out of the subjects with actual digit-sequences,
none of them starts with newlines,
so including newlines in \D wouldn't cause any effect.

I see no benefit, but also no harm, in including newlines.

Pattern 2: [\D|\d]*
===============

This looks similar to [\w\W], the author has probably not understood pipe ("|") 
is not needed in between bracket expression parts. The author's intention is 
probably to match everything in the string, like .*, but including newlines.

Patch 0002 therefore gets +1 due to this example.

===END OF PATTERNS===

My final conclusion is we should always include newlines in \D.

/Joel

Re: Bizarre behavior of \w in a regular expression bracket construct

Reply via email to