Re: Row pattern recognition

Jacob Champion Fri, 27 Sep 2024 15:27:36 -0700

On Wed, Sep 18, 2024 at 10:00 PM Tatsuo Ishii <is...@postgresql.org> wrote:
>
> Attached are the v22 patches. Just rebased.


Thanks!

With some bigger partitions, I hit an `ERROR:  wrong pos: 1024`. A
test that reproduces it is attached.

While playing with the feature, I've been trying to identify runs of
matched rows by eye. But it's pretty difficult -- the best I can do is
manually count rows using a `COUNT(*) OVER ...`. So I'd like to
suggest that MEASURES be part of the eventual v1 feature, if there's
no other way to determine whether a row was skipped by a previous
match. (That was less obvious to me before the fix in v17.)

--

I've been working on an implementation [1] of SQL/RPR's "parenthesized
language" and preferment order. (These are defined in SQL/Foundation
2023, section 9.41.) The tool gives you a way to figure out, for a
given pattern, what matches are supposed to be attempted and in what
order:

    $ ./src/test/modules/rpr/rpr_prefer "a b? a"
    ( ( a ( b ) ) a )
    ( ( a ( ) ) a )

Many simple patterns result in an infinite set of possible matches. So
if you use an unbounded quantifiers, you have to also use --max-rows
to limit the size of the hypothetical window frame:

    $ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $"
    ( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ )
    ( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ )
    ( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ )
    ( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ )
    ( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ )
    ( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ )
    ( ( ^ ( ) ) $ )

I've found this useful to check my personal understanding of the spec
and the match behavior, but it could also potentially be used to
generate test cases, or to help users debug their own patterns. For
example, a pattern that has a bunch of duplicate sequences in its PL
is probably not very well optimized:

    $ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+"
    ( ( a a a ) ( a ) )
    ( ( a a ) ( a a ) )
    ( ( a a ) ( a ) )
    ( ( a ) ( a a a ) )
    ( ( a ) ( a a ) )
    ( ( a ) ( a ) )

And patterns with catastrophic backtracking behavior tend to show a
"sawtooth" pattern in the output, with a huge number of potential
matches being generated relative to the number of rows in the frame.

My implementation is really messy -- it leaks memory like a sieve, and
I cannibalized the parser from ECPG, which just ended up as an
exercise in teaching myself flex/bison. But if there's interest in
having this kind of tool in the tree, I can work on making it
reviewable. Either way, I should be able to use it to double-check
more complicated test cases.

A while back [2], you were wondering whether our Bison implementation
would be able to parse the PATTERN grammar directly. I think this tool
proves that the answer is "yes", but PERMUTE in particular causes a
shift/reduce conflict. To fix it, I applied the same precedence
workaround that we use for CUBE and ROLLUP.

Thanks again!
--Jacob

[1] https://github.com/jchampio/postgres/tree/dev/rpr
[2] 
https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp

commit 819c1abba21c9b59cceedd55962c3fdcb982aad5
Author: Jacob Champion <jacob.champ...@enterprisedb.com>
Date:   Thu Sep 26 14:12:38 2024 -0700

    RPR: test larger window partitions
    
    The second test case fails with
    
        ERROR:  wrong pos: 1024

diff --git a/src/test/regress/expected/rpr.out 
b/src/test/regress/expected/rpr.out
index 0789e09435..dcfd67f143 100644
--- a/src/test/regress/expected/rpr.out
+++ b/src/test/regress/expected/rpr.out
@@ -834,3 +834,40 @@ ERROR:  SEEK is not supported
 LINE 8:  SEEK
          ^
 HINT:  Use INITIAL.
+-- Smoke test for larger partitions.
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r+ )
+  DEFINE r AS TRUE
+ )
+)
+-- Should be exactly one long match across all rows.
+SELECT * FROM s WHERE c > 0;
+ v |  c   
+---+------
+ 1 | 5000
+(1 row)
+
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r )
+  DEFINE r AS TRUE
+ )
+)
+-- Every row should be its own match.
+SELECT count(*) FROM s WHERE c > 0;
+ count 
+-------
+  5000
+(1 row)
+
diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql
index 302e2b86a5..6ff61e6d60 100644
--- a/src/test/regress/sql/rpr.sql
+++ b/src/test/regress/sql/rpr.sql
@@ -403,3 +403,32 @@ SELECT company, tdate, price, first_value(price) OVER w, 
last_value(price) OVER
   UP AS price > PREV(price),
   DOWN AS price < PREV(price)
 );
+
+-- Smoke test for larger partitions.
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r+ )
+  DEFINE r AS TRUE
+ )
+)
+-- Should be exactly one long match across all rows.
+SELECT * FROM s WHERE c > 0;
+
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r )
+  DEFINE r AS TRUE
+ )
+)
+-- Every row should be its own match.
+SELECT count(*) FROM s WHERE c > 0;

Re: Row pattern recognition

Reply via email to