On 9/7/23 20:54, Tatsuo Ishii wrote:
>> DEFINE
>> A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
>> ...
>
> But:
>
> UP AS price > PREV(price)
>
> also depends on previous row, no?
PREV(CLASSIFIER()) depends not on the value of the previous row but on the state of the match so far. To take an example from the patch:

> * Example:
> *  str_set[0] = "AB";
> *  str_set[1] = "AC";
> * In this case at row 0 A and B are true, and A and C are true in row 1.

With these str_sets and my example DEFINE, row[1] is only classifiable as 'A' if row[0] is *not* classified as 'A' at this point in the match. "AA" is not a valid candidate string, even if it matches the PATTERN.

So if we don't reevaluate the pattern variable condition for the row, we at least have to prune the combinations that search_str_set() visits, so that we don't generate a logically impossible combination. That seems like it could be pretty fragile, and it may be difficult for us to prove compliance.

>> But it's easy
>> to come up with a pattern where that's the wrong order, like
>>
>>     PATTERN ( A+ (B|A)+ )
>>
>> Now "aaa" will be considered before "aab", which isn't correct.
>
> Can you explain a little bit more? I think 'aaa' matches a regular
> expression 'a+(b|a)+' and should be no problem before "aab" is
> considered.

Assuming I've understood the rules correctly, we're not allowed to classify the last row as 'A' if it also matches 'B'. Lexicographic ordering takes precedence, so we have to try "aab" first. Otherwise our query could return different results compared to another implementation.

>> Similarly, the assumption that we want to match the longest string only
>> works because we don't allow alternation yet.
>
> Can you please clarify more on this?

Sure: for the pattern

    PATTERN ( (A|B)+ )

we have to consider the candidate "a" over the candidate "ba", even though the latter is longer. Like the prior example, lexicographic ordering is considered more important than the greedy quantifier.
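To make the ordering concrete, here is a toy Python enumerator (my own sketch, not code from the patch) that lists candidate strings in the preference order a spec-style engine would try them: greedy quantifiers extend before stopping, and alternation branches are tried in written order. It assumes every row can be classified as any variable, i.e. no DEFINE conditions.

```python
# Pattern AST: ("var", name), ("seq", [subpatterns]),
#              ("alt", [branches]), ("plus", subpattern)

def candidates(pat, start, n):
    """Yield (end, string) pairs: classifications of rows start..end-1
    matching pat, in preference order (greedy, branches in written order)."""
    kind = pat[0]
    if kind == "var":
        if start < n:
            yield start + 1, pat[1].lower()
    elif kind == "seq":
        def go(i, pos, acc):
            if i == len(pat[1]):
                yield pos, acc
                return
            for end, s in candidates(pat[1][i], pos, n):
                yield from go(i + 1, end, acc + s)
        yield from go(0, start, "")
    elif kind == "alt":
        for branch in pat[1]:
            yield from candidates(branch, start, n)
    elif kind == "plus":
        for end, s in candidates(pat[1], start, n):
            # Greedy: try to extend the repetition before stopping here.
            yield from ((e2, s + s2) for e2, s2 in candidates(pat, end, n))
            yield end, s

# PATTERN ( A+ (B|A)+ ) over 3 rows:
pat = ("seq", [("plus", ("var", "A")),
               ("plus", ("alt", [("var", "B"), ("var", "A")]))])
order = []
for end, s in candidates(pat, 0, 3):
    if end == 3 and s not in order:
        order.append(s)
print(order)  # ['aab', 'aaa', 'abb', 'aba']
```

Note that "aab" comes out ahead of "aaa": the (B|A)+ part prefers B for the last row, so an engine that tries "aaa" first disagrees with the spec's ordering.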
Quoting ISO/IEC 9075-2:2016:

    More precisely, with both reluctant and greedy quantifiers, the set
    of matches is ordered lexicographically, but when one match is an
    initial substring of another match, reluctant quantifiers prefer the
    shorter match (the substring), whereas greedy quantifiers prefer the
    longer match (the “superstring”).

Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples.

(The "lexicographic order matters more than greediness" concept was the most mind-bending part for me so far, probably because I haven't figured out how to translate the concept into POSIX EREs. It wouldn't make sense to say "the letter 't' can match 'a', 'B', or '3' in this regex", but that's what RPR is doing.)

>> Cool, I will give this piece some more thought. Do you mind if I try to
>> add some more complicated pattern quantifiers to stress the
>> architecture, or would you prefer to tackle that later? Just alternation
>> by itself will open up a world of corner cases.
>
> Do you mean you want to provide a better patch for the pattern
> matching part? That will be helpful.

No guarantees that I'll find a better patch :D But yes, I will give it a try.

> Because I am currently working
> on the aggregation part and have no time to do it. However, the
> aggregation work affects the v5 patch: it needs a refactoring. So can
> you wait until I release v6 patch? I hope it will be released in two
> weeks or so.

Absolutely!

>> But I'm wondering if we might want to just implement the NFA directly?
>> The current implementation's Cartesian explosion can probably be pruned
>> aggressively, but replaying the entire regex match once for every
>> backtracked step will still duplicate a lot of work.
>
> Not sure if you mean implementing new regular expression engine
> besides src/backend/regexp. I am afraid it's not a trivial work. The
> current regexp code consists of over 10k lines. What do you think?
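The quoted rule can be sketched as a tiny comparison helper (my own illustration, not part of the patch). It assumes each character's alphabetical order reflects the pattern's preference order, which happens to hold for PATTERN ( (A|B)+ ) with 'a' before 'b':

```python
def prefer(x, y, greedy=True):
    """Pick the preferred of two candidate strings under the spec's rule:
    lexicographic order decides, except that when one string is a prefix
    of the other, greedy quantifiers prefer the longer and reluctant
    quantifiers the shorter. Hypothetical helper for illustration only."""
    if x.startswith(y) or y.startswith(x):
        # Prefix case: quantifier semantics break the tie.
        longer, shorter = (x, y) if len(x) > len(y) else (y, x)
        return longer if greedy else shorter
    # Otherwise lexicographic order wins, regardless of length.
    return min(x, y)

print(prefer("a", "ba"))                # 'a'  (lexicographic beats length)
print(prefer("a", "ab"))                # 'ab' (prefix: greedy takes longer)
print(prefer("a", "ab", greedy=False))  # 'a'  (prefix: reluctant takes shorter)
```

This is why "a" beats "ba" above: "ba" is longer but is not a superstring of "a", so greediness never enters the picture.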
Heh, I think it would be pretty foolish for me to code an NFA, from scratch, and then try to convince the community to maintain it. But:

- I think we have to implement a parallel parser regardless (RPR PATTERN syntax looks incompatible with POSIX)
- I suspect we need more control over the backtracking than the current pg_reg* API is going to give us, or else I'm worried performance is going to fall off a cliff with usefully-large partitions
- there's a lot of stuff in POSIX EREs that we don't need, and of the features we do need, the + quantifier is probably one of the easiest
- it seems easier to prove the correctness of a slow, naive, row-at-a-time engine, because we can compare it directly to the spec

So what I'm thinking is: if I start by open-coding the + quantifier, and slowly add more pieces in, then it might be easier to see the parts of src/backend/regex that I've duplicated. We can try to expose those parts directly from the internal API to replace my bad implementation. And if there are parts that aren't duplicated, then it'll be easier to explain why we need something different from the current engine.

Does that seem like a workable approach? (Worst-case, my code is just horrible, and we throw it in the trash.)

>> I've attached another test case; it looks like last_value() is depending
>> on some sort of side effect from either first_value() or nth_value(). I
>> know the window frame itself is still under construction, so apologies
>> if this is an expected failure.
>
> Thanks. Fortunately current code which I am working passes the new
> test. I will include it in the next (v6) patch.

Great!

Thanks,
--Jacob
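For the sake of discussion, an open-coded + quantifier in the row-at-a-time style could look something like this Python toy (my sketch of the idea, not the patch's C implementation). The cond callback sees the current row plus the rows already absorbed, which is enough to express PREV-style DEFINE conditions:

```python
def plus_matches(rows, cond):
    """Greedy '+' quantifier, one row at a time: extend the run while
    cond holds, then yield candidate lengths longest-first, so a caller
    can backtrack to shorter matches on demand."""
    n = 0
    while n < len(rows) and cond(rows[n], rows[:n]):
        n += 1
    yield from range(n, 0, -1)

# Roughly "UP AS price > PREV(price)", simplified (an assumption on my
# part) so that the first row always starts a run.
up = lambda price, prior: not prior or price > prior[-1]
print(list(plus_matches([10, 20, 30, 25], up)))  # [3, 2, 1]
```

The longest-first order is what makes the quantifier greedy; a reluctant +? would yield shortest-first instead. Composing several of these generators is where the interesting backtracking-control questions start.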