On 9/7/23 20:54, Tatsuo Ishii wrote:
>> DEFINE
>> A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
>> ...
>
> But:
>
> UP AS price > PREV(price)
>
> also depends on previous row, no?
PREV(CLASSIFIER()) depends not on the value of the previous row but on the state of the match so far. To take an example from the patch:

> * Example:
> *  str_set[0] = "AB";
> *  str_set[1] = "AC";
> * In this case at row 0 A and B are true, and A and C are true in row 1.

With these str_sets and my example DEFINE, row[1] is only classifiable as 'A' if row[0] is *not* classified as 'A' at this point in the match. "AA" is not a valid candidate string, even if it matches the PATTERN.

So if we don't reevaluate the pattern variable condition for the row, we at least have to prune the combinations that search_str_set() visits, so that we don't generate a logically impossible combination. That seems like it could be pretty fragile, and it may be difficult for us to prove compliance.

>> But it's easy
>> to come up with a pattern where that's the wrong order, like
>>
>>     PATTERN ( A+ (B|A)+ )
>>
>> Now "aaa" will be considered before "aab", which isn't correct.
>
> Can you explain a little bit more? I think 'aaa' matches a regular
> expression 'a+(b|a)+' and should be no problem before "aab" is
> considered.

Assuming I've understood the rules correctly, we're not allowed to classify the last row as 'A' if it also matches 'B'. Lexicographic ordering takes precedence, so we have to try "aab" first. Otherwise our query could return different results compared to another implementation.

>> Similarly, the assumption that we want to match the longest string only
>> works because we don't allow alternation yet.
>
> Can you please clarify more on this?

Sure: for the pattern

    PATTERN ( (A|B)+ )

we have to consider the candidate "a" over the candidate "ba", even though the latter is longer. Like the prior example, lexicographic ordering is considered more important than the greedy quantifier.
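To make the ordering concrete, here is a toy Python enumerator (my own sketch, not code from the patch) that lists candidate strings in the preference order a spec-style engine would try them: greedy quantifiers extend before stopping, and alternation branches are tried in written order. It assumes every row can be classified as any variable, i.e. no DEFINE conditions.

```python
# Pattern AST: ("var", name), ("seq", [subpatterns]),
#              ("alt", [branches]), ("plus", subpattern)

def candidates(pat, start, n):
    """Yield (end, string) pairs: classifications of rows start..end-1
    matching pat, in preference order (greedy, branches in written order)."""
    kind = pat[0]
    if kind == "var":
        if start < n:
            yield start + 1, pat[1].lower()
    elif kind == "seq":
        def go(i, pos, acc):
            if i == len(pat[1]):
                yield pos, acc
                return
            for end, s in candidates(pat[1][i], pos, n):
                yield from go(i + 1, end, acc + s)
        yield from go(0, start, "")
    elif kind == "alt":
        for branch in pat[1]:
            yield from candidates(branch, start, n)
    elif kind == "plus":
        for end, s in candidates(pat[1], start, n):
            # Greedy: try to extend the repetition before stopping here.
            yield from ((e2, s + s2) for e2, s2 in candidates(pat, end, n))
            yield end, s

# PATTERN ( A+ (B|A)+ ) over 3 rows:
pat = ("seq", [("plus", ("var", "A")),
               ("plus", ("alt", [("var", "B"), ("var", "A")]))])
order = []
for end, s in candidates(pat, 0, 3):
    if end == 3 and s not in order:
        order.append(s)
print(order)  # ['aab', 'aaa', 'abb', 'aba']
```

Note that "aab" comes out ahead of "aaa": the (B|A)+ part prefers B for the last row, so an engine that tries "aaa" first disagrees with the spec's ordering.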
Quoting ISO/IEC 9075-2:2016:

    More precisely, with both reluctant and greedy quantifiers, the set
    of matches is ordered lexicographically, but when one match is an
    initial substring of another match, reluctant quantifiers prefer the
    shorter match (the substring), whereas greedy quantifiers prefer the
    longer match (the “superstring”).

Here, "ba" doesn't have "a" as a prefix, so "ba" doesn't get priority. ISO/IEC 19075-5:2021 has a big section on this (7.2) with worked examples.

(The "lexicographic order matters more than greediness" concept was the most mind-bending part for me so far, probably because I haven't figured out how to translate the concept into POSIX EREs. It wouldn't make sense to say "the letter 't' can match 'a', 'B', or '3' in this regex", but that's what RPR is doing.)

>> Cool, I will give this piece some more thought. Do you mind if I try to
>> add some more complicated pattern quantifiers to stress the
>> architecture, or would you prefer to tackle that later? Just alternation
>> by itself will open up a world of corner cases.
>
> Do you mean you want to provide a better patch for the pattern
> matching part? That will be helpful.

No guarantees that I'll find a better patch :D But yes, I will give it a try.

> Because I am currently working
> on the aggregation part and have no time to do it. However, the
> aggregation work affects the v5 patch: it needs a refactoring. So can
> you wait until I release v6 patch? I hope it will be released in two
> weeks or so.

Absolutely!

>> But I'm wondering if we might want to just implement the NFA directly?
>> The current implementation's Cartesian explosion can probably be pruned
>> aggressively, but replaying the entire regex match once for every
>> backtracked step will still duplicate a lot of work.
>
> Not sure if you mean implementing new regular expression engine
> besides src/backend/regexp. I am afraid it's not a trivial work. The
> current regexp code consists of over 10k lines. What do you think?
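The quoted rule can be sketched as a tiny comparison helper (my own illustration, not part of the patch). It assumes each character's alphabetical order reflects the pattern's preference order, which happens to hold for PATTERN ( (A|B)+ ) with 'a' before 'b':

```python
def prefer(x, y, greedy=True):
    """Pick the preferred of two candidate strings under the spec's rule:
    lexicographic order decides, except that when one string is a prefix
    of the other, greedy quantifiers prefer the longer and reluctant
    quantifiers the shorter. Hypothetical helper for illustration only."""
    if x.startswith(y) or y.startswith(x):
        # Prefix case: quantifier semantics break the tie.
        longer, shorter = (x, y) if len(x) > len(y) else (y, x)
        return longer if greedy else shorter
    # Otherwise lexicographic order wins, regardless of length.
    return min(x, y)

print(prefer("a", "ba"))                # 'a'  (lexicographic beats length)
print(prefer("a", "ab"))                # 'ab' (prefix: greedy takes longer)
print(prefer("a", "ab", greedy=False))  # 'a'  (prefix: reluctant takes shorter)
```

This is why "a" beats "ba" above: "ba" is longer but is not a superstring of "a", so greediness never enters the picture.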
Heh, I think it would be pretty foolish for me to code an NFA, from scratch, and then try to convince the community to maintain it. But:

- I think we have to implement a parallel parser regardless (RPR PATTERN syntax looks incompatible with POSIX)
- I suspect we need more control over the backtracking than the current pg_reg* API is going to give us, or else I'm worried performance is going to fall off a cliff with usefully-large partitions
- there's a lot of stuff in POSIX EREs that we don't need, and of the features we do need, the + quantifier is probably one of the easiest
- it seems easier to prove the correctness of a slow, naive, row-at-a-time engine, because we can compare it directly to the spec

So what I'm thinking is: if I start by open-coding the + quantifier, and slowly add more pieces in, then it might be easier to see the parts of src/backend/regex that I've duplicated. We can try to expose those parts directly from the internal API to replace my bad implementation. And if there are parts that aren't duplicated, then it'll be easier to explain why we need something different from the current engine.

Does that seem like a workable approach? (Worst-case, my code is just horrible, and we throw it in the trash.)

>> I've attached another test case; it looks like last_value() is depending
>> on some sort of side effect from either first_value() or nth_value(). I
>> know the window frame itself is still under construction, so apologies
>> if this is an expected failure.
>
> Thanks. Fortunately current code which I am working passes the new
> test. I will include it in the next (v6) patch.

Great!

Thanks,
--Jacob
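For the sake of discussion, an open-coded + quantifier in the row-at-a-time style could look something like this Python toy (my sketch of the idea, not the patch's C implementation). The cond callback sees the current row plus the rows already absorbed, which is enough to express PREV-style DEFINE conditions:

```python
def plus_matches(rows, cond):
    """Greedy '+' quantifier, one row at a time: extend the run while
    cond holds, then yield candidate lengths longest-first, so a caller
    can backtrack to shorter matches on demand."""
    n = 0
    while n < len(rows) and cond(rows[n], rows[:n]):
        n += 1
    yield from range(n, 0, -1)

# Roughly "UP AS price > PREV(price)", simplified (an assumption on my
# part) so that the first row always starts a run.
up = lambda price, prior: not prior or price > prior[-1]
print(list(plus_matches([10, 20, 30, 25], up)))  # [3, 2, 1]
```

The longest-first order is what makes the quantifier greedy; a reluctant +? would yield shortest-first instead. Composing several of these generators is where the interesting backtracking-control questions start.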