Hi hackers,

I'd like to propose an extension to the Row Pattern Recognition (RPR)
absorption mechanism to handle anchored patterns (those starting with
START or similar prefix elements).

== Background ==

The current RPR implementation includes an absorption optimization that
reduces time complexity from O(n²) to O(n) for unbounded quantifiers
like A+. When multiple match contexts are at the same ABSORBABLE element,
the context that started earlier (having matched more rows) absorbs
later-starting contexts, eliminating redundant processing.

For example, with pattern "A+ B":
  - Ctx1 starts at row 0, matching A
  - Ctx2 starts at row 1, matching A
  - Ctx1 always has more A's → Ctx2 is absorbed (removed)
  - Result: Only 1 context maintained, O(n) complexity

== The Problem ==

This absorption mechanism breaks down for anchored patterns like
"START SECOND A+ B". The PREFIX elements (START, SECOND) block
absorption because contexts at different PREFIX positions cannot
be meaningfully compared.

Without absorption, we regress to O(n²) behavior as contexts
accumulate without being eliminated.

== Proposed Solution: Original/Alternate Simultaneous Progression ==

The key insight is to track two logical "paths" within each context:

1. Original: Starts at Element 0 (PREFIX + BODY)
   - can_absorb = true
   - Can be used as matchedState (actual match result)

2. Alternate: Starts at Element 2 (BODY only, skipping PREFIX)
   - can_be_absorbed = true
   - Used only for absorption eligibility checking
   - Cannot be used as matchedState

Both paths process the same row data simultaneously. The alternate
path serves as a "shadow" that indicates whether the context is
eligible for absorption.

== Absorption Conditions ==

For Ctx1 to absorb Ctx2, all conditions must be met:

1. Ctx1.can_absorb = true (original not outside pattern region)
2. Ctx2.can_be_absorbed = true (alternate still alive in BODY)
3. Ctx2's original has reached ABSORBABLE (PREFIX matching confirmed)
4. Ctx1 started before Ctx2

Note: No count comparison is needed. The alternate being alive
confirms absorption eligibility; the earlier-starting context
always wins.

== Why Wait for Ctx2's Original to Reach ABSORBABLE? ==

We cannot absorb Ctx2 immediately when it starts. We must wait
until Ctx2's original successfully passes through PREFIX elements
(START, SECOND) and reaches the ABSORBABLE point (A+). This
confirms that Ctx2's PREFIX matching succeeded before we remove it.

== Why Absorption is Safe (Common Fate) ==

Once both contexts reach the same ABSORBABLE element (A+), they
share a "common fate":

  - Same element position, same future data stream
  - Same choices ahead (continue A+ or transition to B)
  - If Ctx1 succeeds → Ctx2 would have succeeded too (but shorter match)
  - If Ctx1 fails → Ctx2 would have failed too (nothing lost)
  - The only difference is in the past (rows already matched),
    which doesn't affect future matching decisions

This property guarantees that absorption is safe: we never lose
a potential match by absorbing a later-starting context into an
earlier one.

== Flag Design ==

Compile-time (Element flags):
  - RPR_ELEM_ABSORBABLE: Unchanged (marks WHERE comparison happens)
  - RPR_ELEM_ABSORBABLE_BRANCH_PREFIX: New (PREFIX region, blocks
absorption)
  - RPR_ELEM_ABSORBABLE_BRANCH_BODY: New (BODY region, allows absorption)

Runtime (State variables):
  - can_absorb: "Can absorb others" (original=true, alternate=false)
  - can_be_absorbed: "Can be absorbed" (PREFIX→false, BODY→true)

Both flags are monotonic: once false, they cannot become true again.

== Complexity Analysis ==

With this extension:
  - Maximum concurrent contexts: PREFIX_length + 1
  - For "START SECOND A+ B": max 3 contexts at any time
  - Overall complexity: O(n) maintained

Example flow for "START SECOND A+ B" with data r0-r4(A,A,A,A,B):
  Row 0: Ctx1 starts                                           [1]
  Row 1: Ctx2 starts (cannot absorb: PREFIX not confirmed)     [2]
  Row 2: Ctx3 starts                                           [3]
  Row 3: Ctx4 starts, Ctx2 absorbed (PREFIX confirmed)         [3]
  Row 4: Ctx1 matches B, Ctx3-5 discarded                      [1]

Note: This is not an immediate priority. Stabilizing the current
RPR patch comes first. I'm sharing this design early to get feedback
and ensure the approach is sound before implementation.

I have a working design document with detailed state transition
rules and examples. Happy to share more details.

Thoughts?

--
Best regards,
Henson

Reply via email to