berkaysynnada commented on issue #12700:
URL: https://github.com/apache/datafusion/issues/12700#issuecomment-2411134296

   I have worked through [this 
comment](https://github.com/apache/datafusion/issues/12700#issuecomment-2389022925)
 in detail to determine whether there is an actual bug in the current 
implementation. Initially, my thoughts were similar to @akurmustafa's 
intuition, and I draft a few designs. However, I can now safely say that the 
current implementation is correct, and the tests are valid for both the 
`add_new_orderings()` and `satisfy()` APIs. Below are my brief findings:
   
   The example that triggered the initial bug was `[a, b, c, d]` and `[a, c, b, 
d]`. We thought it could be simplified to `[a, b, d]` and `[a, c, d]`, but this 
is a mistaken assumption.
   
   **Why?**
   
   Consider this table:
   
   | a | b | c | d |
   |---|---|---|---|
   | 1 | 1 | 1 | 1 |
   | 1 | 1 | 2 | 2 |
   | 1 | 2 | 3 | 3 |
   | 1 | 3 | 4 | 1 |
   
   This table is ordered by both `[a, b, c, d]` and `[a, c, b, d]`; also `[a, 
b, d]` and `[a, c, d]` are also valid orderings. However, when we add these 
orderings to the equivalence properties, can we make them stored as `[a, b, d]` 
and `[a, c, d]`?
   
   The answer is **no**.
   
   Consider this table:
   
   | a | b | c | d |
   |---|---|---|---|
   | 1 | 1 | 1 | 1 |
   | 1 | 1 | 2 | 2 |
   | 1 | 2 | 3 | 3 |
   | 1 | 3 | 3 | 1 |
   
   This table also satisfies both `[a, b, c, d]` and `[a, c, b, d]`, but we 
cannot simplify them to `[a, b, d]` and `[a, c, d]`. In this case, `[a, c, d]` 
is not valid.
   
   If someone needs to infer an ordering corresponding to the first table, it 
should be given as `[a, b, d]` and `[a, c, d]` because we can derive both `[a, 
b, c, d]` and `[a, c, b, d]` from these. However, the reverse is not true.
   
   **In short:**
   
   - `[a, c]` + `[b, c]` satisfies both `[a, b, c]` and `[b, a, c]`. (Global 
prefixes can be inserted anywhere in other orderings, and we can safely remove 
the duplicate element closer to the right-hand side.)
   - `[a, b, c]` + `[b, a, c]` does not satisfy `[a, c]` or `[b, c]`. (It 
actually satisfies one of them, but since we cannot know which one, we say it 
does not satisfy either.) However, if we know that `a` is unique, we can ensure 
that `[a, c]` is valid. Similarly, if `b` is unique, then `[b, c]` is valid. 
This is a task for another day.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to