Re: Optimize Equality Deletes with Sorting

2025-04-07 Thread Daniel Weeks
Hey Edgar, Thanks for the well-articulated proposal. I'm a little concerned that the proposed approach only partially addresses the underlying challenge with equality deletes. Equality deletes are extremely powerful because you can delete a row anywhere in the dataset without any read cost. The

Re: Optimize Equality Deletes with Sorting

2025-04-05 Thread Péter Váry
Hi Edgar, Thanks for the well-described proposal! Knowing the Flink connector, I have the following concerns: - The Flink connector currently doesn't sort the rows in the data files. It "chickens out" of this to avoid keeping anything in memory. - Sorting the equality delete rows would also add memor

Re: Optimize Equality Deletes with Sorting

2025-04-02 Thread Gang Wu
Correct me if I'm wrong, but the spec does not enforce `identifier fields` for equality delete files. Engines are free to use different `equality_ids` across commits, though that use case should be rare. Similarly, what sort order should we use? It is common for a table to set sort order on columns other than the primary k

Optimize Equality Deletes with Sorting

2025-04-01 Thread Edgar Rodriguez
Hi all, I know there have been some conversations regarding optimization of equality deletes and even their possible deprecation. We have been thinking internally about a way to optimize merge-on-read with equality deletes to better balance the read performance while having the benefits of performant