Also, for clarity, I do agree with Gang that these are both valuable features in their own right. A mask makes a lot of sense for page indices.
On Fri, Jun 2, 2023 at 7:36 AM Weston Pace <weston.p...@gmail.com> wrote: > > then I think the incremental cost of adding the > > positional deletes to the mask is probably lower than the anti-join. > Do you mean developer cost? Then yes, I agree. Although there may be > some subtlety in the pushdown to connect a dataset filter to a parquet > reader filter. > > The main downside with using the mask (or any solution based on a filter > node / filtering) is that it requires that the delete indices go into the > plan itself. So you need to first read the delete files and then create > the plan. And, if there are many deleted rows, this can be costly. > > On Thu, Jun 1, 2023 at 7:13 PM Will Jones <will.jones...@gmail.com> wrote: > >> That's a good point, Gang. To perform deletes, we definitely need the row >> index, so we'll want that regardless of whether it's used in scans. >> >> > I'm not sure a mask would be the ideal solution for Iceberg (though it >> is >> a reasonable feature in its own right) because I think position-based >> deletes, in Iceberg, are still done using an anti-join and not a filter. >> >> For just positional deletes in isolation, I agree the mask wouldn't be >> more >> optimal than the anti-join. But if they end up using the mask for >> filtering >> with the page index, then I think the incremental cost of adding the >> positional deletes to the mask is probably lower than the anti-join. >> >> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu <ust...@gmail.com> wrote: >> >> > IMO, the adding a row_index column from the reader is orthogonal to >> > the mask implementation. Table formats (e.g. Apache Iceberg and >> > Delta) require the knowledge of row index to finalize row deletion. It >> > would be trivial to natively support row index from the file reader. >> > >> > Best, >> > Gang >> > >> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace <weston.p...@gmail.com> >> wrote: >> > >> > > I agree that having a row_index is a good approach. I'm not sure a >> mask >> > > would be the ideal solution for Iceberg (though it is a reasonable >> > feature >> > > in its own right) because I think position-based deletes, in Iceberg, >> are >> > > still done using an anti-join and not a filter. >> > > >> > > That being said, we probably also want to implement a streaming >> > merge-based >> > > anti-join because I believe delete files are ordered by row_index and >> so >> > a >> > > streaming approach is likely to be much more performant. >> > > >> > > On Mon, May 29, 2023 at 4:01 PM Will Jones <will.jones...@gmail.com> >> > > wrote: >> > > >> > > > Hi Rusty, >> > > > >> > > > At first glance, I think adding a row_index column would make >> sense. To >> > > be >> > > > clear, this would be an index within a file / fragment, not across >> > > multiple >> > > > files, which don't necessarily have a known ordering in Acero >> (IIUC). >> > > > >> > > > However, another approach would be to take a mask argument in the >> > Parquet >> > > > reader. We may wish to do this anyways for support for using >> predicate >> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet >> > > implemented >> > > > predicate pushdown on page index (right now just supports row >> groups), >> > > > Arrow Rust has and provides an API to pass in a mask to support it. >> The >> > > > reason for this implementation is described in the blog post >> "Querying >> > > > Parquet with Millisecond Latency" [1], under "Page Pruning". The >> > > > RowSelection struct API is worth a look [2]. >> > > > >> > > > I'm not yet sure which would be preferable, but I think adopting a >> > > similar >> > > > pattern to what the Rust community has done may be wise. It's >> possible >> > > that >> > > > row_index is easy to implement while the mask will take time, in >> which >> > > case >> > > > row_index makes sense as an interim solution. >> > > > >> > > > Best, >> > > > >> > > > Will Jones >> > > > >> > > > [1] >> > > > >> > > > >> > > >> > >> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ >> > > > [2] >> > > > >> > > > >> > > >> > >> https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html >> > > > >> > > > On Mon, May 29, 2023 at 2:12 PM Rusty Conover >> <ru...@conover.me.invalid >> > > >> > > > wrote: >> > > > >> > > > > Hi Arrow Team, >> > > > > >> > > > > I wanted to suggest an improvement regarding Acero's Scan node. >> > > > > Currently, it provides useful information such as >> __fragment_index, >> > > > > __batch_index, __filename, and __last_in_fragment. However, it >> would >> > > > > be beneficial to have an additional column that returns an overall >> > > > > "row index" from the source. >> > > > > >> > > > > The row index would start from zero and increment for each row >> > > > > retrieved from the source, particularly in the case of Parquet >> files. >> > > > > Is it currently possible to obtain this row index or would >> expanding >> > > > > the Scan node's behavior be required? >> > > > > >> > > > > Having this row index column would be valuable in implementing >> > support >> > > > > for Iceberg's positional-based delete files, as outlined in the >> > > > > following link: >> > > > > >> > > > > https://iceberg.apache.org/spec/#delete-formats >> > > > > >> > > > > While Iceberg's value-based deletes can already be performed using >> > the >> > > > > support for anti joins, using a projection node does not guarantee >> > the >> > > > > row ordering within an Acero graph. Hence, the inclusion of a >> > > > > dedicated row index column would provide a more reliable solution >> in >> > > > > this context. >> > > > > >> > > > > Thank you for considering this suggestion. >> > > > > >> > > > > Rusty >> > > > > >> > > > >> > > >> > >> >