Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

Will Jones Fri, 02 Jun 2023 08:45:40 -0700

> The main downside with using the mask (or any solution based on a filter
> node / filtering) is that it requires that the delete indices go into the
> plan itself.  So you need to first read the delete files and then create
> the plan.  And, if there are many deleted rows, this can be costly.


Ah, I see. I was assuming you could load the indices within the fragment
scan, at the same time the page index was read. That's how I'm implementing
with Lance, and how I plan to implement with Delta Lake. But if you can't
do that, then filtering with an anti-join makes sense. You wouldn't want to
include those in a plan.

On Fri, Jun 2, 2023 at 7:38 AM Weston Pace <[email protected]> wrote:

> Also, for clarity, I do agree with Gang that these are both valuable
> features in their own right.  A mask makes a lot of sense for page indices.
>
> On Fri, Jun 2, 2023 at 7:36 AM Weston Pace <[email protected]> wrote:
>
> > > then I think the incremental cost of adding the
> > > positional deletes to the mask is probably lower than the anti-join.
> > Do you mean developer cost?  Then yes, I agree.  Although there may be
> > some subtlety in the pushdown to connect a dataset filter to a parquet
> > reader filter.
> >
> > The main downside with using the mask (or any solution based on a filter
> > node / filtering) is that it requires that the delete indices go into the
> > plan itself.  So you need to first read the delete files and then create
> > the plan.  And, if there are many deleted rows, this can be costly.
> >
> > On Thu, Jun 1, 2023 at 7:13 PM Will Jones <[email protected]>
> wrote:
> >
> >> That's a good point, Gang. To perform deletes, we definitely need the
> row
> >> index, so we'll want that regardless of whether it's used in scans.
> >>
> >> > I'm not sure a mask would be the ideal solution for Iceberg (though it
> >> is
> >> a reasonable feature in its own right) because I think position-based
> >> deletes, in Iceberg, are still done using an anti-join and not a filter.
> >>
> >> For just positional deletes in isolation, I agree the mask wouldn't be
> >> more
> >> optimal than the anti-join. But if they end up using the mask for
> >> filtering
> >> with the page index, then I think the incremental cost of adding the
> >> positional deletes to the mask is probably lower than the anti-join.
> >>
> >> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu <[email protected]> wrote:
> >>
> >> > IMO, the adding a row_index column from the reader is orthogonal to
> >> > the mask implementation. Table formats (e.g. Apache Iceberg and
> >> > Delta) require the knowledge of row index to finalize row deletion. It
> >> > would be trivial to natively support row index from the file reader.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace <[email protected]>
> >> wrote:
> >> >
> >> > > I agree that having a row_index is a good approach.  I'm not sure a
> >> mask
> >> > > would be the ideal solution for Iceberg (though it is a reasonable
> >> > feature
> >> > > in its own right) because I think position-based deletes, in
> Iceberg,
> >> are
> >> > > still done using an anti-join and not a filter.
> >> > >
> >> > > That being said, we probably also want to implement a streaming
> >> > merge-based
> >> > > anti-join because I believe delete files are ordered by row_index
> and
> >> so
> >> > a
> >> > > streaming approach is likely to be much more performant.
> >> > >
> >> > > On Mon, May 29, 2023 at 4:01 PM Will Jones <[email protected]
> >
> >> > > wrote:
> >> > >
> >> > > > Hi Rusty,
> >> > > >
> >> > > > At first glance, I think adding a row_index column would make
> >> sense. To
> >> > > be
> >> > > > clear, this would be an index within a file / fragment, not across
> >> > > multiple
> >> > > > files, which don't necessarily have a known ordering in Acero
> >> (IIUC).
> >> > > >
> >> > > > However, another approach would be to take a mask argument in the
> >> > Parquet
> >> > > > reader. We may wish to do this anyways for support for using
> >> predicate
> >> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
> >> > > implemented
> >> > > > predicate pushdown on page index (right now just supports row
> >> groups),
> >> > > > Arrow Rust has and provides an API to pass in a mask to support
> it.
> >> The
> >> > > > reason for this implementation is described in the blog post
> >> "Querying
> >> > > > Parquet with Millisecond Latency" [1], under "Page Pruning". The
> >> > > > RowSelection struct API is worth a look [2].
> >> > > >
> >> > > > I'm not yet sure which would be preferable, but I think adopting a
> >> > > similar
> >> > > > pattern to what the Rust community has done may be wise. It's
> >> possible
> >> > > that
> >> > > > row_index is easy to implement while the mask will take time, in
> >> which
> >> > > case
> >> > > > row_index makes sense as an interim solution.
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Will Jones
> >> > > >
> >> > > > [1]
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> >> > > > [2]
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
> >> > > >
> >> > > > On Mon, May 29, 2023 at 2:12 PM Rusty Conover
> >> <[email protected]
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Arrow Team,
> >> > > > >
> >> > > > > I wanted to suggest an improvement regarding Acero's Scan node.
> >> > > > > Currently, it provides useful information such as
> >> __fragment_index,
> >> > > > > __batch_index, __filename, and __last_in_fragment. However, it
> >> would
> >> > > > > be beneficial to have an additional column that returns an
> overall
> >> > > > > "row index" from the source.
> >> > > > >
> >> > > > > The row index would start from zero and increment for each row
> >> > > > > retrieved from the source, particularly in the case of Parquet
> >> files.
> >> > > > > Is it currently possible to obtain this row index or would
> >> expanding
> >> > > > > the Scan node's behavior be required?
> >> > > > >
> >> > > > > Having this row index column would be valuable in implementing
> >> > support
> >> > > > > for Iceberg's positional-based delete files, as outlined in the
> >> > > > > following link:
> >> > > > >
> >> > > > > https://iceberg.apache.org/spec/#delete-formats
> >> > > > >
> >> > > > > While Iceberg's value-based deletes can already be performed
> using
> >> > the
> >> > > > > support for anti joins, using a projection node does not
> guarantee
> >> > the
> >> > > > > row ordering within an Acero graph. Hence, the inclusion of a
> >> > > > > dedicated row index column would provide a more reliable
> solution
> >> in
> >> > > > > this context.
> >> > > > >
> >> > > > > Thank you for considering this suggestion.
> >> > > > >
> >> > > > > Rusty
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

Reply via email to