XiangpengHao commented on issue #14816: URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2682636805
Hi @Arpit-Bandejiya, sorry, I've been quite busy these days.

If you have a bitmask and want to read only the flagged rows from Parquet, you can use `ArrowReaderBuilder::with_row_selection` directly (it is available on `ParquetRecordBatchReaderBuilder`): https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection

If you want DataFusion to produce a bitmask for other systems, I'm not aware of an easy way to do this. But this sounds like a join use case. Have you considered adding a row_id column to the Parquet files, so that you can select row_id as the output and join with other systems?

DataFusion has no control over the row ids read from Parquet, especially with filter pushdown, where rows are heavily filtered. Even changing the `ParquetRecordBatchStream` as @bharath-techie pointed out is not enough: reads can happen concurrently, so it is possible but quite hard to determine the starting row_id of each stream. In fact, the reader is free to emit rows in any order, as long as the results are logically equivalent.
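
For the first case (you already hold a bitmask), a row selection is essentially a run-length encoding of that mask: alternating "skip n rows" and "select n rows" runs. Below is a minimal, dependency-free sketch of that conversion. The `Selector` enum and `mask_to_selectors` function are hypothetical names made up for illustration and are not part of the parquet crate. In real code you would, as far as I understand the API, map each run to `RowSelector::skip(n)` or `RowSelector::select(n)`, build a `RowSelection` from the resulting `Vec<RowSelector>`, and pass it to `with_row_selection`.

```rust
/// Hypothetical stand-in for parquet's `RowSelector`:
/// a run of consecutive rows to either read or skip.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Selector {
    Select(usize),
    Skip(usize),
}

/// Collapse a per-row bitmask into alternating skip/select runs.
/// `mask[i] == true` means row `i` should be read.
fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let flag = mask[i];
        let start = i;
        // Advance to the end of the current run of identical flags.
        while i < mask.len() && mask[i] == flag {
            i += 1;
        }
        let len = i - start;
        runs.push(if flag {
            Selector::Select(len)
        } else {
            Selector::Skip(len)
        });
    }
    runs
}

fn main() {
    // Rows 2, 3, 4 and 6 are flagged.
    let mask = [false, false, true, true, true, false, true];
    let runs = mask_to_selectors(&mask);
    // prints: [Skip(2), Select(3), Skip(1), Select(1)]
    println!("{:?}", runs);
}
```

Note that the run lengths are relative to the rows of the file (or the row groups) being read, so the mask has to be aligned with the file's physical row order for the selection to pick the intended rows.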