XiangpengHao commented on issue #14816:
URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2682636805

   Hi @Arpit-Bandejiya, sorry, I've been quite busy these days.
   
   If you have a bitmask and want to read only the flagged rows from Parquet, 
you can use `with_row_selection` directly on the reader builder 
(`ParquetRecordBatchReaderBuilder`, an alias of `ArrowReaderBuilder`): 
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection
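
   As a rough sketch of the bitmask side (not the parquet API itself): a row 
selection is essentially a run-length list of "select n rows" and "skip n rows" 
entries, so a per-row bitmask has to be collapsed into runs first. The `Run` 
enum and `mask_to_runs` below are hypothetical stand-ins written with the 
standard library only; with the parquet crate, each run would map to 
`RowSelector::select(n)` or `RowSelector::skip(n)`, and the resulting 
`Vec<RowSelector>` can be converted into a `RowSelection`.

```rust
/// One run of a row selection: read `n` rows or skip `n` rows.
/// Hypothetical local stand-in for the parquet crate's `RowSelector`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Collapse a per-row bitmask (true = read this row) into
/// alternating select/skip runs, in row order.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let flag = mask[i];
        let start = i;
        // Advance to the end of the current run of identical flags.
        while i < mask.len() && mask[i] == flag {
            i += 1;
        }
        let len = i - start;
        runs.push(if flag { Run::Select(len) } else { Run::Skip(len) });
    }
    runs
}
```

   For example, the mask `[false, false, true, true, true, false, true]` 
becomes skip 2, select 3, skip 1, select 1.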
   
   If you want DataFusion to produce a bitmask for other systems, I'm not 
aware of an easy way to do this. But this sounds like a join use case. Have you 
considered adding a row_id column to the Parquet files, so that you can select 
the row_id as the output and join on it with other systems?
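
   If the other system really needs a bitmask rather than a join key, one 
could build it from the returned row_ids afterwards. This is only a sketch 
under assumptions that are mine, not DataFusion's: the row_id column was 
written by you as a dense, 0-based position within the file, and the total row 
count is known. `row_ids_to_mask` is a hypothetical helper, not an existing API.

```rust
/// Build a bitmask with one entry per row in the file, setting the entry
/// for every row_id returned by the query.
/// Assumes row ids are dense 0-based positions; ids outside
/// `0..total_rows` are ignored (an arbitrary choice for this sketch).
fn row_ids_to_mask(row_ids: &[usize], total_rows: usize) -> Vec<bool> {
    let mut mask = vec![false; total_rows];
    for &id in row_ids {
        if id < total_rows {
            mask[id] = true;
        }
    }
    mask
}
```

   Because the mask is keyed by row_id, the order in which the query returns 
rows does not matter.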
   
   DataFusion has no control over the row ids read from Parquet, especially with 
filter pushdown, where row ids are heavily filtered. Even changing the 
`ParquetRecordBatchStream`, as @bharath-techie pointed out, is not enough: 
reading can happen concurrently, so determining the starting row_id of each 
stream is possible but quite hard.
   In fact, the reader is free to emit rows in any order, as long as the 
results are logically equivalent.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

