findepi commented on issue #13261:
URL: https://github.com/apache/datafusion/issues/13261#issuecomment-2458092019

   > use a function like `ROW_NUMBER` to figure out the positions of rows. It would be great if the parquet reader machinery could expose this information directly instead.
   
   The SQL-level approach would work only if the source file isn't filtered: no predicates, no pre-existing deletion vectors, etc.
   I agree with the assessment that the information must come from the file reader itself.
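
   To illustrate the point about filtering, here is a minimal sketch in plain Rust (no DataFusion involved). The slice `file_rows` and the helpers `positions_after_filter` and `positions_from_reader` are hypothetical names made up for this example. Numbering rows after a predicate has been applied loses the original file positions, while numbering them at read time preserves them:

```rust
/// Numbers rows AFTER filtering, which is roughly what a SQL-level
/// ROW_NUMBER sees when the scan has already dropped rows.
fn positions_after_filter(rows: &[i64], keep: impl Fn(i64) -> bool) -> Vec<usize> {
    rows.iter()
        .copied()
        .filter(|v| keep(*v))
        .enumerate()
        .map(|(i, _)| i)
        .collect()
}

/// Numbers rows BEFORE filtering, as a reader that knows the physical
/// row position could do.
fn positions_from_reader(rows: &[i64], keep: impl Fn(i64) -> bool) -> Vec<usize> {
    rows.iter()
        .copied()
        .enumerate()
        .filter(|(_, v)| keep(*v))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Hypothetical file contents; the index in the slice is the row position.
    let file_rows = [10, 25, 7, 40, 3];
    let keep = |v: i64| v > 8;

    // Positions are renumbered densely and no longer refer to the file.
    println!("{:?}", positions_after_filter(&file_rows, keep)); // [0, 1, 2]
    // Positions still refer to rows in the file.
    println!("{:?}", positions_from_reader(&file_rows, keep)); // [0, 1, 3]
}
```

   The two results diverge as soon as any row before a match is dropped, which is why the position has to be attached by the reader rather than derived afterwards.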
   
   
   > ### Describe the solution you'd like
   > I'm not sure what a good API would look like here, but one idea is that the parquet reader could expose some new option that enables row position information to be returned as some special column name, i.e.
   > 
   > ```rust
   > let ctx = SessionContext::new_with_config(
   >     SessionConfig::default()
   >         .set_bool("datafusion.execution.parquet.include_row_position", true),
   > );
   > let record_batches = ctx
   >     .read_parquet("foo.parquet")
   >     .filter(filters)
   >     .select(PARQUET_ROW_POSITION)
   >     .collect();
   > // `record_batches` now contains the indexes of rows in "foo.parquet" that
   > // match the provided filters.
   > ```
   
   I like the syntax.
   @alamb can this be handled with some form of a hidden column?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

