friendlymatthew opened a new issue, #19839:
URL: https://github.com/apache/datafusion/issues/19839

   ### Describe the bug
   
   The `ParquetOpener` errors when reading Parquet files that lack page index 
metadata during sparse column chunk reads with row selection masks. 
This manifests as "Invalid offset in sparse column chunk data: [offset], no 
matching page found" errors.
   
   I think the main problem here is the use of 
`ArrowReaderOptions::with_page_index(true)`, which internally sets 
`PageIndexPolicy::Required` and strictly requires page index metadata to be 
present. In arrow-rs, this boolean API was superseded by the more flexible 
`PageIndexPolicy` enum, which expands the 2 boolean states into 3 policy 
options (`Required`, `Optional`, `Skip`).
   
   Related issues
   
   - https://github.com/apache/arrow-rs/issues/9197
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   We should set the page index policy to `PageIndexPolicy::Optional` so that 
the reader gracefully handles files both with and without page index metadata.
   
   
https://github.com/apache/datafusion/blob/6f92ea6005c24441c2462b2fbe6aaefee3af478d/datafusion/datasource-parquet/src/opener.rs#L434-L435
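
To illustrate why a boolean flag cannot express the desired behavior, here is a minimal, standard-library-only sketch of the three-policy model. The enum below is a local stand-in, not the type from the `parquet` crate, and `policy_from_bool` and `load_page_index` are hypothetical helpers introduced only for this illustration. The mapping of `false` to `Skip` is an assumption; the `true` to `Required` mapping follows the description above.

```rust
/// Local stand-in for the three page index policies.
/// This is NOT the `parquet` crate's type, only a model of it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageIndexPolicy {
    /// Never attempt to read the page index.
    Skip,
    /// Read the page index if the file has one; do not error otherwise.
    Optional,
    /// Read the page index and error if the file lacks one.
    Required,
}

/// Hypothetical helper: how a boolean flag maps onto the policies.
/// `true` -> `Required` follows the issue text; `false` -> `Skip` is assumed.
/// Note that no boolean value can produce `Optional`.
pub fn policy_from_bool(enabled: bool) -> PageIndexPolicy {
    if enabled {
        PageIndexPolicy::Required
    } else {
        PageIndexPolicy::Skip
    }
}

/// Hypothetical helper: decides whether the page index ends up loaded.
/// Returns `Ok(true)` if loaded, `Ok(false)` if not loaded, and `Err`
/// if the policy demands an index that the file does not contain.
pub fn load_page_index(
    policy: PageIndexPolicy,
    file_has_index: bool,
) -> Result<bool, String> {
    match (policy, file_has_index) {
        (PageIndexPolicy::Skip, _) => Ok(false),
        (PageIndexPolicy::Optional, present) => Ok(present),
        (PageIndexPolicy::Required, true) => Ok(true),
        (PageIndexPolicy::Required, false) => {
            Err("page index required but not present".to_string())
        }
    }
}
```

In this model, `load_page_index(policy_from_bool(true), false)` returns an error, while `load_page_index(PageIndexPolicy::Optional, false)` succeeds without an index, which is the behavior requested here for files written without page index metadata.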
   
   ### Additional context
   
   _No response_

