qlong opened a new pull request, #54598:
URL: https://github.com/apache/spark/pull/54598

   
   ### What changes were proposed in this pull request?
   
   When PushVariantIntoScan rewrites variant_get() calls into struct field 
accesses, the rewritten predicates reference logical paths like "v.`0`" that 
ParquetFilters cannot resolve to any physical column, so they are dropped and 
row-group skipping is disabled for all shredded variant queries.
   
   This change adds variantExtractionSchema to ParquetFilters and resolves each 
logical path to the corresponding typed_value leaf in the physical Parquet 
schema. The resolved entries allow predicates on shredded variants to 
participate in row-group skipping.
   
   Array-index paths and fields absent from a file's physical schema are 
skipped.
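
A minimal sketch of the resolution idea, not the code in this PR. It assumes the shredding layout in which each shredded object field is a group holding a `value` fallback and a `typed_value` child. The function name `resolve_shredded_leaf`, the nested-dict schema model, and the example field names are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence, Tuple, Union

# Hypothetical schema model: a dict is a Parquet group, a str is a primitive leaf type.
PHYSICAL_SCHEMA = {
    "v": {
        "metadata": "binary",
        "value": "binary",
        "typed_value": {
            "a": {"value": "binary", "typed_value": "int64"},
            "b": {
                "value": "binary",
                "typed_value": {
                    "c": {"value": "binary", "typed_value": "binary"},
                },
            },
        },
    }
}


def resolve_shredded_leaf(
    column: str,
    variant_path: Sequence[Union[str, int]],
    physical_schema: dict,
) -> Optional[Tuple[str, ...]]:
    """Map a variant path to the physical typed_value leaf, or None if unresolvable."""
    node = physical_schema.get(column)
    if not isinstance(node, dict):
        return None
    physical = [column]
    for segment in variant_path:
        if isinstance(segment, int):
            # Array-index paths are skipped, as described above.
            return None
        typed = node.get("typed_value")
        if not isinstance(typed, dict) or segment not in typed:
            # The field is absent from this file's physical schema.
            return None
        physical += ["typed_value", segment]
        node = typed[segment]
        if not isinstance(node, dict):
            return None
    leaf = node.get("typed_value")
    if leaf is None or isinstance(leaf, dict):
        # Only primitive leaves carry statistics usable for filtering.
        return None
    physical.append("typed_value")
    return tuple(physical)
```

Under these assumptions, `resolve_shredded_leaf("v", ["a"], PHYSICAL_SCHEMA)` yields `("v", "typed_value", "a", "typed_value")`, while an array-index path such as `["a", 0]` or a missing field yields `None`, so no filter is pushed for it.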
   
   Jira: https://issues.apache.org/jira/browse/SPARK-55817
   
   ### Why are the changes needed?
   
   Performance improvement. Predicates on shredded variants are pushed down so 
that they participate in row-group filtering.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   - Unit tests for resolving pushed-down shredded variant predicates from the 
logical path to the physical column.
   - Tests to verify that row groups are skipped with Parquet filters.
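
For illustration only, the kind of decision a row-group skipping test exercises can be sketched as a min/max check on the resolved leaf. The `ColumnStats` class and `can_skip_greater_than` function are hypothetical, and the handling of nulls is an assumption of this sketch rather than a statement about the PR: a null `typed_value` may mean the value is stored in the untyped fallback column, so the sketch declines to skip in that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-row-group statistics for one physical leaf column."""
    min_value: int
    max_value: int
    null_count: int


def can_skip_greater_than(stats: ColumnStats, threshold: int) -> bool:
    """Return True when no row in the group can satisfy `leaf > threshold`."""
    if stats.null_count > 0:
        # Assumption: nulls may hide values kept in the untyped fallback,
        # so stay conservative and read the row group.
        return False
    return stats.max_value <= threshold
```

With these assumptions, a row group whose leaf statistics are `min=1, max=10, null_count=0` can be skipped for the predicate `> 10` but must be read for `> 9`.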
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Co-authored with Claude 4.6 Sonnet.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

