djspiewak opened a new pull request, #46928:
URL: https://github.com/apache/spark/pull/46928

   This PR fixes an issue in the vectorized Parquet reader that occurs when 
executing the `explode` function on nested arrays where an array spans two or 
more pages. It's probably possible to minimize the reproducer slightly further, 
but I wasn't able to find a smaller one. It's also worth noting that this issue 
illustrates a current gap in the lower-level unit tests for the vectorized 
reader, which don't appear to test much related to output vector offsets.
   
   The bug in question was a simple typo: the output row offset was used to 
dereference nested array lengths rather than the input row offset. This only 
matters for the `explode` function, and then only when resuming the same 
operation on a second page. This case and all related cases are, at present, 
untested. I added a high-level test and an example `.parquet` file which 
reproduce the issue and verify the fix, but it would be ideal if more tests 
were added at a lower level. It is very likely that other similar bugs are 
present within the vectorized reader as it relates to nested substructures 
remapped during the query pipeline.
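
To make the offset mix-up concrete, here is a minimal toy sketch in Python. It is not the Spark code: `copy_lengths`, `read_batch`, and the page data are all hypothetical, and the model only assumes what is described above, namely that nested array lengths are read from a page at an input row offset and written to an output batch at an output row offset.

```python
from typing import List


def copy_lengths(page_lengths: List[int], input_offset: int,
                 output: List[int], output_offset: int,
                 count: int, buggy: bool = False) -> None:
    """Copy `count` nested-array lengths from one page into the output batch.

    Hypothetical helper for illustration only; not part of Spark.
    """
    for i in range(count):
        # Correct: index the page by its own (input) row offset.
        # Buggy: index the page by the output batch's row offset instead.
        source_row = (output_offset if buggy else input_offset) + i
        output[output_offset + i] = page_lengths[source_row]


def read_batch(buggy: bool) -> List[int]:
    """Fill one 4-row output batch from two pages of made-up lengths."""
    first_page = [2, 1]          # array lengths for the rows read from page 1
    second_page = [3, 1, 4, 2]   # array lengths for the rows of page 2
    output = [0] * 4
    copy_lengths(first_page, 0, output, 0, 2, buggy)   # offsets coincide here
    copy_lengths(second_page, 0, output, 2, 2, buggy)  # offsets diverge here
    return output


print(read_batch(buggy=False))  # [2, 1, 3, 1]
print(read_batch(buggy=True))   # [2, 1, 4, 2]
```

In this sketch both variants agree while the first page is being read, because the two offsets happen to be equal. They diverge only once the batch resumes on the second page, which is consistent with the bug going unnoticed by tests that never cross a page boundary.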
   
   ### What changes were proposed in this pull request?
   
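   It fixes a fairly straightforward typo in the vectorized Parquet reader: the input row offset, rather than the output row offset, is now used to dereference nested array lengths.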
   
   ### Why are the changes needed?
   
   The vectorized Parquet reader does not correctly handle `explode` on nested arrays that span two or more pages.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Aside from fixing the vectorized reader? No.
   
   ### How was this patch tested?
   
   Unit test (well, more of an integration test) included in this PR.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Nope


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

