djspiewak opened a new pull request, #46928: URL: https://github.com/apache/spark/pull/46928
This PR fixes an issue in the vectorized Parquet reader when executing the `explode` function on nested arrays where an array spans two or more pages. It is probably possible to minimize the reproducer slightly more, but I wasn't able to find a smaller one. It's also worth noting that this issue illustrates a current gap in the lower-level unit tests for the vectorized reader, which don't appear to test much related to output vector offsets.

The bug in question was a simple typo: the output row offset was used to dereference nested array lengths rather than the input row offset. This only matters for the `explode` function, and then only when resuming the same operation on a second page. This case (and all related cases) is, at present, untested. I added a high-level test and an example `.parquet` file which reproduce the issue and verify the fix, but it would be ideal if more tests were added at a lower level. It is very likely that other similar bugs are present within the vectorized reader as it relates to nested substructures remapped during the query pipeline.

### What changes were proposed in this pull request?

It's a fairly straightforward typo issue in the code.

### Why are the changes needed?

The vectorized Parquet reader does not correctly handle this case.

### Does this PR introduce _any_ user-facing change?

Aside from fixing the vectorized reader? No.

### How was this patch tested?

Unit test (well, more of an integration test) included in the PR.

### Was this patch authored or co-authored using generative AI tooling?

Nope
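For readers unfamiliar with this class of bug, here is a minimal toy sketch of the offset mix-up described above. It is not Spark's code: `copy_lengths`, `read_two_pages`, and the page contents are hypothetical names and data invented for illustration, and the real reader is considerably more involved. The sketch only shows why indexing the page's lengths by the output offset goes unnoticed on the first page and breaks once a read resumes on a second page.

```python
from typing import List


def copy_lengths(page_lengths: List[int], input_offset: int,
                 output: List[int], output_offset: int,
                 count: int, buggy: bool = False) -> None:
    """Copy `count` nested-array lengths from one page into the output vector.

    Hypothetical helper: the correct source index follows the position
    within the page (input offset); the buggy variant follows the position
    within the output vector instead.
    """
    for i in range(count):
        src = (output_offset if buggy else input_offset) + i
        output[output_offset + i] = page_lengths[src]


def read_two_pages(buggy: bool) -> List[int]:
    # Made-up data: two pages of nested-array lengths feeding one output batch.
    page1 = [2, 3]
    page2 = [1, 4, 5, 6]
    output = [0, 0, 0, 0]

    # First page: input and output offsets are both 0, so the bug is invisible.
    copy_lengths(page1, 0, output, 0, 2, buggy)
    # Second page: the output offset is 2 but the page is read from 0 again.
    copy_lengths(page2, 0, output, 2, 2, buggy)
    return output


fixed = read_two_pages(buggy=False)
broken = read_two_pages(buggy=True)
print(fixed)   # [2, 3, 1, 4]
print(broken)  # [2, 3, 5, 6]
```

Both variants agree on the rows taken from the first page, which is consistent with the observation above that the problem only appears when the operation resumes on a second page. A lower-level test along these lines would need to exercise a non-zero output offset.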
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org