Thanks, Sean! It was actually on the Catalyst side of things; I found where the column-pruning pushdown is delegated to univocity, see [1].
I've tried setting the Spark configuration *spark.sql.csv.parser.columnPruning.enabled* to *false*, and this prevents the bug from occurring. I am unfamiliar with Java / Scala so I might be misreading things, but to me everything points to a bug in univocity, specifically in how the *selectIndexes* parser setting affects the parsing of the example in the bug report.

This means that to fix this bug, univocity must be fixed and Spark then needs to depend on the fixed version, correct? Unless someone thinks this analysis is off, I'll add this information to the Spark issue and file a bug report with univocity.

1. https://github.com/apache/spark/blob/6a59fba248359fb2614837fe8781dc63ac8fdc4c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L79

On Thu, Feb 10, 2022 at 5:39 PM Sean Owen <sro...@gmail.com> wrote:

> It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
> Yes, univocity is used for much of the parsing.
> I am not sure of the cause of the bug, but it does look like one indeed. In
> one case the parser is asked to read all fields; in the other, to skip one.
> The pushdown helps efficiency, but something is going wrong.
>
> On Thu, Feb 10, 2022 at 10:34 AM Marnix van den Broek <
> marnix.van.den.br...@bundlesandbatches.io> wrote:
>
>> hi all,
>>
>> Yesterday I filed a CSV parsing bug [1] for Spark that leads to data
>> incorrectness when the data contains sequences similar to the one in
>> the report.
>>
>> I wanted to take a look at the parsing logic to see if I could spot the
>> error, update the issue with more information, and possibly contribute
>> a PR with a bug fix, but I got completely lost navigating the
>> dependencies in the Spark repository. Can someone point me in the right
>> direction?
>>
>> I am looking for the CSV parser itself, which is likely a dependency?
>>
>> The next question might need too much knowledge about Spark internals
>> to know where to look or understand what I'd be looking at, but I am
>> also wondering whether and why the implementation of CSV parsing is
>> different when columns are projected, as opposed to processing the full
>> dataframe. The issue only occurs when projecting columns, and this
>> inconsistency is a worry in itself.
>>
>> Many thanks,
>>
>> Marnix
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-38167