Thanks, Sean!

It was actually on the Catalyst side of things, but I found where column
pruning pushdown is delegated to univocity, see [1].
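To make the terminology concrete: column pruning here means the parser materializes only the requested field indexes instead of every field in the row, which is what univocity's *selectIndexes* setting controls. A toy Python sketch of the idea (purely illustrative, not univocity's actual implementation):

```python
import csv
from io import StringIO

data = "a,b,c\n1,2,3\n4,5,6\n"

# Full parse: every field of every row is materialized.
full = list(csv.reader(StringIO(data)))

# Pruned parse: only the selected column indexes are kept,
# analogous in spirit to univocity's selectIndexes setting.
selected = [0, 2]
pruned = [[row[i] for i in selected] for row in csv.reader(StringIO(data))]

print(pruned)  # [['a', 'c'], ['1', '3'], ['4', '6']]
```

The bug report suggests the two code paths above can disagree in univocity for certain input sequences, which is exactly the inconsistency described below.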

I've tried setting the Spark configuration
*spark.sql.csv.parser.columnPruning.enabled* to *false*, and this prevents
the bug from happening. I am unfamiliar with Java and Scala, so I might be
misreading things, but to me everything points to a bug in univocity,
specifically in how the *selectIndexes* parser setting affects the parsing
of the example in the bug report.
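For anyone hitting this in the meantime, a possible workaround sketch (not a fix: disabling pruning means all columns get parsed, with the efficiency cost Sean mentions):

```
# Disable CSV column-pruning pushdown at submit time (the default is true):
spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false

# Or at runtime in Spark SQL:
# SET spark.sql.csv.parser.columnPruning.enabled=false;
```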

This means that to fix this bug, univocity must be patched and Spark then
needs to upgrade to the fixed version, correct? Unless someone thinks this
analysis is off, I'll add this info to the Spark issue and file a bug
report with univocity.

1.
https://github.com/apache/spark/blob/6a59fba248359fb2614837fe8781dc63ac8fdc4c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L79

On Thu, Feb 10, 2022 at 5:39 PM Sean Owen <sro...@gmail.com> wrote:

> It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
> Yes univocity is used for much of the parsing.
> I am not sure of the cause of the bug but it does look like one indeed. In
> one case the parser is asked to read all fields, in the other, to skip one.
> The pushdown helps efficiency but something is going wrong.
>
> On Thu, Feb 10, 2022 at 10:34 AM Marnix van den Broek <
> marnix.van.den.br...@bundlesandbatches.io> wrote:
>
>> hi all,
>>
>> Yesterday I filed a CSV parsing bug [1] for Spark, that leads to data
>> incorrectness when data contains sequences similar to the one in the
>> report.
>>
>> I wanted to take a look at the parsing logic to see if I could spot the
>> error to update the issue with more information and to possibly contribute
>> a PR with a bug fix, but I got completely lost navigating my way down the
>> dependencies in the Spark repository. Can someone point me in the right
>> direction?
>>
>> I am looking for the CSV parser itself; is it pulled in as a dependency?
>>
>> The next question might need too much knowledge about Spark internals to
>> know where to look or understand what I'd be looking at, but I am also
>> looking to see if and why the implementation of the CSV parsing is
>> different when columns are projected, as opposed to processing the
>> full dataframe. The issue only occurs when projecting columns, and this
>> inconsistency is a worry in itself.
>>
>> Many thanks,
>>
>> Marnix
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-38167
>>
>>
