Hi all,

Yesterday I filed a CSV parsing bug [1] for Spark that leads to incorrect
data when the input contains sequences similar to the one in the
report.

I wanted to take a look at the parsing logic to see if I could spot the
error to update the issue with more information and to possibly contribute
a PR with a bug fix, but I got completely lost navigating my way down the
dependencies in the Spark repository. Can someone point me in the right
direction?

I am looking for the CSV parser itself, which is likely a dependency?

The next question might need too much knowledge of Spark internals to
know where to look or to understand what I'd be looking at, but I would
also like to find out whether, and why, the CSV parsing implementation
differs when columns are projected as opposed to when the full DataFrame
is processed. The issue only occurs when projecting columns, and this
inconsistency is a worry in itself.

Many thanks,

Marnix

1. https://issues.apache.org/jira/browse/SPARK-38167
