Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Marnix van den Broek
Thanks, Sean! It was actually on the Catalyst side of things, but I found where column pruning pushdown is delegated to univocity, see [1]. I've tried setting the Spark configuration *spark.sql.csv.parser.columnPruning.enabled* to *False*, and this prevents the bug from happening. I am unfamiliar
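
For readers who want to try the workaround described in this message, here is a minimal PySpark sketch. Only the option name *spark.sql.csv.parser.columnPruning.enabled* comes from the thread; the helper names, the application name, and the choice to set the option at runtime are assumptions made for illustration, not something the thread prescribes.

```python
# Option name as quoted in the thread.
CSV_COLUMN_PRUNING_KEY = "spark.sql.csv.parser.columnPruning.enabled"


def build_session_without_csv_pruning(app_name="csv-pruning-workaround"):
    """Create (or reuse) a SparkSession with CSV column pruning disabled.

    Hypothetical helper; the app name is an arbitrary placeholder.
    """
    # Imported lazily so the module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config(CSV_COLUMN_PRUNING_KEY, "false")
        .getOrCreate()
    )


def disable_csv_column_pruning(spark):
    """Disable CSV column pruning on an existing session.

    Returns the value now stored for the option, so the caller can
    confirm that the setting was applied.
    """
    spark.conf.set(CSV_COLUMN_PRUNING_KEY, "false")
    return spark.conf.get(CSV_COLUMN_PRUNING_KEY)
```

With the option disabled, the CSV reader is asked to parse every column instead of only the selected ones. That sidesteps the pruning code path at some cost in read efficiency, so it is better treated as a diagnostic or temporary workaround than as a fix.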

Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Sean Owen
It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource. Yes, univocity is used for much of the parsing. I am not sure of the cause of the bug, but it does look like one indeed. In one case the parser is asked to read all fields; in the other, to skip one. The pushdown helps efficie

Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Marnix van den Broek
Hi all, Yesterday I filed a CSV parsing bug [1] for Spark that leads to incorrect data when the input contains sequences similar to the one in the report. I wanted to take a look at the parsing logic to see if I could spot the error, to update the issue with more information, and to possibly contri