Ted Chester Jenks created SPARK-51746:
-----------------------------------------
Summary: Data dropped when aggregating on CSV corrupt_record column
Key: SPARK-51746
URL: https://issues.apache.org/jira/browse/SPARK-51746
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 3.5.5
Reporter: Ted Chester Jenks

When parsing a simple, invalid CSV into Spark (3.5.5), I am able to drop data during aggregation. The data:
{noformat}
col1,col2,col3
something,corrupt
not,so,bad
bad
{noformat}
Reading this with permissive parsing works as expected. However, the following causes the corrupt_record data to be silently dropped:
{noformat}
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record"))
{noformat}
Caching the DataFrame before the aggregate works around the bug.

This is similar to what happens when you perform other operations on a corrupt_record column, e.g. https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-hdp/topics/cpd-one-workload-migration-spark-corrupt-csv.html. However, in those cases an informative error is produced, rather than data being silently dropped. I think an error would be better here too.

!image-2025-04-08-15-11-11-498.png!
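A minimal PySpark sketch of the reproduction described above, assuming the CSV shown is saved at a hypothetical path /tmp/corrupt.csv, the corrupt-record column is exposed via the columnNameOfCorruptRecord option, and an explicit schema is supplied so the column is retained:
{noformat}
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# Read the malformed CSV in PERMISSIVE mode, capturing bad rows in corrupt_record.
df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "corrupt_record")
      .schema("col1 STRING, col2 STRING, col3 STRING, corrupt_record STRING")
      .csv("/tmp/corrupt.csv"))  # hypothetical path for this sketch

df.show()  # corrupt rows are visible here, as expected

# Aggregating directly: corrupt_record values come back empty (the reported bug).
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record")).show()

# Caching first works around the issue and the corrupt_record values are kept.
df.cache()
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record")).show()
{noformat}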