Ted Chester Jenks created SPARK-51746:
-----------------------------------------

             Summary: Data dropped when aggregating on CSV corrupt_record column
                 Key: SPARK-51746
                 URL: https://issues.apache.org/jira/browse/SPARK-51746
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 3.5.5
            Reporter: Ted Chester Jenks


When parsing a simple, invalid CSV with Spark 3.5.5, I am able to drop data 
during aggregation.

The data:
{noformat}
col1,col2,col3
something,corrupt
not,so,bad
bad
{noformat}
Reading this with permissive parsing works as expected. However, the following 
aggregation results in the corrupt_record data being dropped:
{noformat}
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record"))
{noformat}
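A minimal PySpark sketch of the reproduction. The file path, the explicit schema, and the columnNameOfCorruptRecord option are illustrative assumptions about how the file is read (the default column name is _corrupt_record); the aggregation itself is the one reported above:
{noformat}
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# Assumed location of the CSV shown above; the corrupt-record column is
# declared in the schema and renamed via columnNameOfCorruptRecord.
df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "corrupt_record")
      .schema("col1 STRING, col2 STRING, col3 STRING, corrupt_record STRING")
      .csv("/tmp/corrupt.csv"))

df.show()  # the malformed rows appear in corrupt_record as expected

# Per the report, the corrupt_record values are silently dropped here
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record")).show()
{noformat}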
The bug goes away if the DataFrame is cached before the aggregate.
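For example, continuing from the assumed setup above:
{noformat}
# Caching materializes the corrupt_record column before the aggregate,
# so its values are no longer dropped.
df.cache()
df.groupBy("col1").agg(collect_list("corrupt_record").alias("corrupt_record")).show()
{noformat}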

This is similar to what happens when you perform other operations on a corrupt_record 
column, e.g. 
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-hdp/topics/cpd-one-workload-migration-spark-corrupt-csv.html.
 However, in those cases an informative error is produced rather than data 
being silently dropped. I think an error would be better here too.
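For comparison, this is roughly the existing behaviour the page above describes, under the same assumed setup (exact wording of the error varies by version):
{noformat}
# Referencing only the corrupt-record column without caching first fails with
# an AnalysisException rather than silently dropping data, e.g.:
df.select("corrupt_record").show()
# AnalysisException: ... queries from raw JSON/CSV files are disallowed when the
# referenced columns only include the internal corrupt record column ...
{noformat}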

 

!image-2025-04-08-15-11-11-498.png!


