[ 
https://issues.apache.org/jira/browse/SPARK-40591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-40591:
-----------------------------
    Affects Version/s: 4.0.1
                       3.5.6
                       4.1.0

> ignoreCorruptFiles results data loss
> ------------------------------------
>
>                 Key: SPARK-40591
>                 URL: https://issues.apache.org/jira/browse/SPARK-40591
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0, 3.3.0, 3.2.2, 3.4.0, 4.1.0, 3.5.6, 4.0.1
>            Reporter: Kent Yao 2
>            Priority: Critical
>              Labels: correctness
>         Attachments: image-2022-09-28-09-20-21-693.png
>
>
> In the case below, the left and right sides read the same table and 
> partitions, both with ignoreCorruptFiles=true. On the right, a task skips 
> part of the data it reads after encountering 'corrupt data', while the left 
> reads the same file correctly. Because ignoreCorruptFiles coarsely catches 
> RuntimeException and IOException, a caught exception does not always 
> indicate data corruption.
> !image-2022-09-28-09-20-21-693.png!
>  
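
The failure mode described above can be illustrated with a small, self-contained
simulation. This is only a sketch, not Spark code: the names TransientReadError,
read_rows and scan_file are hypothetical, and the handler only mimics the idea of
catching IOException/RuntimeException broadly and skipping the rest of the file.
It assumes the error is transient (for example a timeout), so the file itself is
intact.

```python
# Hypothetical simulation; none of these names are Spark APIs.
class TransientReadError(IOError):
    """An I/O failure unrelated to the file's contents (e.g. a timeout)."""


def read_rows(rows, fail_at=None):
    """Yield rows one by one, raising an I/O error at index fail_at."""
    for i, row in enumerate(rows):
        if i == fail_at:
            raise TransientReadError("simulated network timeout")
        yield row


def scan_file(rows, fail_at=None, ignore_corrupt_files=True):
    """Collect rows; on IOError/RuntimeError, optionally skip the rest."""
    out = []
    try:
        for row in read_rows(rows, fail_at):
            out.append(row)
    except (IOError, RuntimeError):
        if not ignore_corrupt_files:
            raise
        # The remainder of the file is dropped silently here,
        # even though the file itself is not corrupt.
    return out


file_rows = list(range(10))
left = scan_file(file_rows)              # no failure: all rows are read
right = scan_file(file_rows, fail_at=4)  # transient failure: rows after it are lost
print(len(left), len(right))
```

In this sketch both scans read the same intact input, yet the one that hits a
transient error returns fewer rows without reporting anything, which matches
the left/right discrepancy in the attached image.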



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
