Hi all,
I would like to ask whether there are any known or potential workarounds on
the Spark side for a reproducible failure in Hadoop’s native ZSTD
decompression. The issue appears to be triggered specifically when the
original (uncompressed) file size is smaller than 129 KiB.
Environment:
- Apache Spark 3.5.7 (Scala 2.12) with Hadoop 3.3.4
- libhadoop.so from Apache Hadoop 3.3.6
- libzstd 1.5.4
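In case it is relevant, below is how one can confirm which libhadoop.so the JVM actually picked up and whether it was built with ZSTD support. This is only a debugging sketch that goes through PySpark's internal py4j gateway (`spark._jvm`); run it on the driver, or inside a task to inspect an executor:

```python
# Sketch: ask Hadoop's NativeCodeLoader which native library was loaded
# and whether that build has ZSTD support.
ncl = spark._jvm.org.apache.hadoop.util.NativeCodeLoader
if ncl.isNativeCodeLoaded():
    print("libhadoop:", ncl.getLibraryName())        # path of the loaded libhadoop.so
    print("zstd support:", ncl.buildSupportsZstd())  # expected to be True in this setup
else:
    print("native hadoop library not loaded")
```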
Summary of the problem:
When Spark reads a ZSTD-compressed file through Hadoop’s native
ZStandardDecompressor, the following errors can be reproduced reliably:
1. For files whose original size is <129 KiB:
java.lang.InternalError: Src size is incorrect
2. Under a slightly different sequence of reads:
java.lang.InternalError: Restored data doesn't match checksum
These errors occur even though the ZSTD files themselves are valid and can be
decompressed without error by the `zstd` command-line tools.
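For context, the failure is in Hadoop's codec layer rather than in Spark's own ZSTD usage: as far as I can tell, for a `.zst` input the text reader resolves the codec through Hadoop's `CompressionCodecFactory`, which hands back `ZStandardCodec` and its native `ZStandardDecompressor`. A small sketch (again via the internal `spark._jvm` gateway, debugging only) to see that resolution:

```python
# Sketch: resolve the Hadoop codec for a .zst path the way the line reader
# does. With a default configuration this prints
# org.apache.hadoop.io.compress.ZStandardCodec.
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
factory = jvm.org.apache.hadoop.io.compress.CompressionCodecFactory(hadoop_conf)
codec = factory.getCodec(jvm.org.apache.hadoop.fs.Path("file_128KiB.txt.zst"))
print(codec.getClass().getName())
```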
Reproduction procedure:
1. `yes a | head -n 65536 > file_128KiB.txt` (128 KiB)
2. `zstd file_128KiB.txt`
3. Validate with `zstd -lv` and `zstdcat`.
4. In PySpark:
`spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()`
5. The executor raises `InternalError: Src size is incorrect`.
A second read sequence, involving both a 129 KiB and a 128 KiB file, reproduces
`InternalError: Restored data doesn't match checksum`.
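To make the question more concrete, the kind of Spark-side workaround I have in mind is something like the sketch below: bypass the Hadoop codec for these small files by reading the raw bytes with the `binaryFile` source and decompressing in Python. This is untested, and it assumes the `zstandard` package is installed on the executors and that each file fits in executor memory:

```python
# Rough, untested sketch of a possible Spark-side workaround: read the
# compressed files as raw bytes and decompress them with the Python
# `zstandard` package instead of Hadoop's native ZStandardDecompressor.
import io
import zstandard

def decompress_to_lines(row):
    # Decompress one whole .zst file (read as raw bytes) and emit its lines.
    dctx = zstandard.ZstdDecompressor()
    out = io.BytesIO()
    dctx.copy_stream(io.BytesIO(bytes(row.content)), out)  # streaming decompression
    for line in out.getvalue().decode("utf-8").splitlines():
        yield (line,)

raw = (spark.read.format("binaryFile")
       .load("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst"))
lines = spark.createDataFrame(raw.rdd.flatMap(decompress_to_lines), "value string")
lines.show()
```

If there is a cleaner option on the Spark side (a configuration change, a different codec implementation Spark can be pointed at, or an upgrade path that avoids the bug), that would obviously be preferable.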
Full details, including stack traces and the exact command sequences, are in my
comment on the Hadoop JIRA issue: https://issues.apache.org/jira/browse/HADOOP-18799
Thanks
--
*camper42*
Douban, Inc.
E-mail: [email protected]