If I recall this problem correctly, the root cause is that the default zstd
compression block size is 256 KB, and Hadoop's zstd codec will attempt to use
the platform's default compression size if it is available. The recommended
output size is slightly larger than the input size, to account for the frame
header overhead in zstd compression:
http://software.icecube.wisc.edu/coverage/00_LATEST/icetray/private/zstd/lib/compress/zstd_compress.c.gcov.html#2982
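To illustrate, here is a minimal sketch against the public libzstd API (the
256 KB figure is just the block size mentioned above). It shows that the
worst-case output buffer must be larger than the input, so an output buffer
sized equal to the input can be too small:

#include <stdio.h>
#include <zstd.h>

int main(void) {
    size_t input = 256 * 1024;                /* e.g. one 256 KB block */
    size_t bound = ZSTD_compressBound(input); /* worst-case compressed size */
    size_t rec   = ZSTD_CStreamOutSize();     /* recommended streaming output size */

    /* bound > input: incompressible data still needs room for headers. */
    printf("input=%zu compressBound=%zu CStreamOutSize=%zu\n",
           input, bound, rec);
    return 0;
}

(compile with: cc test.c -lzstd)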
Meanwhile, the Hadoop code at
https://github.com/apache/hadoop/blame/trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L259
sets the output size equal to the input size whenever the input size is
larger than the output size. Manually setting the buffer size to a small
value keeps the input size below the recommended output size, which is why
the workaround below keeps the system working. Returning
ZSTD_CStreamOutSize() from getStreamSize() may let the system work without a
predefined default (see the sketch after the quoted thread below).

On Mon, May 11, 2020 at 2:29 PM Wei-Chiu Chuang <weic...@cloudera.com.invalid>
wrote:

> Thanks for the pointer, it does look similar. However we are roughly on the
> latest of branch-3.1 and this fix is in our branch. I'm pretty sure we have
> all the zstd fixes.
>
> I believe the libzstd version used is 1.4.4 but need to confirm. I
> suspected it's a library version issue because we've been using zstd
> compression for over a year, and this bug (reproducible) happens
> consistently just recently.
>
> On Mon, May 11, 2020 at 1:57 PM Ayush Saxena <ayush...@gmail.com> wrote:
>
> > Hi Wei Chiu,
> > What is the Hadoop version being used?
> > Give a check if HADOOP-15822 is there, it had a similar error.
> >
> > -Ayush
> >
> > > On 11-May-2020, at 10:11 PM, Wei-Chiu Chuang <weic...@apache.org> wrote:
> > >
> > > Hadoop devs,
> > >
> > > A colleague of mine recently hit a strange issue where the zstd
> > > compression codec crashes.
> > >
> > > Caused by: java.lang.InternalError: Error (generic)
> > >   at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.deflateBytesDirect(Native Method)
> > >   at org.apache.hadoop.io.compress.zstd.ZStandardCompressor.compress(ZStandardCompressor.java:216)
> > >   at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> > >   at org.apache.hadoop.io.compress.CompressorStream.write(CompressorStream.java:76)
> > >   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
> > >   at java.io.DataOutputStream.write(DataOutputStream.java:107)
> > >   at org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.writeKVPair(IFile.java:617)
> > >   at org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.append(IFile.java:480)
> > >
> > > Anyone out there hitting a similar problem?
> > >
> > > A temporary workaround is to set the buffer size: "set
> > > io.compression.codec.zstd.buffersize=8192;"
> > >
> > > We suspected it's a bug in the zstd library, but couldn't verify. Just
> > > want to send this out and see if I can get some luck.
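As referenced above, here is a rough sketch of the getStreamSize() idea. The
JNI symbol name mirrors Hadoop's ZStandardCompressor natives, and the direct
call to ZSTD_CStreamOutSize() stands in for Hadoop's dlsym-based dispatch,
so treat this as an assumption-laden illustration rather than the actual
source:

#include <jni.h>
#include <zstd.h>

/* Sketch only: report zstd's own recommended streaming output size
 * instead of relying on a predefined default. ZSTD_CStreamOutSize()
 * guarantees room for at least one complete compressed block,
 * headers included. */
JNIEXPORT jint JNICALL
Java_org_apache_hadoop_io_compress_zstd_ZStandardCompressor_getStreamSize
  (JNIEnv *env, jclass clazz)
{
    (void) env;
    (void) clazz;
    return (jint) ZSTD_CStreamOutSize();
}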