I have reported a bug in Hadoop's ZStandardCodec. If you run Hive with
ZStandard compression for Tez intermediate data, you may be affected by it.
https://issues.apache.org/jira/browse/HDFS-14099
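For context, the affected code path is taken when Tez intermediate data
compression is enabled with this codec, e.g. via the standard Tez runtime
properties (shown here as Hive session commands):

set tez.runtime.compress=true;
set tez.runtime.compress.codec=org.apache.hadoop.io.compress.ZStandardCodec;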
The problem occurs when the input file is large (e.g., 25MB) and does not
compress well. (The zstd native library itself is fine, as zstd successfully
compresses and decompresses the same input data.) I think it rarely occurs in
practice, but we ran into it when testing with 10TB of TPC-DS data. A sample
query that reproduces the problem (with hive.execution.mode=llap):
select /*+ semi(store_returns, sr_ticket_number, store_sales, 34171240) */
ss_quantity
from store_sales, store_returns
where ss_ticket_number = sr_ticket_number
  and sr_returned_date_sk between 2451789 and 2451818
limit 100;
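The failure can also be driven directly against the codec, outside Hive. Below
is a minimal round-trip sketch through Hadoop's ZStandardCodec, assuming
hadoop-common with native zstd support is on the classpath; the class name
ZstdRoundTrip, the 25MB size, and the fixed random seed are only illustrative
(random bytes stand in for intermediate data that compresses poorly):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.ZStandardCodec;

public class ZstdRoundTrip {
    public static void main(String[] args) throws IOException {
        // Large input of random bytes, i.e. data that does not compress
        // well, mirroring the ~25MB case from the report.
        byte[] input = new byte[25 * 1024 * 1024];
        new Random(0).nextBytes(input);

        ZStandardCodec codec = new ZStandardCodec();
        codec.setConf(new Configuration());

        // Compress through the Hadoop codec (not the zstd CLI, which
        // handles the same data correctly).
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (CompressionOutputStream out = codec.createOutputStream(compressed)) {
            out.write(input);
        }

        // Decompress through the Hadoop codec; this is where the
        // reported failure shows up.
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (CompressionInputStream in = codec.createInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                restored.write(buf, 0, n);
            }
        }

        System.out.println("round trip ok: "
            + Arrays.equals(input, restored.toByteArray()));
    }
}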
--- Sungwoo