I reported a bug in ZStandardCodec in the Hadoop library. If you run Hive with ZStandard compression for Tez intermediate data, you might be affected by this bug.

https://issues.apache.org/jira/browse/HDFS-14099

The problem occurs when the input file is large (e.g., 25MB) and does not compress well. (The native zstd library itself is fine: standalone zstd compresses and decompresses the same input data successfully.) I think the bug rarely manifests in practice, but we ran into it while testing with 10TB TPC-DS data. A sample query for reproducing the problem (with hive.execution.mode=llap):

select /*+ semi(store_returns, sr_ticket_number, store_sales, 34171240) */
  ss_quantity
from store_sales, store_returns
where ss_ticket_number = sr_ticket_number
  and sr_returned_date_sk between 2451789 and 2451818
limit 100;
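
If you want to check the codec in isolation, below is a minimal sketch (not from the JIRA itself) of a round trip through Hadoop's ZStandardCodec with a large, incompressible buffer, matching the trigger conditions described above. It assumes a Hadoop build with native zstd support loaded; the class and driver names here are illustrative. When the bug is hit, the restored bytes would not match the input.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.ZStandardCodec;

public class ZstdRoundTrip {
  public static void main(String[] args) throws Exception {
    // 25MB of random bytes: large and effectively incompressible,
    // the conditions under which the problem was observed.
    byte[] input = new byte[25 * 1024 * 1024];
    new Random(0).nextBytes(input);

    ZStandardCodec codec = new ZStandardCodec();
    codec.setConf(new Configuration());

    // Compress through the Hadoop codec (not the standalone zstd tool).
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (CompressionOutputStream out = codec.createOutputStream(compressed)) {
      out.write(input);
    }

    // Decompress through the same codec and compare with the original.
    ByteArrayOutputStream restored = new ByteArrayOutputStream();
    try (CompressionInputStream in = codec.createInputStream(
        new ByteArrayInputStream(compressed.toByteArray()))) {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        restored.write(buf, 0, n);
      }
    }

    System.out.println("round trip ok: "
        + Arrays.equals(input, restored.toByteArray()));
  }
}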


--- Sungwoo
