Hello Sungwoo,

It looks like HDFS-14099 was fixed in Apache Hadoop releases 3.2.3, 3.3.2,
and 3.4.0. Users of Apache Hive 3.1.3 and earlier would be impacted. (Hive
3.1.3 used Hadoop 3.1.0.) However, Hive 4.0.0 uses Hadoop 3.3.6, so I think
we can consider this resolved by the latest Hive release.

If you think there is still a bug remaining in the latest releases (in
either Hadoop or Hive), please let us know.
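
If anyone wants to double-check this against a newer Hadoop outside of
Hive/Tez, a direct compress/decompress round trip through ZStandardCodec on
data that does not compress well should exercise the same code path. A rough
sketch follows; the class name, data size, and buffer size are illustrative
rather than taken from the ticket, and it needs a native Hadoop build with
zstd support or the codec will fail to initialize.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.ZStandardCodec;

// Round-trips a large block of poorly compressible data through
// ZStandardCodec. 25MB mirrors the size mentioned in the report; it is
// not a hard threshold.
public class ZstdCodecRoundTrip {
  public static void main(String[] args) throws Exception {
    byte[] input = new byte[25 * 1024 * 1024];
    new Random(42L).nextBytes(input);  // random bytes compress poorly

    ZStandardCodec codec = new ZStandardCodec();
    codec.setConf(new Configuration());

    // Compress into memory.
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (CompressionOutputStream out = codec.createOutputStream(compressed)) {
      out.write(input);
    }

    // Decompress and compare with the original bytes.
    ByteArrayOutputStream restored = new ByteArrayOutputStream();
    try (CompressionInputStream in = codec.createInputStream(
        new ByteArrayInputStream(compressed.toByteArray()))) {
      IOUtils.copyBytes(in, restored, 64 * 1024, false);
    }

    System.out.println("round trip ok: "
        + Arrays.equals(input, restored.toByteArray()));
  }
}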

Chris Nauroth


On Tue, Feb 4, 2025 at 5:25 AM Ayush Saxena <ayush...@gmail.com> wrote:

> Thanks Sungwoo Park for sharing the details. I'm forwarding this to
> hdfs-dev@. I haven't had the chance to review the details in the
> ticket yet, but if you can reproduce the issue, I recommend creating
> an HDFS ticket and marking it as a blocker with the target versions
> set to the upcoming releases (3.4.2 & 3.5.0, if I am not mistaken).
>
> -Ayush
>
>
> On Mon, 3 Feb 2025 at 08:20, Sungwoo Park <c...@pl.postech.ac.kr> wrote:
> >
> > I reported a bug in ZStandardCodec in the Hadoop library. If you run
> > Hive with ZStandard compression for Tez intermediate data, you might be
> > affected by this bug.
> >
> > https://issues.apache.org/jira/browse/HDFS-14099
> >
> > The problem occurs when the input file is large (e.g., 25MB) and does
> > not compress well. (The zstd native library is fine, as zstd
> > successfully compresses and restores the same input data.) I think it
> > rarely occurs in practice, but we ran into this problem when testing
> > with 10TB TPC-DS data. A sample query for reproducing the problem is
> > (with hive.execution.mode=llap):
> >
> > select /*+ semi(store_returns, sr_ticket_number, store_sales, 34171240) */
> >    ss_quantity
> > from store_sales, store_returns
> > where ss_ticket_number = sr_ticket_number and
> > sr_returned_date_sk between 2451789 and 2451818
> > limit 100;
> >
> >
> > --- Sungwoo
>
