Hadoop 3.5.0 has been released [1], so it is no longer a blocker from the
Hadoop side, and we now need to upgrade to it in Hive. I have created a
ticket to track the upgrade [2].

-Ayush

[1] https://lists.apache.org/thread/7dtnbdqrgt30oszd1w1vo7k68z0n7r4b
[2] https://issues.apache.org/jira/browse/HIVE-29543

On Mon, 16 Mar 2026 at 04:32, Ayush Saxena <[email protected]> wrote:
>
> Thanks, folks, for the pointers on the performance testing. Let me
> discuss this internally and come back with something more concrete.
> One idea that comes to mind is that we are currently using LFS in our
> Docker images; instead, we could potentially use Apache Ozone there.
> They also publish Docker images, so we might be able to leverage
> those.
>
> -Ayush
>
> On Tue, 10 Mar 2026 at 12:31, László Bodor <[email protected]> wrote:
> >
> > Regarding performance benchmarking, we should have a way to test the actual 
> > upstream code. While many - or all - Hive distributors have their own ways 
> > of doing this, we as an open-source community don't. The main limitation is 
> > the testing setup, because our current single-image (HS2) or HS2+HMS Docker 
> > setup is not suitable for this purpose, even though it works wonderfully 
> > for quick local testing.
> > That's what's currently being addressed in the scope of 
> > https://issues.apache.org/jira/browse/HIVE-29492.
> >
> > Regards,
> > Laszlo Bodor
> >
> >
> > On Tue, 10 Mar 2026 at 07:38, kokila narayanan 
> > <[email protected]> wrote:
> >>
> >>
> >> Regarding the performance tracking initiative and Hive-Iceberg workloads, 
> >> one possible starting point could be leveraging the 1 Trillion Row 
> >> Challenge (1TRC) style benchmarks.
> >>
> >> The Impala community has already experimented with something along these 
> >> lines and they have even extended it to work with Iceberg tables as well:
> >> https://github.com/boroknagyz/impala-1trc
> >>
> >> The main query is a relatively simple aggregation query:
> >>
> >> SELECT station, min(measure), max(measure), avg(measure)
> >> FROM measurements_1trc
> >> GROUP BY station
> >> ORDER BY station;
> >>
> >> While this benchmark is quite simple and only tests a single type of 
> >> query, it could still be a good starting point. It does not cover the 
> >> wider variety of queries we usually see in Hive workloads (like joins, 
> >> filters, or more complex aggregations), but it is easy to reproduce and 
> >> run.
> >>
> >> With this setup, it could help us get an initial idea of how Hive performs 
> >> on very large Iceberg tables for large-scale scan and aggregation 
> >> workloads.
> >>
> >> I have experimented with this dataset for another feature so I can also 
> >> try running 1BRC/1TRC on Hive and share some initial numbers if that would 
> >> be useful for the release planning.
> >>
> >> Thanks,
> >>
> >> Kokila
> >>
> >>
> >> On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena <[email protected]> wrote:
> >>>
> >>> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I 
> >>> think we can reasonably wait for the final 3.5.0 release, and if time and 
> >>> luck favor us, we could even try giving JDK 25 a shot as well. From a 
> >>> timeline perspective, I don’t think we are too late yet.
> >>>
> >>> More broadly, my expectation—or perhaps wish—for the upcoming release 
> >>> would be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog 
> >>> related changes. Having these in the release would make it more 
> >>> compelling for users to upgrade, rather than it feeling like just another 
> >>> bug-fix release that gives the impression we are in KTLO mode. :-)
> >>>
> >>> As Attila also mentioned above regarding performance tracking, I would 
> >>> definitely like to push that initiative as part of this release. We may 
> >>> not have something perfect right away, but at least we should have a 
> >>> starting point. At the moment, we essentially have nothing in this area. 
> >>> We can always refine the strategy and improve the benchmarks in future 
> >>> releases, but it would be good to have something tangible that we can 
> >>> showcase.
> >>> Personally, I am inclined towards experimenting around Hive–Iceberg 
> >>> workloads, gathering numbers for specific use cases or queries, and 
> >>> drawing some comparisons.
> >>>
> >>> If anyone has already worked on something similar, or has ideas or 
> >>> proposals for how we could approach this, please do share.
> >>>
> >>> -Ayush
> >>>
> >>> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm curious about the remaining blockers. From my perspective,
> >>>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
> >>>> think it's possible to put it off until 4.4. HIVE-29415 requires
> >>>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
> >>>>
> >>>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
> >>>> need to keep using 3.4 for a while.
> >>>>
> >>>> If we release 4.3 now, I think we should upgrade the Iceberg library
> >>>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
> >>>> effort.
> >>>>
> >>>> Regards,
> >>>> Okumin
> >>>>
> >>>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor <[email protected]> 
> >>>> wrote:
> >>>> >
> >>>> > As to:
> >>>> >
> >>>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer 
> >>>> > Tez version in 4.5? From what I can see, a significant number of 
> >>>> > changes have recently landed in the repository.
> >>>> >
> >>>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major 
> >>>> > version milestone, we’re aiming to push more changes and are less 
> >>>> > afraid of breaking things. So unless there’s something blocking, I 
> >>>> > believe Hive 4.5 can continue to use Tez 0.10.5. My personal 
> >>>> > expectation for Tez 1.0.0 is "sometime later this year".
> >>>> >
> >>>> >
> >>>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena <[email protected]> wrote:
> >>>> >>
> >>>> >> Hi Attila,
> >>>> >> Regarding:
> >>>> >>
> >>>> >>> As you mentioned, Iceberg v3 is a major part of this release. I 
> >>>> >>> fully agree, and I think we should clearly highlight that Hive is 
> >>>> >>> one of the core engines supporting Iceberg v3. Potentially even 
> >>>> >>> earlier than Trino or other competitors. One thing I would like to 
> >>>> >>> draw attention to (coming from discussions with the Apache Impala 
> >>>> >>> team) is that the Vector Delete spec seems to have changed, with 
> >>>> >>> row-lineage becoming a prerequisite. As far as I remember, this is 
> >>>> >>> not yet implemented in Hive. If we want Hive to officially support 
> >>>> >>> Iceberg v3 with vector deletes, we should verify and address this 
> >>>> >>> gap. https://iceberg.apache.org/spec/#row-lineage
> >>>> >>
> >>>> >>
> >>>> >> -----
> >>>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg 
> >>>> >> V3 writes and Deletion Vectors are working correctly in Hive, even 
> >>>> >> with the latest Iceberg version. As far as I know, Iceberg V3 does 
> >>>> >> not allow committing a snapshot unless row IDs are populated. We also 
> >>>> >> have tests in place that cover writes and deletes for Iceberg V3.
> >>>> >>
> >>>> >> We don’t have anything explicit for row lineage because Hive relies 
> >>>> >> on Iceberg writers; we haven’t implemented custom writers. As a 
> >>>> >> result, the Iceberg layer is responsible for populating the row IDs 
> >>>> >> and the next row ID, and that seems to be working as expected.
> >>>> >>
> >>>> >> I tested this locally and verified the metadata files, which clearly 
> >>>> >> contain the row IDs. I’m attaching screenshots of the metadata for 
> >>>> >> reference.
> >>>> >>
> >>>> >> If Impala is observing unexpected behavior and there turns out to be 
> >>>> >> an issue with our implementation, they can report it via a ticket. 
> >>>> >> However, from a fundamentals point of view, this looks correct on the 
> >>>> >> Hive/Iceberg side.
> >>>> >>
> >>>> >> -Ayush
> >>>> >>
> >>>> >>
> >>>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko <[email protected]> 
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> Hi everyone,
> >>>> >>>
> >>>> >>> +1 on collecting the performance numbers.
> >>>> >>>
> >>>> >>> I’d like to propose a few additional items to consider:
> >>>> >>>
> >>>> >>> #1 REST Catalog HA and vended credentials support
> >>>> >>> - HIVE-29391,
> >>>> >>> - HIVE-29228
> >>>> >>>
> >>>> >>> #2 Federated Catalog support
> >>>> >>> - HIVE-28879
> >>>> >>>
> >>>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
> >>>> >>>
> >>>> >>> #4 New V3 items (that I am aware of)
> >>>> >>>
> >>>> >>> 1. VARIANT shredding:
> >>>> >>>   - HIVE-29287,
> >>>> >>>   - HIVE-29354
> >>>> >>>
> >>>> >>> 2. Z-order support for Iceberg tables:
> >>>> >>>   - HIVE-29132
> >>>> >>>
> >>>> >>> Best regards,
> >>>> >>> Denys
