Hadoop 3.5.0 has been released [1], so it is no longer a blocker from the Hadoop side; we now need to upgrade to it in Hive. I have created a ticket to track the upgrade [2].
-Ayush

[1] https://lists.apache.org/thread/7dtnbdqrgt30oszd1w1vo7k68z0n7r4b
[2] https://issues.apache.org/jira/browse/HIVE-29543

On Mon, 16 Mar 2026 at 04:32, Ayush Saxena <[email protected]> wrote:
>
> Thanx folks for the pointers on the performance testing. Let me
> discuss this internally and come back with something more concrete.
> One idea that comes to mind is that we are currently using LFS in our
> Docker images; instead, we could potentially use Apache Ozone there.
> They also publish Docker images, so we might be able to leverage
> those.
>
> -Ayush
>
> On Tue, 10 Mar 2026 at 12:31, László Bodor <[email protected]> wrote:
> >
> > Regarding performance benchmarking, we should have a way to test the actual
> > upstream code. While many - or all - Hive distributors have their own ways
> > of doing this, we as an open-source community don't. The main limitation is
> > the testing setup, because our current single-image (HS2) or HS2+HMS Docker
> > setup is not suitable for this purpose, even though it works wonderfully
> > for quick local testing.
> > That's what's currently being addressed in the scope of
> > https://issues.apache.org/jira/browse/HIVE-29492.
> >
> > Regards,
> > Laszlo Bodor
> >
> >
> > On Tue, 10 Mar 2026 at 07:38, kokila narayanan
> > <[email protected]> wrote:
> >>
> >>
> >> Regarding the performance tracking initiative and Hive-Iceberg workloads,
> >> one possible starting point could be leveraging the 1 Trillion Row
> >> Challenge (1TRC) style benchmarks.
> >>
> >> The Impala community has already experimented with something along these
> >> lines and they have even extended it to work with Iceberg tables as well:
> >> https://github.com/boroknagyz/impala-1trc
> >>
> >> The main query is a relatively simple aggregation query:
> >>
> >> SELECT station, min(measure), max(measure), avg(measure)
> >> FROM measurements_1trc
> >> GROUP BY station
> >> ORDER BY station;
> >>
> >> While this benchmark is quite simple and only tests a single type of
> >> query, it could still be a good starting point. It does not cover the
> >> wider variety of queries we usually see in Hive workloads (like joins,
> >> filters, or more complex aggregations), but it is easy to reproduce and
> >> run.
> >>
> >> With this setup, it could help us get an initial idea of how Hive performs
> >> on very large Iceberg tables for large-scale scan and aggregation
> >> workloads.
> >>
> >> I have experimented with this dataset for another feature so I can also
> >> try running 1BRC/1TRC on Hive and share some initial numbers if that would
> >> be useful for the release planning.
> >>
> >> Thanks,
> >>
> >> Kokila
> >>
> >>
> >> On Tue, Mar 10, 2026 at 11:43 AM Ayush Saxena <[email protected]> wrote:
> >>>
> >>> Hadoop 3.5.0 is currently in the RC stage (RC0 is already available). I
> >>> think we can reasonably wait for the final 3.5.0 release, and if time and
> >>> luck favor us, we could even try giving JDK 25 a shot as well. From a
> >>> timeline perspective, I don’t think we are too late yet.
> >>>
> >>> More broadly, my expectation (or perhaps wish) for the upcoming release
> >>> would be to include Hadoop 3.5 + Iceberg V3 + JDK 25 + REST Catalog
> >>> related changes. Having these in the release would make it more
> >>> compelling for users to upgrade, rather than it feeling like just another
> >>> bug-fix release that gives the impression we are in KTLO mode. :-)
> >>>
> >>> As Attila also mentioned above regarding performance tracking, I would
> >>> definitely like to push that initiative as part of this release. We may
> >>> not have something perfect right away, but at least we should have a
> >>> starting point. At the moment, we essentially have nothing in this area.
> >>> We can always refine the strategy and improve the benchmarks in future
> >>> releases, but it would be good to have something tangible that we can
> >>> showcase.
> >>> Personally, I am inclined towards experimenting around Hive-Iceberg
> >>> workloads, gathering numbers for specific use cases or queries, and
> >>> drawing some comparisons.
> >>>
> >>> If anyone has already worked on something similar, or has ideas or
> >>> proposals for how we could approach this, please do share.
> >>>
> >>> -Ayush
> >>>
> >>> On Mon, 9 Feb 2026 at 14:13, Shohei Okumiya <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm curious about the remaining blockers. From my perspective,
> >>>> HIVE-29445 and HIVE-29415 might be needed if we include Iceberg v3. I
> >>>> think it's possible to put it off until 4.4. HIVE-29415 requires
> >>>> Iceberg 1.10.2 or 1.11.0 if I understand correctly.
> >>>>
> >>>> Hadoop 3.5 is nice, but it hasn't been released yet. Most likely, we
> >>>> need to keep using 3.4 for a while.
> >>>>
> >>>> If we release 4.3 now, I think we should upgrade the Iceberg library
> >>>> from 1.10.0 to 1.10.1, which has some bug fixes and is not a big
> >>>> effort.
> >>>>
> >>>> Regards,
> >>>> Okumin
> >>>>
> >>>> On Thu, Jan 22, 2026 at 7:44 PM László Bodor <[email protected]>
> >>>> wrote:
> >>>> >
> >>>> > As to:
> >>>> >
> >>>> > #4 Hadoop 3.5 support would be great. Do we plan to include a newer
> >>>> > Tez version in 4.5? From what I can see, a significant number of
> >>>> > changes have recently landed in the repository.
> >>>> >
> >>>> > I don’t think Tez will reach 1.0.0 before Hive 4.5. Given the major
> >>>> > version milestone, we’re aiming to push more changes and are less
> >>>> > afraid of breaking things. So unless there’s something blocking, I
> >>>> > believe Hive 4.5 can continue to use Tez 0.10.5. My personal
> >>>> > expectation for Tez 1.0.0 is "sometime later this year".
> >>>> >
> >>>> >
> >>>> > On Tue, 20 Jan 2026 at 15:45, Ayush Saxena <[email protected]> wrote:
> >>>> >>
> >>>> >> Hi Attila,
> >>>> >> Regarding:
> >>>> >>
> >>>> >>> As you mentioned, Iceberg v3 is a major part of this release. I
> >>>> >>> fully agree, and I think we should clearly highlight that Hive is
> >>>> >>> one of the core engines supporting Iceberg v3. Potentially even
> >>>> >>> earlier than Trino or other competitors. One thing I would like to
> >>>> >>> put attention to (coming from discussions with the Apache Impala
> >>>> >>> team) is that the Vector Delete spec seems to have changed, with
> >>>> >>> row-lineage becoming a prerequisite. As far as I remember, this is
> >>>> >>> not yet implemented in Hive. If we want Hive to officially support
> >>>> >>> Iceberg v3 with vector deletes, we should verify and address this
> >>>> >>> gap. https://iceberg.apache.org/spec/#row-lineage
> >>>> >>
> >>>> >>
> >>>> >> -----
> >>>> >> I’m not entirely sure what the issue is on the Impala side. Iceberg
> >>>> >> V3 writes and Deletion Vectors are working correctly in Hive, even
> >>>> >> with the latest Iceberg version. As far as I know, Iceberg V3 does
> >>>> >> not allow committing a snapshot unless row IDs are populated. We also
> >>>> >> have tests in place that cover writes and deletes for Iceberg V3.
> >>>> >>
> >>>> >> We don’t have anything explicit for row lineage because Hive relies
> >>>> >> on Iceberg writers; we haven’t implemented custom writers. As a
> >>>> >> result, the Iceberg layer is responsible for populating the row IDs
> >>>> >> and the next row ID, and that seems to be working as expected.
> >>>> >>
> >>>> >> I tested this locally and verified the metadata files, which clearly
> >>>> >> contain the row IDs. I’m attaching screenshots of the metadata for
> >>>> >> reference.
> >>>> >>
> >>>> >> If Impala is observing unexpected behavior and there turns out to be
> >>>> >> an issue with our implementation, they can report it via a ticket.
> >>>> >> However, from a fundamentals point of view, this looks correct on the
> >>>> >> Hive/Iceberg side.
> >>>> >>
> >>>> >> -Ayush
> >>>> >>
> >>>> >>
> >>>> >> On Tue, 20 Jan 2026 at 19:24, Denys Kuzmenko <[email protected]>
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> Hi everyone,
> >>>> >>>
> >>>> >>> +1 on collecting the performance numbers.
> >>>> >>>
> >>>> >>> I’d like to propose a few additional items to consider:
> >>>> >>>
> >>>> >>> #1 REST Catalog HA and vended credentials support
> >>>> >>> - HIVE-29391,
> >>>> >>> - HIVE-29228
> >>>> >>>
> >>>> >>> #2 Federated Catalog support
> >>>> >>> - HIVE-28879
> >>>> >>>
> >>>> >>> #3 Kubernetes manifests / Helm chart for Apache Hive deployment
> >>>> >>>
> >>>> >>> #4 New V3 items (that I am aware of)
> >>>> >>>
> >>>> >>> 1. VARIANT shredding:
> >>>> >>> - HIVE-29287,
> >>>> >>> - HIVE-29354
> >>>> >>>
> >>>> >>> 2. Z-order support for Iceberg tables:
> >>>> >>> - HIVE-29132
> >>>> >>>
> >>>> >>> Best regards,
> >>>> >>> Denys
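
PS: For anyone who wants to sanity-check results from small runs of the 1TRC-style query quoted earlier in this thread, here is a minimal Python sketch of the same per-station min/max/avg aggregation. It is only an illustration, not part of any benchmark harness: the sample rows and the `aggregate` helper are made up for this example, and a real run would read from the `measurements_1trc` table instead.

```python
from collections import defaultdict

# Hypothetical sample rows (station, measure); not taken from any real dataset.
rows = [
    ("Budapest", 12.0),
    ("Budapest", 8.0),
    ("Delhi", 30.0),
    ("Delhi", 34.0),
    ("Tokyo", 15.0),
]


def aggregate(rows):
    """Return [(station, min, max, avg)] ordered by station,
    mirroring the GROUP BY station / ORDER BY station query."""
    # Per-station accumulator: [min, max, sum, count]
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for station, measure in rows:
        s = stats[station]
        s[0] = min(s[0], measure)
        s[1] = max(s[1], measure)
        s[2] += measure
        s[3] += 1
    return [
        (station, s[0], s[1], s[2] / s[3])
        for station, s in sorted(stats.items())
    ]


for result in aggregate(rows):
    print(result)
```

Comparing the output of something like this against the engine's result on a small sample is a cheap correctness check before looking at timings on the full dataset.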
