As far as I know, Hive supports consistent split generation, so you can
configure it so that tasks reading the same input split are scheduled on
the same daemon (see hive.tez.input.generate.consistent.splits).
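
To illustrate why a consistent split-to-daemon mapping helps cache reuse,
here is a minimal sketch. It is not Hive's actual split-location code: it
uses rendezvous hashing as a stand-in technique, and the daemon names, the
split identifier, and the pick_daemon helper are all hypothetical.

```python
import hashlib


def pick_daemon(split_id: str, daemons: list[str]) -> str:
    """Pick a daemon for a split deterministically (rendezvous hashing).

    Hypothetical helper for illustration only; not part of Hive.
    """
    def score(daemon: str) -> int:
        # Hash the (split, daemon) pair; the daemon with the top score wins.
        digest = hashlib.sha256(f"{split_id}|{daemon}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(daemons, key=score)


# Hypothetical daemon names and split identifier (file path plus offset).
daemons = ["llap-0", "llap-1", "llap-2"]
split = "hdfs:///warehouse/store_sales/part-00000.orc:0"

first = pick_daemon(split, daemons)
# A later query sees the daemons in a different order but gets the same
# answer, so the split is read where it is most likely already cached.
second = pick_daemon(split, list(reversed(daemons)))
print(first, second)
```

Because the choice depends only on the split and the set of daemons,
repeated queries over the same data keep landing on the same daemon
instead of caching the same split on several nodes.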

Sungwoo

On Mon, Sep 8, 2025 at 6:23 PM lisoda <lis...@yeah.net> wrote:

> I have a question: LLAP caching currently does not consider cache
> locality. For the same SQL, compute tasks are not scheduled, as far as
> possible, onto nodes that already have the data cached. This may result
> in the same copy of data being cached N times across multiple nodes. Is
> that really an advantage?
>
>
> ---- Replied Message ----
> From Butao Zhang <zhangbu...@apache.org>
> Date 09/08/2025 17:18
> To user@hive.apache.org
> Cc
> Subject Re: Parameter-tuning Hive-LLAP
> Good point! I think we can run the TPC-DS benchmark multiple times and
> wait until the LLAP cache has sufficiently cached the data onto the SSD.
> Then, we can observe whether the test performance improves. If I remember
> correctly, LLAP has a page where you can check the cache hit rate.
>
> Thanks,
> Butao Zhang
>
> On 2025/09/08 09:12:59 Denys Kuzmenko wrote:
> > hi Sungwoo,
> >
> > I don’t believe the TPC-DS benchmark is the best way to demonstrate the
> > advantages of Hive LLAP’s distributed cache.
> > TPC-DS is primarily designed to measure query optimization and overall
> > system performance across a wide variety of complex workloads, but it
> > doesn’t necessarily highlight scenarios where LLAP’s in-memory caching of
> > frequently accessed data provides clear benefits.
> > A more targeted benchmark or workload that emphasizes repeated access to
> > the same datasets would be a better fit to showcase the strengths of
> > LLAP’s distributed caching capabilities.
> >
> > Regards,
> > Denys
> >
>
