In my opinion, Spark is not slow in general, but it is not the fastest
execution engine. On the 10TB TPC-DS sequential benchmark, it is quite slow
only because of a few outlier queries (like query 24-1 and 24-2). Spark is
actually quite efficient in managing concurrent (interactive) workloads.

Sungwoo


On Mon, Sep 8, 2025 at 6:43 PM Attila Turoczy <aturo...@cloudera.com> wrote:

> Hi,
> In practice, LLAP should be faster than Tez, thanks to caching,
> persistent executors, and in some cases vectorized processing. There is
> also no YARN overhead for managing resources between neighbors.
>
> I’m really happy to see these numbers, they’re very impressive. Excellent
> work, Sungwoo Park. I’m not too concerned about whether the cache was
> enabled or not; what matters is the end result. Of course, if Tez, Spark,
> or any other engine has similar capabilities, then those should be enabled
> for a fair and equal comparison.
>
> I know TPC-DS isn’t perfect for every aspect of comparison, but in this
> case it feels like a good sneak peek. By the way, why is Spark 4 the
> slowest here? That’s a bit surprising, especially given all the “religion
> wars” about Spark being the fastest engine.
>
> -Attila
>
> On Mon, Sep 8, 2025 at 11:22 AM lisoda <lis...@yeah.net> wrote:
>
>> I have a question. LLAP caching currently does not consider cache
>> locality: for the same SQL, compute tasks are not preferentially
>> scheduled on nodes that already have the data cached. This may result in
>> the same data being cached N times across multiple nodes. Is that really
>> an advantage?
>>
>>
>> ---- Replied Message ----
>> From Butao Zhang <zhangbu...@apache.org>
>> Date 09/08/2025 17:18
>> To user@hive.apache.org
>> Cc
>> Subject Re: Parameter-tuning Hive-LLAP
>> Good point! I think we can run the TPC-DS benchmark multiple times and
>> wait until the LLAP cache has sufficiently cached the data onto the SSD.
>> Then, we can observe whether the test performance improves. If I remember
>> correctly, LLAP has a page where you can check the cache hit rate.
>>
>> Thanks,
>> Butao Zhang
>>
>> On 2025/09/08 09:12:59 Denys Kuzmenko wrote:
>> > Hi Sungwoo,
>> >
>> > I don’t believe the TPC-DS benchmark is the best way to demonstrate the
>> > advantages of Hive LLAP’s distributed cache.
>> > TPC-DS is primarily designed to measure query optimization and overall
>> > system performance across a wide variety of complex workloads, but it
>> > doesn’t necessarily highlight scenarios where LLAP’s in-memory caching of
>> > frequently accessed data provides clear benefits.
>> > A more targeted benchmark or workload that emphasizes repeated access
>> > to the same datasets would be a better fit to showcase the strengths of
>> > LLAP’s distributed caching capabilities.
>> >
>> > Regards,
>> > Denys
>> >
>>
>
