In my opinion, Spark is not slow in general, but it is not the fastest execution engine. On the 10TB TPC-DS sequential benchmark, it is quite slow only because of a few outlier queries (such as queries 24-1 and 24-2). Spark is actually quite efficient at managing concurrent (interactive) workloads.

Sungwoo

On Mon, Sep 8, 2025 at 6:43 PM Attila Turoczy <aturo...@cloudera.com> wrote:

> Hi,
>
> In practice, LLAP should be faster than Tez. Thanks to the nature of
> caching, persistent executors, and in some cases due to vectorized
> processing, there's no YARN overhead managing resources between neighbors.
>
> I’m really happy to see these numbers, they’re very impressive. Excellent
> work, Sungwoo Park. I’m not too concerned about whether the cache was
> enabled or not; what matters is the end result. Of course, if Tez, Spark,
> or any other engine has similar capabilities, then those should be enabled
> for a fair and equal comparison.
>
> I know TPC-DS isn’t perfect for every aspect of comparison, but in this
> case it feels like a good sneak peek. By the way, why is Spark 4 the
> slowest here? That’s a bit surprising, especially given all the “religion
> wars” about Spark being the fastest engine.
>
> -Attila
>
> On Mon, Sep 8, 2025 at 11:22 AM lisoda <lis...@yeah.net> wrote:
>
>> I have a question, LLAP caching currently does not consider cache
>> locality. The same sql, compute tasks are currently not scheduled as far as
>> possible to nodes that already have data cached. This may result in the
>> same copy of data being repeatedly cached N times by multiple nodes. Is
>> that really an advantage?
>>
>> ---- Replied Message ----
>> From: Butao Zhang <zhangbu...@apache.org>
>> Date: 09/08/2025 17:18
>> To: user@hive.apache.org
>> Cc:
>> Subject: Re: Parameter-tuning Hive-LLAP
>>
>> Good point! I think we can run the TPC-DS benchmark multiple times and
>> wait until the LLAP cache has sufficiently cached the data onto the SSD.
>> Then, we can observe whether the test performance improves. If I remember
>> correctly, LLAP has a page where you can check the cache hit rate.
>>
>> Thanks,
>> Butao Zhang
>>
>> On 2025/09/08 09:12:59 Denys Kuzmenko wrote:
>> > hi Sungwoo,
>> >
>> > I don’t believe the TPC-DS benchmark is the best way to demonstrate the
>> > advantages of Hive LLAP’s distributed cache.
>> > TPC-DS is primarily designed to measure query optimization and overall
>> > system performance across a wide variety of complex workloads, but it
>> > doesn’t necessarily highlight scenarios where LLAP’s in-memory caching of
>> > frequently accessed data provides clear benefits.
>> > A more targeted benchmark or workload that emphasizes repeated access
>> > to the same datasets would be a better fit to showcase the strengths of
>> > LLAP’s distributed caching capabilities.
>> >
>> > Regards,
>> > Denys