Hi,
Can you please provide some more info about your Spark cluster setup? You
mentioned Hadoop as the underlying storage. I assume that there is data
locality between your Spark cluster and the
underlying Hadoop.
In your SQL statement below
select count(*) from (
select *distinct* c_l
Using more cores finishes faster, as expected. My guess is that you are
getting more parallelism overall, and that speeds things up. However, with
more tasks executing concurrently on one machine, you are getting some
contention, so it's possible more tasks are taking longer - a little I/O
contention, C
Hello,
I recently used Spark 3.2 to run a test based on the TPC-DS dataset; the
entire TPC-DS data scale is 1 TB (located in HDFS). However, I encountered a problem
that I couldn't understand, and I hope to get your help.
The SQL statement tested is as follows:
select count(*) from (