Hello,
I recently ran a test with Spark 3.2 on the TPC-DS dataset; the total TPC-DS
data size is 1 TB, stored in HDFS. I ran into a behavior that I cannot explain,
and I hope to get your help.
The SQL statement tested is as follows:
select count(*) from (
  select distinct c_last_name, c_first_name, d_date
  from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
    and store_sales.ss_customer_sk = customer.c_customer_sk
    and d_month_seq between 1200 and 1200 + 11
) hot_cust
limit 100;
Three tables are involved; their sizes are store_sales (387.97 GB),
date_dim (9.84 GB), and customer (1.52 GB).
When submitting the application, the driver and executor parameters are set as
follows:
DRIVER_OPTIONS="--master spark://xx.xx.xx.xx:7078 --total-executor-cores 48
--conf spark.driver.memory=100g --conf spark.default.parallelism=100 --conf
spark.sql.adaptive.enabled=false"
EXECUTOR_OPTIONS="--executor-cores 12 --conf spark.executor.memory=50g --conf
spark.task.cpus=1 --conf spark.sql.shuffle.partitions=100 --conf
spark.sql.crossJoin.enabled=true"
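For reference, a minimal sketch of how such option groups can be combined into
one submit command; the spark-sql entry point, database name, and query file
name below are placeholders rather than my exact command line:

# DRIVER_OPTIONS / EXECUTOR_OPTIONS are the variables defined above;
# tpcds_1t and count_distinct_customers.sql are placeholder names
$SPARK_HOME/bin/spark-sql \
  $DRIVER_OPTIONS \
  $EXECUTOR_OPTIONS \
  --database tpcds_1t \
  -f count_distinct_customers.sql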
The Spark cluster runs in standalone mode on a single physical node with one
worker (56 cores and 376 GB of memory in total). I kept the number of executors
fixed at 4 and varied executor-cores over 2, 4, 8, and 12, so
total-executor-cores was 8, 16, 32, and 48 respectively. The corresponding
application times were 96s, 66s, 50s, and 49s; once total-executor-cores
reaches 32 or more, the total time barely changes.
I monitored CPU utilization and I/O and found no bottleneck. I then looked at
the task times in the longest stage of each run in the Spark web UI (the
longest stage in all four test cases is stage 4; its duration is 48s, 28s, 23s,
and 22s respectively). As the number of CPU cores increases, the individual
task times in stage 4 also increase: the average task time is 3s, 4s, 5s, and
7s. In addition, on each executor the first batch of tasks takes longer than
the subsequent batches, and the more cores there are, the bigger that gap
becomes (the per-task figures are summarized in the table; a sketch of how the
same metrics can be pulled from the REST API follows it):
Test-Case-ID  total-executor-cores  executor-cores  total application time  stage4 time  average task time  first batch of tasks  remaining batches
1             8                     2               96s                     48s          3s                 5-6s                  3-4s
2             16                    4               66s                     28s          4s                 7-8s                  ~4s
3             32                    8               50s                     23s          5s                 10-12s                ~5s
4             48                    12              49s                     22s          7s                 14-15s                ~6s
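The per-task numbers above were read from the Spark web UI. Something like the
following should return the same per-task data for stage 4 through the
monitoring REST API while the application is running (or via the history
server); the driver host, application ID, and stage attempt ID are placeholders
for my environment:

# list the tasks of stage 4, attempt 0, for a given application
curl "http://<driver-host>:4040/api/v1/applications/<app-id>/stages/4/0/taskList?length=500"

Each entry in the returned JSON carries fields such as launchTime, duration,
schedulerDelay, and (under taskMetrics) executorDeserializeTime and jvmGcTime,
which makes it easy to compare the first wave of tasks on each executor with
the later waves.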
This result confuses me: why do individual tasks take longer after the number
of cores is increased? I compared the amount of data processed by each task
across the different cases, and it is the same in all of them. And why does the
overall execution time stop improving once the number of cores passes a
certain point?
Looking forward to your reply.
15927907...@163.com