Hello, I recently ran into a problem that confuses me when using Spark 3.0.

I used the TPCx-BB dataset (200 GB) and executed Query#5 against it. The SQL
reads about 65.7 GB of table data.

Query#5 is defined here:
https://github.com/NVIDIA/spark-rapids/blob/branch-0.3/integration_tests/src/main/scala/com/nvidia/spark/rapids/tests/tpcxbb/TpcxbbLikeSpark.scala
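
For context, the Tpcxbb-1.0.jar is just a small driver that registers the
generated parquet tables and runs the query text from that file. A rough sketch
of that pattern (the table names, file layout, and the way the SQL text is
loaded are my assumptions, not taken from the linked file):

import org.apache.spark.sql.SparkSession

object TpcxbbRunner {
  def main(args: Array[String]): Unit = {
    val dataDir  = args(0)   // e.g. /user/root/benchmarks/data-200g/data
    val queryNum = args(1)   // e.g. "5"

    val spark = SparkSession.builder().appName(s"tpcxbb-q$queryNum").getOrCreate()

    // Register the input tables as temporary views (table names are assumptions;
    // the real registration code lives in TpcxbbLikeSpark.scala).
    Seq("web_clickstreams", "item", "customer", "customer_demographics").foreach { t =>
      spark.read.parquet(s"$dataDir/$t").createOrReplaceTempView(t)
    }

    // Load the SQL text of the query (placeholder source) and execute it;
    // this is what triggers the ~65.7 GB scan described below.
    val querySql = scala.io.Source.fromFile(s"q$queryNum.sql").mkString
    spark.sql(querySql).write.mode("overwrite").parquet(s"$dataDir/q$queryNum-output")

    spark.stop()
  }
}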

---------------------------------------------------------

The execution script is:
#! /bin/bash
$Spark_Home/bin/spark-submit \
--class org.tpcxBB \
--master spark://10.3.68.116:7077 \
--executor-memory 40g \
--total-executor-cores 48 \
--executor-cores 8 \
--conf spark.task.cpus=1 \
--conf spark.driver.memory=40g \
--conf spark.driver.maxResultSize=60g \
--conf spark.executor.memory=40g \
--conf spark.sql.shuffle.partitions=100 \
/home/runJars/Tpcxbb-1.0.jar "/user/root/benchmarks/data-200g/data" "5"
---------------------------------------------------------
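
A side note on the flags above: in standalone mode, --total-executor-cores 48
together with --executor-cores 8 should give 48 / 8 = 6 executors (two per
machine). To make sure each run really got what was requested, a small check
like the following could be run at the start of the job or from spark-shell
(where spark is already defined); it only uses the public Spark status API:

// Optional sanity check: log what the application was actually granted,
// so the 32-, 48- and 96-core runs can be compared on equal footing.
val sc = spark.sparkContext
val execEntries = sc.statusTracker.getExecutorInfos   // includes an entry for the driver
println(s"executor entries (incl. driver): ${execEntries.length}")
println(s"hosts: ${execEntries.map(_.host()).distinct.mkString(", ")}")
println(s"defaultParallelism: ${sc.defaultParallelism}")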


There are three machines in the Spark cluster, each with 56 CPU cores and 360 GB
of memory allocated.
But what is strange is that when I increase the number of CPU cores, the overall
execution time of Query#5 decreases (as shown on the history server web UI), yet
the average time of the tasks responsible for reading the data (510 tasks in
total) increases significantly:
When cpu cores = 32, the average task time is 7s
When cpu cores = 48, the average task time is 17s
When cpu cores = 96, the average task time is 20s
Looking into the task logs, I found that for reading the same data file, the
task time increases with the number of CPU cores.
What is the reason for this?

Next, when I kept the total number of executors unchanged and only increased the
CPU cores, the result was the same. The executor memory and driver memory are
sufficient in all cases.
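
If it helps to narrow this down, the same per-task behaviour can probably be
isolated with a scan-only job that reads one of the input tables and does
nothing else; a minimal sketch (run from spark-shell, and the web_clickstreams
table name and path suffix are my assumptions):

// Scan-only reproduction: time just the parquet read that the 510 read tasks
// perform, with no joins or shuffles, under each core configuration.
val clicks = spark.read.parquet("/user/root/benchmarks/data-200g/data/web_clickstreams")
spark.time {
  // the built-in "noop" sink (Spark 3.0+) reads everything and discards it
  clicks.write.format("noop").mode("overwrite").save()
}
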
Can you explain the reason for this?
Thank you~
Excuse me, I sent the same email through 15927907...@163.com on Friday, and it
showed that sending to d...@spark.apache.org was successful. Why can't I see
the email I sent at http://apache-spark-developers-list.1001551.n3.nabble.com?
