Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra Sat, 27 Jun 2020 16:58:39 -0700

I have a large number of JSON files that Spark 2.4 can read in 36 seconds, but
Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis,
it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone
have any idea what is going on? Is there a configuration problem with
Spark 3.0?


Here are the details:

*Spark 2.4*

Summary Metrics for 2203 Completed Tasks
<http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>

Metric    Min     25th percentile  Median  75th percentile  Max
Duration  0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
GC Time   0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms

Aggregated Metrics by Executor

Executor ID:      driver
Address:          10.0.0.8:49159
Task Time:        36 s
Total Tasks:      2203
Failed Tasks:     0
Killed Tasks:     0
Succeeded Tasks:  2203
Blacklisted:      false


*Spark 3.0*

Summary Metrics for 8 Completed Tasks
<http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>

Metric                Min               25th percentile   Median            75th percentile   Max
Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
GC Time               3 s               3 s               3 s               4 s               4 s
Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624

Aggregated Metrics by Executor

Executor ID:           driver
Address:               10.0.0.8:50224
Task Time:             33 min
Total Tasks:           8
Failed Tasks:          0
Killed Tasks:          0
Succeeded Tasks:       8
Blacklisted:           false
Input Size / Records:  136.1 MiB / 451999
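
For a rough sense of scale, the two aggregated Task Time figures can be compared directly. The sketch below is only arithmetic on the numbers reported above; the variable names are illustrative, and it assumes that total task time is a fair basis for comparison, which it may not be, since the two stages may not be doing the same work.

```python
# Figures copied from the "Aggregated Metrics by Executor" rows above.
spark24_task_time_s = 36         # Spark 2.4: 36 s across 2203 tasks
spark30_task_time_s = 33 * 60    # Spark 3.0: 33 min across 8 tasks
spark30_tasks = 8
spark30_input_mib = 136.1        # Spark 3.0 total input size

# Ratio of total task time (assumption: both figures cover the same read).
slowdown = spark30_task_time_s / spark24_task_time_s

# Mean task duration and per-task-second throughput on Spark 3.0.
mean_task_30_s = spark30_task_time_s / spark30_tasks
throughput_mib_per_s = spark30_input_mib / spark30_task_time_s

print(f"task-time ratio (3.0 / 2.4): {slowdown:.0f}x")
print(f"mean Spark 3.0 task: {mean_task_30_s / 60:.1f} min")
print(f"Spark 3.0 throughput: {throughput_mib_per_s:.3f} MiB per task-second")
```

By this measure the total task time differs by a factor of about 55, and the mean Spark 3.0 task duration (about 4.1 min) matches the reported median.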


The DAGs are also different:

Spark 2.4 DAG

[image: Screenshot 2020-06-27 16.30.26.png]

Spark 3.0 DAG

[image: Screenshot 2020-06-27 16.32.32.png]
