I have a large number of JSON files that Spark 2.4 reads in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same files. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there a configuration problem with Spark 3.0?
Here are the details:

*Spark 2.4*

Summary Metrics for 2203 Completed Tasks
<http://10.0.0.8:4040/stages/stage/?id=0&attempt=0#tasksTitle>

Metric    Min     25th percentile  Median  75th percentile  Max
Duration  0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
GC Time   0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms

Aggregated Metrics by Executor

Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted
driver       10.0.0.8:49159  36 s       2203         0             0             2203             false

*Spark 3.0*

Summary Metrics for 8 Completed Tasks
<http://10.0.0.8:4040/stages/stage/?id=1&attempt=0&task.eventTimelinePageNumber=1&task.eventTimelinePageSize=47#tasksTitle>

Metric                Min               25th percentile   Median            75th percentile   Max
Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
GC Time               3 s               3 s               3 s               4 s               4 s
Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624

Aggregated Metrics by Executor

Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted  Input Size / Records
driver       10.0.0.8:50224  33 min     8            0             0             8                false        136.1 MiB / 451999

The DAG is also different.

Spark 2.4 DAG
[image: Screenshot 2020-06-27 16.30.26.png]

Spark 3.0 DAG
[image: Screenshot 2020-06-27 16.32.32.png]
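To put the gap in numbers, here is a quick back-of-the-envelope sketch in Python using only the "Task Time" and record counts quoted above. The variable and function names are made up for illustration. Note that these are the task times reported by the UI (the 33 min is consistent with 8 tasks at roughly 4.1 min each), not wall-clock timings I measured separately.

```python
# Figures copied from the "Aggregated Metrics by Executor" rows above.
# All names here are illustrative, not part of any Spark API.
spark24_task_time_s = 36         # Spark 2.4: 36 s of task time over 2203 tasks
spark30_task_time_s = 33 * 60    # Spark 3.0: 33 min of task time over 8 tasks
spark30_records = 451999         # Spark 3.0: total records read


def slowdown(old_s, new_s):
    """Return how many times longer the new task time is than the old one."""
    return new_s / old_s


def records_per_second(records, task_time_s):
    """Return records handled per second of task time."""
    return records / task_time_s


ratio = slowdown(spark24_task_time_s, spark30_task_time_s)
rate = records_per_second(spark30_records, spark30_task_time_s)

print(f"Spark 3.0 task time is about {ratio:.0f}x that of Spark 2.4")
print(f"Spark 3.0 handles about {rate:.0f} records per second of task time")
```

So the reported task time is roughly 55 times higher in Spark 3.0, at about 228 records per second of task time.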
