Hi Luca, I'm collecting the logical and physical plans so that they can help find the root cause of this issue.
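A minimal sketch of how the plans could be captured in spark-shell on each Spark version, so the two outputs can be diffed. The view name `demo`, the stand-in query, and the output file name are hypothetical placeholders (the real Q4 text is not reproduced here), and the local master is only an assumption so that the snippet also runs outside a cluster.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists and getOrCreate() reuses it;
// the local master is an assumption for standalone runs only.
val spark = SparkSession.builder().master("local[*]").appName("plan-capture").getOrCreate()

// Hypothetical stand-in query: replace `query` with the real Q4 SQL text.
spark.range(0, 1000).createOrReplaceTempView("demo")
val query = "SELECT id % 10 AS k, count(*) AS n FROM demo GROUP BY id % 10"

val df = spark.sql(query)

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df.explain(true)

// The same plans as a string, written to a per-version file for diffing.
val plans: String = df.queryExecution.toString
val outFile = s"plans-spark-${spark.version}.txt"
Files.write(Paths.get(outFile), plans.getBytes("UTF-8"))
```

Running this once under Spark 2.4.5 and once under Spark 3.1.x should give two text files whose physical plans can be compared side by side.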

On Mon, 20 Dec 2021, 16:46 Luca Canali, <luca.can...@cern.ch> wrote:

> Hi Senthil,
>
> I have just run a couple of quick tests for TPCDS Q4, using the TPCDS
> schema created at scale 1500 that I have on a Hadoop/YARN cluster, and was
> not able to reproduce the difference in execution time between Spark 2 and
> Spark 3 that you report in your mail.
>
> This is the Spark config I used:
>
> bin/spark-shell --master yarn --driver-memory 8g --executor-cores 10
> --executor-memory 50g --conf spark.dynamicAllocation.enabled=false
> --num-executors 20
>
> This is how I ran the tests:
>
> ```
> val path="/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"
>
> val tables=List("catalog_returns","catalog_sales","inventory","store_returns","store_sales","web_returns","web_sales",
>   "call_center","catalog_page","customer","customer_address","customer_demographics","date_dim","household_demographics","income_band","item","promotion","reason","ship_mode","store","time_dim","warehouse","web_page","web_site")
>
> for (t <- tables) {
>   println(s"Creating temporary view $t")
>   spark.read.parquet(path + t).createOrReplaceTempView(t)
> }
>
> val q4="""…"""
> // SQL from https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q4.sql
>
> spark.time(sql(q4).collect) // note q4 result set is only 100 rows
> ```
>
> Spark 2.4.5:
> Time taken: 256812 ms
> Time taken: 226571 ms
> Time taken: 305508 ms
>
> Spark 3.1.2:
> spark.time(sql(q4).collect)
> Time taken: 235356 ms
> Time taken: 236284 ms
>
> Best,
> Luca
>
> *From:* Senthil Kumar <sen...@gmail.com>
> *Sent:* Monday, December 20, 2021 10:20
> *To:* Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>
> *Cc:* dev <dev@spark.apache.org>
> *Subject:* Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
>
> Also, we checked that we have already backported
> https://issues.apache.org/jira/browse/SPARK-33557.
>
> On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar <sen...@gmail.com> wrote:
>
> @abhishek. We use Spark 3.1*
>
> On Mon, 20 Dec 2021, 09:50 Rao, Abhishek (Nokia - IN/Bangalore), <abhishek....@nokia.com> wrote:
>
> Hi Senthil,
>
> Which version of Spark 3 are we using? We had this kind of observation
> with Spark 3.0.2 and 3.1.x, but then we figured out that we had configured
> a large value for spark.network.timeout, and this value was not taking
> effect in all releases prior to 3.0.2.
>
> This was fixed as part of
> https://issues.apache.org/jira/browse/SPARK-33557. Because we had
> configured a large value for spark.network.timeout, TPCDS queries took a
> long time when run with Spark 3.0.2 and 3.1.x. Once we corrected it, we
> observed that the queries executed much faster.
>
> Thanks and Regards,
> Abhishek
>
> *From:* Senthil Kumar <sen...@gmail.com>
> *Sent:* Sunday, December 19, 2021 11:58 PM
> *To:* dev <dev@spark.apache.org>
> *Subject:* Spark 3 is Slower than Spark 2 for TPCDS Q04 query.
>
> Hi All,
>
> We are comparing Spark 2.4.5 and Spark 3 (without enabling Spark 3's
> additional features) with TPCDS queries and found that Spark 3's
> performance is reduced by at least 30-40% compared to Spark 2.4.5.
>
> E.g.
> Data size used: 1 TB
>
> Spark 2.4.5 finishes Q4 in 1.5 min, but Spark 3.* takes at least 2.5 min.
>
> Note: We tested this in the same cluster with the same size of data, and
> we ensured that the parameters we passed are the same for Spark 2.4*
> and Spark 3*.
>
> It would be helpful to know whether any of you have encountered the same
> issue in your benchmarking activities. If so, please share your input on
> what could be the reason behind this poor performance.
>
> --
> Senthil Kumar
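
As a rough check on the numbers quoted in this thread, the following plain-Scala sketch averages the run times Luca reported and sets the resulting ratio next to the 1.5 min vs 2.5 min figures from the original report. The helper `meanMs` and the variable names are illustrative only; the timings are copied from the thread, and two or three runs per version is too small a sample for firm conclusions.

```scala
// Wall-clock times in ms, copied from the spark.time(...) output quoted in the thread.
val spark245 = Seq(256812L, 226571L, 305508L)
val spark312 = Seq(235356L, 236284L)

// Arithmetic mean of a list of timings, in ms.
def meanMs(xs: Seq[Long]): Double = xs.sum.toDouble / xs.size

val mean245 = meanMs(spark245)
val mean312 = meanMs(spark312)

// A ratio above 1 would mean Spark 3.1.2 is slower on average; below 1, faster.
val ratio = mean312 / mean245

println(f"Spark 2.4.5 mean: ${mean245 / 1000}%.1f s")
println(f"Spark 3.1.2 mean: ${mean312 / 1000}%.1f s")
println(f"3.1.2 / 2.4.5:    $ratio%.2f")

// The originally reported run times, for comparison: 1.5 min vs 2.5 min.
val reportedRatio = 2.5 / 1.5
println(f"Reported 3.x / 2.4.5: $reportedRatio%.2f")
```

In Luca's runs the ratio comes out below 1, whereas the reported figures give a ratio well above 1, which is why comparing the plans and configuration of the two setups is the natural next step.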