That makes sense. Thanks Alexey!

From: Alexey Romanenko <aromanenko....@gmail.com>
Date: Tuesday, July 6, 2021 at 10:33 AM
To: Tao Li <t...@zillow.com>
Cc: Yuchu Cao <yuc...@trulia.com>, "user@beam.apache.org" <user@beam.apache.org>
Subject: Re: Beam Calcite SQL SparkRunner Performance

I think it’s quite expected, since Spark may push down the SQL query (or some 
parts of it) to the IO and/or RDD level and apply different types of 
optimisations there, whereas Beam SQL translates an SQL query into a general 
Beam pipeline, which is then translated by SparkRunner into a Spark pipeline (in 
your case).
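
For reference, a minimal sketch of that path (assuming Java and SqlTransform; 
the inline test rows and options handling below are placeholders, not your 
actual Parquet read):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.extensions.sql.SqlTransform;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.Row;

  public class BeamSqlSketch {
    public static void main(String[] args) {
      // --runner=SparkRunner would be passed through spark-submit in your setup.
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Placeholder stand-in for the 35 GB Parquet read (ParquetIO in the real job).
      Schema schema = Schema.builder()
          .addInt64Field("auction").addInt64Field("price").build();
      PCollection<Row> bids = p.apply(Create.of(
              Row.withSchema(schema).addValues(1007L, 5L).build(),
              Row.withSchema(schema).addValues(999L, 3L).build())
          .withRowSchema(schema));

      // Calcite plans the query, expands it into ordinary Beam PTransforms,
      // and SparkRunner then translates those transforms into Spark stages.
      bids.apply(SqlTransform.query(
          "SELECT auction, price FROM PCOLLECTION WHERE auction = 1007"));

      p.run().waitUntilFinish();
    }
  }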

So, potentially we can also have some push-downs here, like the schema 
projection that we already have for ParquetIO (see the sketch below). I believe 
that “filters” could be the next step, but joins could be tricky since for now 
they are based on other Beam PTransforms.
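
A minimal sketch of that existing ParquetIO projection (assuming Avro 
SchemaBuilder schemas and ParquetIO's withProjection(projectionSchema, 
encoderSchema); the path and field names are placeholders):

  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.beam.sdk.io.parquet.ParquetIO;

  // Full record schema of the Parquet files (placeholder definition).
  Schema fullSchema = SchemaBuilder.record("Bid").fields()
      .requiredLong("auction").requiredLong("price")
      .requiredLong("bidder").requiredString("extra")
      .endRecord();

  // Only the columns the query needs; the reader can then skip the other
  // column chunks on disk instead of deserializing whole records.
  Schema projection = SchemaBuilder.record("BidProjected").fields()
      .requiredLong("auction").requiredLong("price")
      .endRecord();

  ParquetIO.Read read = ParquetIO.read(fullSchema)
      .from("/path/to/nexmark_bid/*.parquet")  // placeholder path
      .withProjection(projection, projection); // (projection schema, encoder schema)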

—
Alexey


On 6 Jul 2021, at 04:39, Tao Li <t...@zillow.com> wrote:

@Alexey Romanenko <aromanenko....@gmail.com> do you have any thoughts on this 
issue? It looks like the DAG compiled by “Beam on Spark” has many more stages 
than native Spark, which results in more shuffling and thus longer processing 
time.

From: Yuchu Cao <yuc...@trulia.com>
Date: Monday, June 28, 2021 at 8:09 PM
To: "user@beam.apache.org" <user@beam.apache.org>
Cc: Tao Li <t...@zillow.com>
Subject: Beam Calcite SQL SparkRunner Performance

Hi Beam community,

We are trying to compare the performance of Beam SQL on Spark with native 
Spark. The query used for the comparison is below. The nexmark_bid data is 
stored in Parquet format, and the file size is about 35 GB.

  SELECT auction, price FROM nexmark_bid WHERE auction = 1007 OR auction = 1020
  OR auction = 2001 OR auction = 2019 OR auction = 1087

We noticed that the Beam Spark job execution had 16 stages in total, while the 
native Spark job had only 2 stages, and the native Spark job was 7 times faster 
than the Beam Spark job with the same resource allocation settings in the 
spark-submit commands.
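
For comparison, the native Spark job is essentially the following (a sketch 
using the Java Dataset API; the input and output paths are placeholders):

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class NativeSparkSketch {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .appName("nexmark-bid").getOrCreate();

      // Catalyst pushes both the column projection and the auction filter
      // down into the Parquet scan, so the query needs very few stages.
      Dataset<Row> bids = spark.read().parquet("/path/to/nexmark_bid"); // placeholder
      bids.createOrReplaceTempView("nexmark_bid");

      spark.sql("SELECT auction, price FROM nexmark_bid "
              + "WHERE auction = 1007 OR auction = 1020 OR auction = 2001 "
              + "OR auction = 2019 OR auction = 1087")
          .write().parquet("/path/to/output"); // placeholder sink

      spark.stop();
    }
  }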

Is there any reason why the Beam Spark job execution created more stages and 
MapPartitionsRDDs than native Spark? Can the performance of such a query be 
improved in any way? Thank you!

Beam Spark job stages and stage 11 DAG:
[attached: image001.png, image002.png]



Native Spark job stages and stage 1 DAG:
[attached: image003.png, image004.png]
