I think it’s quite expected, since Spark may push the SQL query (or parts of it) down 
to the IO and/or RDD level and apply different types of optimisations there, whereas 
Beam SQL translates an SQL query into a generic Beam pipeline, which the SparkRunner 
then translates into a Spark pipeline (in your case). 
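
To illustrate the second path, here is a minimal, self-contained sketch (class name 
and in-memory rows are illustrative, not from this thread) of how a query like yours 
goes through SqlTransform: Calcite expands it into ordinary Beam transforms before 
the runner sees it, which is why the filter is not pushed into the source IO. Run 
with --runner=SparkRunner to execute on Spark.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BeamSqlSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Illustrative in-memory stand-in for the nexmark_bid table.
    Schema bidSchema = Schema.builder()
        .addInt64Field("auction")
        .addInt64Field("price")
        .build();
    PCollection<Row> bids = p.apply(Create.of(
            Row.withSchema(bidSchema).addValues(1007L, 25L).build(),
            Row.withSchema(bidSchema).addValues(3000L, 10L).build())
        .withRowSchema(bidSchema));

    // Calcite expands this query into generic Beam transforms (filter/project
    // as ParDos) before the SparkRunner ever sees it, so the WHERE clause is
    // not pushed down into the source.
    bids.apply(SqlTransform.query(
        "SELECT auction, price FROM PCOLLECTION "
            + "WHERE auction = 1007 OR auction = 1020 OR auction = 2001 "
            + "OR auction = 2019 OR auction = 1087"));

    p.run().waitUntilFinish();
  }
}
```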

So, potentially we can also have some push-downs here, like the schema projection 
that we already have for ParquetIO. I believe that filter push-down could be the next 
step, but joins could be tricky since they are currently based on other Beam 
PTransforms.
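
For reference, below is a hedged sketch of the column projection that ParquetIO 
already supports (the Avro schemas and the file path are made up for the example); 
only the projected columns are read from the Parquet files, which is the kind of 
push-down Beam SQL could eventually drive automatically.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class ParquetProjectionSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Illustrative full record schema for nexmark_bid.
    Schema fullSchema = SchemaBuilder.record("nexmark_bid").fields()
        .requiredLong("auction")
        .requiredLong("bidder")
        .requiredLong("price")
        .requiredLong("dateTime")
        .requiredString("extra")
        .endRecord();

    // Projection keeping only the two columns the query needs.
    Schema projection = SchemaBuilder.record("nexmark_bid_projected").fields()
        .requiredLong("auction")
        .requiredLong("price")
        .endRecord();

    PCollection<GenericRecord> bids =
        p.apply(ParquetIO.read(fullSchema)
            .from("s3://some-bucket/nexmark_bid/*.parquet") // illustrative path
            .withProjection(projection, projection));       // only these columns are read

    p.run().waitUntilFinish();
  }
}
```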

—
Alexey

> On 6 Jul 2021, at 04:39, Tao Li <t...@zillow.com> wrote:
> 
> @Alexey Romanenko <mailto:aromanenko....@gmail.com> do you have any thoughts 
> on this issue? Looks like the dag compiled by “Beam on Spark” has many more 
> stages than native spark, which results in more shuffling and thus longer 
> processing time.
>  
> From: Yuchu Cao <yuc...@trulia.com <mailto:yuc...@trulia.com>>
> Date: Monday, June 28, 2021 at 8:09 PM
> To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> <user@beam.apache.org <mailto:user@beam.apache.org>>
> Cc: Tao Li <t...@zillow.com <mailto:t...@zillow.com>>
> Subject: Beam Calcite SQL SparkRunner Performance 
>  
> Hi Beam community, 
>  
> We are trying to compare the performance of Beam SQL on Spark with native Spark. 
> The query used for the comparison is below. The nexmark_bid data is in Parquet 
> format and the file size is about 35 GB.
> SELECT auction, price FROM nexmark_bid WHERE auction = 1007 OR auction = 1020 
> OR auction = 2001 OR auction = 2019 OR auction = 1087
>  
> We noticed that the Beam Spark job execution had 16 stages in total, while the 
> native Spark job had only 2 stages, and the native Spark job was 7 times faster 
> than the Beam Spark job with the same resource allocation settings in the 
> spark-submit commands. 
>  
> Is there any reason why the Beam Spark job execution created more stages and 
> MapPartitionsRDDs than native Spark? Can the performance of such a query be 
> improved in any way? Thank you!
>  
> Beam Spark job stages and stage 11 DAG: 
>  
> <image001.png>
> <image002.png>
>  
>  
>  
> Native Spark job stages and stage 1 DAG: 
> <image003.png>
> <image004.png>
