Hi Michael,

This sounds like a good idea. Can you open a JIRA to track this?

My initial feedback on your proposal would be that you might want to
express the no_collapse at the expression level and not at the plan level.

HTH

On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <michael.sty...@shopify.com>
wrote:

> Hello,
>
> I am in the process of putting together a PR that introduces a new hint
> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
> hint.
>
> Let me first give an example of why I am proposing this.
>
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
> df2["ua"].browser_version.alias("c2"))
> df3.explain(True)
>
> == Parsed Logical Plan ==
> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS
> c2#91]
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>    +- LogicalRDD [id#80L, user_agent#81]
>
> == Analyzed Logical Plan ==
> c1: string, c2: string
> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS
> c2#91]
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>    +- LogicalRDD [id#80L, user_agent#81]
>
> == Optimized Logical Plan ==
> Project [UDF(user_agent#81).device_form_factor AS c1#90,
> UDF(user_agent#81).browser_version AS c2#91]
> +- LogicalRDD [id#80L, user_agent#81]
>
> == Physical Plan ==
> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
> UDF(user_agent#81).browser_version AS c2#91]
> +- Scan ExistingRDD[id#80L,user_agent#81]
>
> user_agent_details is a user-defined function that returns a struct. As
> can be seen from the generated query plan, the function is being executed
> multiple times which could lead to performance issues. This is due to the
> CollapseProject optimizer rule that collapses adjacent projections.
>
> I'm proposing a hint that prevent the optimizer from collapsing adjacent
> projections. A new function called 'no_collapse' would be introduced for
> this purpose. Consider the following example and generated query plan.
>
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
> df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_
> agent"])))
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
> df2["ua"].browser_version.alias("c2"))
> df3.explain(True)
>
> == Parsed Logical Plan ==
> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS
> c2#76]
> +- NoCollapseHint
>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>       +- LogicalRDD [id#64L, user_agent#65]
>
> == Analyzed Logical Plan ==
> c1: string, c2: string
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- NoCollapseHint
>    +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>       +- LogicalRDD [id#64L, user_agent#65]
>
> == Optimized Logical Plan ==
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- NoCollapseHint
>    +- Project [UDF(user_agent#65) AS ua#69]
>       +- LogicalRDD [id#64L, user_agent#65]
>
> == Physical Plan ==
> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- *Project [UDF(user_agent#65) AS ua#69]
>    +- Scan ExistingRDD[id#64L,user_agent#65]
>
> As can be seen from the query plan, the user-defined function is now
> evaluated once per row.
>
> I would like to get some feedback on this proposal.
>
> Thanks.
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com

[image: http://databricks.com] <http://databricks.com/>


[image: Join Databricks at Spark Summit 2017 in San Francisco, the world's
largest event for the Apache Spark community.] <http://ssum.it/2mKQ3te>

Reply via email to