Hi Michael, This sounds like a good idea. Can you open a JIRA to track this?
My initial feedback on your proposal would be that you might want to express the no_collapse at the expression level and not at the plan level. HTH On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles <michael.sty...@shopify.com> wrote: > Hello, > > I am in the process of putting together a PR that introduces a new hint > called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE > hint. > > Let me first give an example of why I am proposing this. > > df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) > df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) > df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), > df2["ua"].browser_version.alias("c2")) > df3.explain(True) > > == Parsed Logical Plan == > 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS > c2#91] > +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] > +- LogicalRDD [id#80L, user_agent#81] > > == Analyzed Logical Plan == > c1: string, c2: string > Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS > c2#91] > +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] > +- LogicalRDD [id#80L, user_agent#81] > > == Optimized Logical Plan == > Project [UDF(user_agent#81).device_form_factor AS c1#90, > UDF(user_agent#81).browser_version AS c2#91] > +- LogicalRDD [id#80L, user_agent#81] > > == Physical Plan == > *Project [UDF(user_agent#81).device_form_factor AS c1#90, > UDF(user_agent#81).browser_version AS c2#91] > +- Scan ExistingRDD[id#80L,user_agent#81] > > user_agent_details is a user-defined function that returns a struct. As > can be seen from the generated query plan, the function is being executed > multiple times which could lead to performance issues. This is due to the > CollapseProject optimizer rule that collapses adjacent projections. > > I'm proposing a hint that prevent the optimizer from collapsing adjacent > projections. A new function called 'no_collapse' would be introduced for > this purpose. Consider the following example and generated query plan. > > df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) > df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_ > agent"]))) > df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), > df2["ua"].browser_version.alias("c2")) > df3.explain(True) > > == Parsed Logical Plan == > 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS > c2#76] > +- NoCollapseHint > +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > > == Analyzed Logical Plan == > c1: string, c2: string > Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS > c2#76] > +- NoCollapseHint > +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > > == Optimized Logical Plan == > Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS > c2#76] > +- NoCollapseHint > +- Project [UDF(user_agent#65) AS ua#69] > +- LogicalRDD [id#64L, user_agent#65] > > == Physical Plan == > *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS > c2#76] > +- *Project [UDF(user_agent#65) AS ua#69] > +- Scan ExistingRDD[id#64L,user_agent#65] > > As can be seen from the query plan, the user-defined function is now > evaluated once per row. > > I would like to get some feedback on this proposal. > > Thanks. > > -- Herman van Hövell Software Engineer Databricks Inc. hvanhov...@databricks.com +31 6 420 590 27 databricks.com [image: http://databricks.com] <http://databricks.com/> [image: Join Databricks at Spark Summit 2017 in San Francisco, the world's largest event for the Apache Spark community.] <http://ssum.it/2mKQ3te>