GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/11647
[SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean
expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-13658
## What changes were proposed in this pull request?
Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to
filter out the partitions, the optimizer rule BooleanSimplification take about
2 seconds (it use lots of sematicsEqual, which require copy the whole tree).
It will great if we could speedup it.
[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
How to speed up it:
When we ask the canonicalized expression in `Expression`, it calls
`Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms
up all expressions included in this expression. However, we don't keep the
canonicalized versions for these children expressions. So in next time we ask
the canonicalized expressions for the children expressions (e.g.,
`BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them.
It wastes much time.
By forcing the children expressions to get and keep their canonicalized
versions first, we can avoid re-canonicalize these expressions.
I simply benchmark it with an expression which is part of the where clause
in TPCDS Q3:
val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int,
'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)
val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk ===
'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) &&
(('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) ||
('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk
> 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 &&
'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk
< 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) ||
('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk
> 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 &&
'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk
< 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) ||
('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk
> 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 &&
'ss_sold_date_sk < 24201
33) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) ||
('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk
> 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 &&
'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk
< 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) ||
('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk
> 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 &&
'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk
< 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) ||
('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk
> 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 &&
'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk
< 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) |
| ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) ||
('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk
> 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 &&
'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk
< 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) ||
('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk
> 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 &&
'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk
< 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) ||
('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk
> 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 &&
'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk
< 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('s
s_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk >
2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 &&
'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk
< 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) ||
('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk
> 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 &&
'ss_sold_date_sk < 2434743)))
val plan = testRelation.where(input).analyze
val actual = Optimize.execute(plan)
With this patch:
352 milliseconds
346 milliseconds
340 milliseconds
Without this patch:
585 milliseconds
880 milliseconds
677 milliseconds
## How was this patch tested?
Existing tests should pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 improve-expr-canonicalize
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11647.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11647
----
commit c6e84152de4cdd65aa90234cb5554ea669c89a5d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-03-11T05:09:50Z
Force expression children to be canonicalized before being canonicalized
itself.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]