discussed with one of our user groups, in Spark 3.0, this is called "dynamic partition pruning"
Yang Liu <[email protected]> 于2020年3月6日周五 下午6:12写道: > Hi, > > I am wondering if Calcite will support "lazy optimization" (execution time > optimization / runtime optimization). > > For example, we want to do an inner join between an Elasticsearch table > and a MySQL table, like this: > > WITH logic_table_2 AS > (SELECT _MAP['status'] AS "status", > _MAP['user'] AS "user" > FROM "es"."insight-by-sql-v3" > LIMIT 12345) > SELECT * > FROM "insight_user"."user_tab" AS t1 > JOIN logic_table_2 AS t2 ON t1."email" = t2."user" > WHERE t2."status" = 'fail' > LIMIT 10 > > t2 is a ES table and t1 is a MySQL table, and it may generate a execution > plan like this: > > EnumerableProject(id=[$2], name=[$3], email=[$4], desc=[$5], > is_super=[$6], create_time=[$7], has_all_access=[$8], status=[$0], > user=[$1]) > EnumerableLimit(fetch=[10]) > EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner]) > ElasticsearchToEnumerableConverter > ElasticsearchProject(status=[ITEM($0, 'status')], user=[ITEM($0, > 'user')]) > ElasticsearchFilter(condition=[=(ITEM($0, 'status'), 'fail')]) > ElasticsearchSort(fetch=[12345]) > ElasticsearchTableScan(table=[[es, insight-by-sql-v3]]) > JdbcToEnumerableConverter > JdbcTableScan(table=[[insight_user, user_tab]]) > > since here ES query has a filter, in execution Calcite will do the ES > query first and get the build table, and then do JdbcTableScan and get the > probe table, and do the HashJoin finally. > > But, since this is a INNER JOIN, there is an implicit filter on the later > JdbcTableScan: > ``` t1.email in (select user from t2 where t2.status='fail') ```, if > applying this implicit filter, the dataset we will handle may become > extremely small (save memory) and running much faster since the full > JdbcTableScan is always time-wasting. But since Calcite do the optimization > in planner phase, this dynamic/lazy optimization seems missed ... > > To summarize, serial execution with a "lazy optimization" may be faster > and use less memory than parallel execution with an optimized execution > plan since the former one can reduce dataset we handle. > > Any ideas? > > >
