sgrebnov opened a new pull request, #13132: URL: https://github.com/apache/datafusion/pull/13132
## Which issue does this PR close?
With filters pushdown optimization enabled, `TableScan` filters and
`LogicalPlan::Filter` w/o dedicated Projection incorrectly apply after join
statement which leads to incorrect result, for example
Original query (TPC-H Q13)
```sql
select
c_custkey,
count(o_orderkey)
from
customer left join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
```
is unparsed as below. You can see that original filter from Join is unparsed
after join as `where`. The difference is that original query will contain
customer records where there is no matching order and the second version does
not (filter is applied after join).
```sql
select
"customer"."c_custkey",
count("orders"."o_orderkey")
from
"customer"
left join "orders" on
("customer"."c_custkey" = "orders"."o_custkey")
where
"orders"."o_comment" not like '%special%requests%'
group by
"customer"."c_custkey"
order by "customer"."c_custkey" asc
```
## What changes are included in this PR?
The PR improves unparsing by determining if the join is a simple table scan
(not a subquery that will be translated to a full subquery (`SELECT .. FROM
..`)) so that filters must be applied directly to the join. In the case of a
full subquery, filters will be applied within the subquery, and this works
correctly. This fixes TPC-H Q12 (filters on the left join were missing/not
applied at all) and TPC-H Q13 (the filter was applied after the join instead of
during/before, as shown above).
```console
SELECT "customer"."c_custkey", count("orders"."o_orderkey") FROM "customer"
LEFT JOIN "orders" ON (("customer"."c_custkey" = "orders"."o_custkey") AND
"orders"."o_comment" NOT LIKE '%special%requests%') GROUP BY
"customer"."c_custkey"
sql> explain select
c_custkey,
count(o_orderkey)
from
customer left join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | BytesProcessedNode
|
| | Federated
|
| | Projection: customer.c_custkey, count(orders.o_orderkey)
|
| | Aggregate: groupBy=[[customer.c_custkey]],
aggr=[[count(orders.o_orderkey)]]
|
| | Left Join: Filter: customer.c_custkey =
orders.o_custkey
|
| | TableScan: customer
|
| | TableScan: orders, full_filters=[orders.o_comment
NOT LIKE Utf8("%special%requests%")]
|
| physical_plan | BytesProcessedExec
|
| | SchemaCastScanExec
|
| | VirtualExecutionPlan name=postgres
compute_context=host=Tcp("localhost"),port=6432,db=tpch_sf1,user=postgres,
sql=SELECT customer.c_custkey, count(orders.o_orderkey) FROM customer LEFT JOIN
orders ON ((customer.c_custkey = orders.o_custkey) AND orders.o_comment NOT
LIKE '%special%requests%') GROUP BY customer.c_custkey rewritten_sql=SELECT
"customer"."c_custkey", count("orders"."o_orderkey") FROM "customer" LEFT JOIN
"orders" ON (("customer"."c_custkey" = "orders"."o_custkey") AND
"orders"."o_comment" NOT LIKE '%special%requests%') GROUP BY
"customer"."c_custkey" |
| |
|
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
**Note:** The alternative approach considered was always unparsing join
sub-nodes by adding a wildcard projection if needed. However, it requires
introducing additional alias for a subquery (and updating corresponding upper
nodes to match alias added) as not all engines support adding subquery with the
same name as original table, for example `JOIN (SELECT * FROM "orders" WHERE
"orders"."o_comment" not like '%special%requests%') as orders ..` - `as orders`
needs to be replaced with new alias for most engines
## Are these changes tested?
Added unit test, tested as part of TPC-H and TPC-DS queries unparsing by
https://github.com/spiceai/spiceai (running benchmarks with some filters
pushdown optimizations enabled)
## Are there any user-facing changes?
Fixes unparsing issues related to incorrectly generated 'JOIN' filters
(incorrect query result).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
