[PR] Add substrait tpch round trip tests from sql query [datafusion]

via GitHub Mon, 23 Dec 2024 07:27:40 -0800


robtandy opened a new pull request, #13888:
URL: https://github.com/apache/datafusion/pull/13888

## Which issue does this PR close?

I've been investigating and experimenting with federating tpch query plans
and sending the federated portion downstream encoded as substrait protos (Side
node, is this useful or worth sharing?). When doing this I discovered the bug
in issue #13860 . CC @alamb @Blizzara as we discussed the PR for that issue,
and your input would be appreciated here as well.

In the course of that work, I discovered that there isn't testing coverage
for round trip from TPCH query -> logical plan -> substrait -> logical plan.

My understanding of how do this is the following:
- generate optimized logical plan from tpch query
- encode as substrait
- send elsewhere
- decode into logical plan
- optimize this logical plan as our local table providers are different than
the ones where the previous optimization was made

After this the recreated plan should be the same and produce the same result
when executed. This is my understanding from looking at tests for round trip
in to substrait for arbitrary queries from
https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266

This has worked for me when federating portions of the plan, and out of
curiosity I wanted to know if it works for the entire plan. It does not work
for the entire plan in most cases as these tests show. In most cases it looks
like we need to support more of the `LogicalPlan` nodes when serializing to
substrait.

So, I've added this round trip for all TPCH queries. I generated the TPCH
queries from https://duckdb.org/docs/extensions/tpch.html. I then generated
serialized json schemas which can be loaded into a context with a fake
`TableProvider`.

The other way I could see this happening is to skip optimization before, and
compared the unoptimized plan upon deserialization for correctness. Then
optimize and execute.

I've added tests for both paths, but if the unoptimized way is not intended
to work, then we can remove those.

If my understanding of how this works is correct, then these tests are
probably useful. If it is not, then I hope that conversation here can help me
understand it and I'll improve the tests. 😄

## Rationale for this change

More clearly highlight and identify missing features in substrait
serialization.

## What changes are included in this PR?

tests added

## Are these changes tested?

CI will test them

## Are there any user-facing changes?

No, just tests.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Add substrait tpch round trip tests from sql query [datafusion]

Reply via email to