robtandy opened a new pull request, #13888: URL: https://github.com/apache/datafusion/pull/13888
## Which issue does this PR close? I've been investigating and experimenting with federating tpch query plans and sending the federated portion downstream encoded as substrait protos (Side node, is this useful or worth sharing?). When doing this I discovered the bug in issue #13860 . CC @alamb @Blizzara as we discussed the PR for that issue, and your input would be appreciated here as well. In the course of that work, I discovered that there isn't testing coverage for round trip from TPCH query -> logical plan -> substrait -> logical plan. My understanding of how do this is the following: - generate optimized logical plan from tpch query - encode as substrait - send elsewhere - decode into logical plan - optimize this logical plan as our local table providers are different than the ones where the previous optimization was made After this the recreated plan should be the same and produce the same result when executed. This is my understanding from looking at tests for round trip in to substrait for arbitrary queries from https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266 This has worked for me when federating portions of the plan, and out of curiosity I wanted to know if it works for the entire plan. It does not work for the entire plan in most cases as these tests show. In most cases it looks like we need to support more of the `LogicalPlan` nodes when serializing to substrait. So, I've added this round trip for all TPCH queries. I generated the TPCH queries from https://duckdb.org/docs/extensions/tpch.html. I then generated serialized json schemas which can be loaded into a context with a fake `TableProvider`. The other way I could see this happening is to skip optimization before, and compared the unoptimized plan upon deserialization for correctness. Then optimize and execute. I've added tests for both paths, but if the unoptimized way is not intended to work, then we can remove those. If my understanding of how this works is correct, then these tests are probably useful. If it is not, then I hope that conversation here can help me understand it and I'll improve the tests. 😄 ## Rationale for this change More clearly highlight and identify missing features in substrait serialization. ## What changes are included in this PR? tests added ## Are these changes tested? CI will test them ## Are there any user-facing changes? No, just tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org