robtandy opened a new pull request, #13888:
URL: https://github.com/apache/datafusion/pull/13888

   ## Which issue does this PR close?
   
   I've been investigating and experimenting with federating tpch query plans 
and sending the federated portion downstream encoded as substrait protos (Side 
node, is this useful or worth sharing?).   When doing this I discovered the bug 
in issue #13860 .   CC @alamb @Blizzara as we discussed the PR for that issue, 
and your input would be appreciated here as well.
   
   In the course of that work, I discovered that there isn't testing coverage 
for round trip from TPCH query -> logical plan -> substrait -> logical plan.
   
   My understanding of how do this is the following:
   - generate optimized logical plan from tpch query
   - encode as substrait
   - send elsewhere
   - decode into logical plan
   - optimize this logical plan as our local table providers are different than 
the ones where the previous optimization was made
   
   After this the recreated plan should be the same and produce the same result 
when executed.  This is my understanding from looking at tests for round trip 
in to substrait for arbitrary queries  from 
https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L1242-L1266
   
   This has worked for me when federating portions of the plan, and out of 
curiosity I wanted to know if it works for the entire plan.   It does not work 
for the entire plan in most cases as these tests show.   In most cases it looks 
like we need to support more of the `LogicalPlan` nodes when serializing to 
substrait.  
   
   So, I've added this round trip for all TPCH queries.  I generated the TPCH 
queries from https://duckdb.org/docs/extensions/tpch.html.   I then generated 
serialized json schemas which can be loaded into a context with a fake 
`TableProvider`.  
   
   The other way I could see this happening is to skip optimization before, and 
compared the unoptimized plan upon deserialization for correctness.   Then 
optimize and execute. 
   
   I've added tests for both paths, but if the unoptimized way is not intended 
to work, then we can remove those.
   
   If my understanding of how this works is correct, then these tests are 
probably useful.  If it is not, then I hope that conversation here can help me 
understand it and I'll improve the tests. 😄 
   
   ## Rationale for this change
   
   More clearly highlight and identify missing features in substrait 
serialization.
   
   ## What changes are included in this PR?
   
   tests added
   
   ## Are these changes tested?
   
   CI will test them
   
   ## Are there any user-facing changes?
   
   No, just tests.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to