Blizzara commented on PR #13888: URL: https://github.com/apache/datafusion/pull/13888#issuecomment-2560214683
I like the idea, more testing the better! We already have some Substrait TCP testing, but I think that's from "known Substrait" -> DF, so it only tests the consumer, while this would test the producer as well. I think roughly there should be three levels of "equality": 1. a plan goes through roundtrip without crashing, but doesn't match equality (this doesn't necessarily mean the result is correct, but it shows that there aren't any totally unsupported LogicalPlan nodes) 2. "optimized" plan equality - calling optimize on both plans before comparing, like you do. 3. "unoptimized" plan equality - comparing the plans directly. I feel like (2) should be "easier" check than (3), though the tests you have here show that some plans would pass unoptimized but not optimized, which confuses me. Overall, I think (3) is nice to have, but not something I'd spent too much effort for - it doesn't really bring that much benefits to users compared to (2). (2) is nice compared to (1) in that it guarantees a bit more correctness, but in many cases it's unfortunately tough to support even (2) (even in theory, roundtrip isn't lossless, since multiple DF plans may result in the same Substrait plan, and the other way around too). So I'd suggest splitting the TPCH tests into 4 categories: - those that pass (3) - those that pass (2) - those that pass (1) - maybe eyeballed into a) those that look correct after roundtrip even if the plan doesn't fully match, and b) those that look incorrect, if any - those that fail to convert for the roundtrip That allows merging the tests, preventing regressions, and tracking needed improvements :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org