shehabgamin commented on issue #5600: URL: https://github.com/apache/datafusion/issues/5600#issuecomment-2626525138
> > I will say that we have encountered numerous problems relying on downstream DataFusion-based crates,...The issue isn't with the crates themselves but arises when it's time to upgrade DataFusion versions, requiring us to wait for each crate to update and release a new version. > > Indeed -- I think the key challenge is mustering enough maintainer bandwidth across the crates to keep them up to date / updated quickly. Thinking through this further, the release cycle bottleneck I mentioned earlier wouldn’t apply here. With @linhr and me included as repo owners, version updates wouldn't face delays. Comet and Sail would naturally keep the codebase synchronized with each DataFusion release since we would have to handle these updates during our pre-release testing process anyway. With that in mind, a joint effort on something in the main DataFusion repo or a `datafusion-contrib` repo could both work, and we are open to either option. > > However, assuming that most DataFusion contributors would be unhappy about running Apache Spark as part of the test suite (I think this is a safe assumption), we need a testing plan. Agreed, we should not bring in Spark as a hard dependency of the DataFusion core test suite. Because there is no JVM involvement when running a Sail server, it would be a relatively straightforward process to port function tests from Sail upstream. Functions are tested in three different scopes: 1. **Gold Data Tests** Gold Data tests are SQL queries; functions get tested using `Scalar` input. We should be able to port over the vast majority of Gold Data tests for functions upstream to the `SLT`s. The only caveat here is that Sail does not depend on DataFusion’s `SqlToRel`. We have to roll out our own implementation since Spark’s SQL semantics (Hive dialect) are different from DataFusion’s SQL semantics. The implication of this is that some Gold Data tests can’t be ported upstream. However, I expect that the number of tests we can’t port over will be very small. The vast majority of Gold Data tests specifically for functions are very basic queries (e.g., `SELECT ascii(2)`). 2. **Spark Functions Doc Tests** These tests use the PySpark client (no JVM is required), and functions generally get tested using `Array` input (e.g., by calling a function with some DataFrame column as input). We should be able to translate almost all of these tests (if not all) to SQL. 3. **Spark Connect Functions Tests** Thankfully, these tests are in the minority because they are the most impractical to port over. The JVM is required because results are being compared to the output from a Spark server. > In Comet, we rely on integration tests that run Spark with and without Comet enabled and compare results. We also run a subset of Spark's test suite with Comet enabled. > > If we move the spark-expr crate into the main DataFusion repository, we will have to rely more on unit testing to demonstrate correct behavior. This is not a bad thing, of course, but it makes reviews much harder. There is an increased risk of PRs being accepted that are not fully compatible with Spark, and we won't find out in Comet until we upgrade to the next DataFusion version and run the Spark tests. > Echoing [@andygrove](https://github.com/andygrove) 's point. Also if we move the spark-expr to DataFusion core, release management might get harder. E.g. we may want to fix spark-expr bugs quickly, but after moving the spark-expr to DataFusion, we first need to release DataFusion before releasing in Comet (we can use github branch/hashid potentially, but still not straight forward even though the code gets streamlined) These are strong points, and they highlight why maintaining a joint effort in a standalone `datafusion-contrib` repo could be the best approach. A standalone repo would give us more flexibility in the test setup. @alamb @andygrove @kazuyukitanimura -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org