Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]

via GitHub Thu, 30 Jan 2025 23:58:23 -0800


shehabgamin commented on issue #5600:
URL: https://github.com/apache/datafusion/issues/5600#issuecomment-2626525138


   > > I will say that we have encountered numerous problems relying on 
downstream DataFusion-based crates,...The issue isn't with the crates 
themselves but arises when it's time to upgrade DataFusion versions, requiring 
us to wait for each crate to update and release a new version.
   >
   > Indeed -- I think the key challenge is mustering enough maintainer 
bandwidth across the crates to keep them up to date / updated quickly.
   
   Thinking through this further, the release cycle bottleneck I mentioned 
earlier wouldn’t apply here. With @linhr and me included as repo owners, 
version updates wouldn't face delays. Comet and Sail would naturally keep the 
codebase synchronized with each DataFusion release since we would have to 
handle these updates during our pre-release testing process anyway.
   
   With that in mind, a joint effort on something in the main DataFusion repo 
or a `datafusion-contrib` repo could both work, and we are open to either 
option.
   
   >
   > However, assuming that most DataFusion contributors would be unhappy about 
running Apache Spark as part of the test suite (I think this is a safe 
assumption), we need a testing plan.
   
   Agreed, we should not bring in Spark as a hard dependency of the DataFusion 
core test suite. 
   
   Because there is no JVM involvement when running a Sail server, it would be 
a relatively straightforward process to port function tests from Sail upstream. 
Functions are tested in three different scopes:
   
   1. **Gold Data Tests**
   Gold Data tests are SQL queries; functions get tested using `Scalar` input. 
We should be able to port over the vast majority of Gold Data tests for 
functions upstream to the `SLT`s.
   The only caveat here is that Sail does not depend on DataFusion’s 
`SqlToRel`. We have to roll out our own implementation since Spark’s SQL 
semantics (Hive dialect) are different from DataFusion’s SQL semantics.
   The implication of this is that some Gold Data tests can’t be ported 
upstream. However, I expect that the number of tests we can’t port over will be 
very small. The vast majority of Gold Data tests specifically for functions are 
very basic queries (e.g., `SELECT ascii(2)`).
   
   2. **Spark Functions Doc Tests**
   These tests use the PySpark client (no JVM is required), and functions 
generally get tested using `Array` input (e.g., by calling a function with some 
DataFrame column as input). We should be able to translate almost all of these 
tests (if not all) to SQL.
   
   3. **Spark Connect Functions Tests**
   Thankfully, these tests are in the minority because they are the most 
impractical to port over. The JVM is required because results are being 
compared to the output from a Spark server.
   
   > In Comet, we rely on integration tests that run Spark with and without 
Comet enabled and compare results. We also run a subset of Spark's test suite 
with Comet enabled.
   >
   > If we move the spark-expr crate into the main DataFusion repository, we 
will have to rely more on unit testing to demonstrate correct behavior. This is 
not a bad thing, of course, but it makes reviews much harder. There is an 
increased risk of PRs being accepted that are not fully compatible with Spark, 
and we won't find out in Comet until we upgrade to the next DataFusion version 
and run the Spark tests.
   
   > Echoing [@andygrove](https://github.com/andygrove) 's point. Also if we 
move the spark-expr to DataFusion core, release management might get harder. 
E.g. we may want to fix spark-expr bugs quickly, but after moving the 
spark-expr to DataFusion, we first need to release DataFusion before releasing 
in Comet (we can use github branch/hashid potentially, but still not straight 
forward even though the code gets streamlined)
   
   These are strong points, and they highlight why maintaining a joint effort 
in a standalone `datafusion-contrib` repo could be the best approach. A 
standalone repo would give us more flexibility in the test setup.
   
   @alamb @andygrove @kazuyukitanimura 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]

Reply via email to