2010YOUY01 opened a new issue, #14535:
URL: https://github.com/apache/datafusion/issues/14535

   ### Is your feature request related to a problem or challenge?
   
   This is a project idea for GSoC 2025: 
https://github.com/apache/datafusion/issues/14478
   
   `datafusion-sqlancer` is a SQL-level fuzz testing implementation for 
DataFusion (see https://github.com/apache/datafusion/issues/11030).
   
   ## Current implementation status
   `datafusion-sqlancer` currently covers a subset of SQL features and data 
types, and implements 3 relatively simple testing oracles[^1]. Occasional 
manual runs have found around 50 bugs.
   The implementation is in Java, and it's a fork of the original 
[SQLancer](https://github.com/sqlancer/sqlancer).
   
   ## Why rewrite in Rust
   SQLancer was first implemented in Java for very good reasons: it has to 
test the effectiveness of several testing oracles on many major databases, and 
JDBC is a common interface to them.
   DataFusion's SQLancer implementation currently extends the SQLancer 
framework, which has saved us some effort on CLI parsing, result comparison, 
etc.
   
   There are several reasons I think it's a good idea to rewrite in Rust at 
this point:
   - (major) **Making test oracles also apply to `sqllogictests`**
     `datafusion-sqlancer` consists of two modules: random query generation 
and property validation for test oracles. Those properties can also be applied 
to enhance existing [SQL 
tests](https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest). 
If those properties are implemented in Rust, enhancing the existing 
`sqllogictest`s would be easier. 
     Only 3 simple test oracles have been implemented so far, and I believe 
around 10 novel SQL testing algorithms have been proposed. One example is 
[Equivalent Expression 
Transformation](https://www.usenix.org/conference/osdi24/presentation/jiang) 
(EET), which I think is well suited to enhancing existing SQL tests.
     Overall, I think it's a good time to switch to a native Rust 
implementation before implementing more complex testing algorithms.
   - **Simpler implementation**
     We would no longer need JDBC to connect the testing framework to the 
DataFusion core, configuration fuzzing would be easier, and there might be 
existing code we can reuse.
   - **More contributors**
     The DataFusion ecosystem is mainly in Rust, so IMO it would be easier to 
find people to help if the testing framework is written in Rust instead of Java.
     
   
   
   [^1]: https://github.com/apache/datafusion/issues/11030 has a minimal 
example for testing oracle `NoREC`
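
   To make the `NoREC` idea concrete, here is a minimal sketch in Rust of how 
such an oracle could build its query pair. It is a simplified illustration and 
not the code in `datafusion-sqlancer`: `NorecPair`, `norec_pair`, and 
`norec_holds` are hypothetical names, and the table `t0` and predicate 
`v1 > 10` are made up. The first query keeps the predicate in `WHERE`, where 
the optimizer can act on it; the second evaluates the same predicate per row in 
the projection, and the two counts are expected to match.

```rust
/// A pair of queries that a NoREC-style oracle expects to agree.
/// (Hypothetical type, used only for this illustration.)
struct NorecPair {
    /// Predicate sits in WHERE, so the optimizer is free to act on it.
    optimized: String,
    /// Same predicate evaluated per row in the projection instead.
    unoptimized: String,
}

/// Build both queries from a table name and a boolean predicate.
fn norec_pair(table: &str, predicate: &str) -> NorecPair {
    NorecPair {
        optimized: format!("SELECT COUNT(*) FROM {} WHERE {}", table, predicate),
        // CASE maps both FALSE and NULL to 0, matching WHERE semantics.
        // COALESCE covers the empty-table case, where SUM yields NULL.
        unoptimized: format!(
            "SELECT COALESCE(SUM(CASE WHEN {} THEN 1 ELSE 0 END), 0) FROM {}",
            predicate, table
        ),
    }
}

/// The property itself: both queries must report the same row count.
fn norec_holds(optimized_count: i64, unoptimized_count: i64) -> bool {
    optimized_count == unoptimized_count
}

fn main() {
    let pair = norec_pair("t0", "v1 > 10");
    println!("{}", pair.optimized);
    println!("{}", pair.unoptimized);
    // In a real run the two counts would come from executing both queries.
    println!("property holds for (3, 3): {}", norec_holds(3, 3));
}
```

   Because the pair is built from a table and a predicate alone, the same 
helper could in principle be fed predicates taken from existing SQL tests 
rather than randomly generated ones.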
   
   ### Describe the solution you'd like
   
   See https://github.com/apache/datafusion/issues/11030 for the background
   - Generate random queries as a DataFusion internal data structure (perhaps 
`Statement`)
   - Implement testing oracles. To also support running against existing SQL 
tests, we might want:
     - For query mutation: mutate the query's internal representation, and 
convert it back to a SQL string
     - For property check: implement by extending the `sqllogictest` framework
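
   The mutate-then-check flow above could look roughly like the sketch below. 
Everything in it is an assumption made for illustration: `Expr` is a toy tree 
standing in for a real SQL AST (not a DataFusion type), `double_negate` is just 
one simple equivalence-preserving rewrite in the spirit of EET rather than a 
rule taken from the paper, and `same_rows` compares result sets that a real 
implementation would get by executing both queries.

```rust
use std::fmt;

/// Toy expression tree standing in for a real SQL AST (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Convert the internal representation back to a SQL string.
impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Literal(v) => write!(f, "{}", v),
            Expr::Gt(l, r) => write!(f, "({} > {})", l, r),
            Expr::Not(e) => write!(f, "(NOT {})", e),
        }
    }
}

/// Query mutation: wrap a boolean expression in a double negation.
/// Under SQL three-valued logic, NOT (NOT p) has the same value as p,
/// including when p is NULL.
fn double_negate(p: &Expr) -> Expr {
    Expr::Not(Box::new(Expr::Not(Box::new(p.clone()))))
}

/// Render a filter query from a table name and a predicate.
fn select_where(table: &str, predicate: &Expr) -> String {
    format!("SELECT * FROM {} WHERE {}", table, predicate)
}

/// Property check: both result sets hold the same rows, ignoring order.
fn same_rows(a: &[Vec<String>], b: &[Vec<String>]) -> bool {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort();
    b.sort();
    a == b
}

fn main() {
    let p = Expr::Gt(
        Box::new(Expr::Column("v1".to_string())),
        Box::new(Expr::Literal(10)),
    );
    println!("{}", select_where("t0", &p));
    println!("{}", select_where("t0", &double_negate(&p)));

    // Stand-in results; a real oracle would execute both queries.
    let original = vec![vec!["11".to_string()], vec!["42".to_string()]];
    let mutated = vec![vec!["42".to_string()], vec!["11".to_string()]];
    println!("results match: {}", same_rows(&original, &mutated));
}
```

   Keeping the rewrite on the tree and only rendering SQL at the end is what 
would let the same mutation be applied to queries parsed from existing test 
files as well as to randomly generated ones.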
   
   ### Describe alternatives you've considered
   
   I believe the project idea proposed above is advanced in terms of 
difficulty. 
   A medium-level project could instead extend the existing implementation 
with support for more SQL features and types, implement more test oracles, and 
improve CI integration.
   
   ### Additional context
   
   _No response_

