tokoko opened a new issue, #54603: URL: https://github.com/apache/spark/issues/54603
Add a native ADBC (Arrow Database Connectivity) data source to Spark, similar in spirit to the existing JDBC data source but built on the Arrow-native [ADBC](https://arrow.apache.org/adbc/) API.

ADBC is a database connectivity API standard under the Apache Arrow project. It provides a vendor-neutral, columnar alternative to JDBC/ODBC specifically designed for analytical workloads. ADBC drivers return result sets as streams of Arrow data rather than row by row, which eliminates expensive row-to-columnar conversions. Since Spark itself is row-based, the effect is not as dramatic, but it is still noticeable.

Why (now):

- There are mature native drivers for PostgreSQL, SQLite, DuckDB, Flight SQL, Snowflake, BigQuery, MySQL, SQL Server, Databricks and so on. They are also very easy to install (and locate) on a system with the [dbc](https://columnar.tech/dbc/) CLI tool.
- There is now good support for invoking ADBC from Java via JNI bindings to the C++ ADBC driver manager (see [blog](https://columnar.tech/blog/adbc-java/)). This makes it practical to integrate ADBC into Spark's JVM-based architecture. Technically, drivers can be implemented in Java as well, but the quality of the Java implementations is pretty low; realistically, one will almost always use a native driver.
- ADBC fits well with Spark's columnar read support in Data Source V2. Generating ArrowColumnVectors from ADBC is pretty straightforward. This can be a benefit for external Spark accelerators like Comet and (presumably) Photon.

I have a proof-of-concept implementation at [spark-adbc](https://github.com/tokoko/spark-adbc) that demonstrates the basic read path and includes some not-so-scientific benchmarks against JDBC.

I'm willing to incrementally implement ADBC data source support upstream if there's interest from the community.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
