tokoko opened a new issue, #54603:
URL: https://github.com/apache/spark/issues/54603

   Add a native ADBC (Arrow Database Connectivity) data source to Spark, 
similar in spirit to the existing JDBC data source but built on the 
Arrow-native [ADBC](https://arrow.apache.org/adbc/) API.
   
   ADBC is a database connectivity API standard under the Apache Arrow project. 
It provides a vendor-neutral, columnar alternative to JDBC/ODBC designed 
specifically for analytical workloads. ADBC drivers return result sets as 
streams of Arrow data rather than row by row, which eliminates expensive 
row-to-columnar conversions. Since Spark itself is row-based, the effect is not 
as dramatic, but it is still noticeable.
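
   As a toy illustration of the conversion cost mentioned above, the sketch 
below contrasts a row-oriented result, which has to be pivoted cell by cell 
into columns, with a result that already arrives in columnar batches. Plain 
Python lists stand in for Arrow arrays, and the names `rows_to_batch` and 
`stream_batches` are hypothetical; none of this is ADBC or Spark API.

```python
from typing import Dict, Iterable, Iterator, List, Sequence


def rows_to_batch(rows: Iterable[Sequence], column_names: List[str]) -> Dict[str, list]:
    """Row-oriented path: pivot rows into per-column lists, one append per cell."""
    columns: Dict[str, list] = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


def stream_batches(batches: Iterable[Dict[str, list]]) -> Iterator[Dict[str, list]]:
    """Columnar path: batches are already column-wise, so they pass through as-is."""
    for batch in batches:
        yield batch


# The same three records, once as rows and once as a single columnar batch.
rows = [(1, "a"), (2, "b"), (3, "c")]
pivoted = rows_to_batch(rows, ["id", "name"])
native = next(stream_batches([{"id": [1, 2, 3], "name": ["a", "b", "c"]}]))

# Both paths end with the same columnar data; only the row path paid for the pivot.
print(pivoted == native)
```

   The row path does work proportional to the number of cells before any 
columnar consumer can use the data, while the columnar path hands each batch 
over unchanged.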
   
   Why (now):
   - There are mature native drivers for PostgreSQL, SQLite, DuckDB, Flight 
SQL, Snowflake, BigQuery, MySQL, SQL Server, Databricks and so on. They are 
also very easy to install (and locate) on a system with the 
[dbc](https://columnar.tech/dbc/) CLI tool.
   - There is now good support for invoking ADBC from Java via JNI bindings to 
the C++ ADBC driver manager (see the 
[blog](https://columnar.tech/blog/adbc-java/)). This makes it practical to 
integrate ADBC into Spark's JVM-based architecture. Technically, drivers can be 
implemented in Java as well, but the quality of the Java implementations is 
pretty low, so realistically one will almost always use a native driver.
   - ADBC fits well with Spark's columnar read support in Data Source V2. 
Generating ArrowColumnVectors from ADBC is pretty straightforward. This can 
benefit external Spark accelerators like Comet and (presumably) Photon. 
   
   I have a proof-of-concept implementation at 
[spark-adbc](https://github.com/tokoko/spark-adbc) that demonstrates the basic 
read path and includes some informal (not very scientific) benchmarks against 
JDBC. I'm willing to incrementally implement ADBC data source support upstream 
if there's interest from the community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

