Spark on Oracle is now available as an open source, Apache-licensed GitHub repo <https://github.com/oracle/spark-oracle>. Build and deploy it as an extension jar in your Spark clusters.
Use it to combine Apache Spark programs with data in your existing Oracle databases without expensive data copying or query-time data movement.

The core capability is a set of optimizer extensions that collapse SQL operator sub-graphs into an OraScan that executes equivalent SQL in Oracle. Physical plan parallelism <https://github.com/oracle/spark-oracle/wiki/Query-Splitting> can be controlled to split Spark tasks so that they operate on Oracle data block ranges, on result set pages, or on table partitions.

We push down large parts of Spark SQL to Oracle; for example, 95 of the 99 TPCDS queries <https://github.com/oracle/spark-oracle/wiki/TPCDS-Queries> are completely pushed to Oracle.

With Spark SQL macros <https://github.com/oracle/spark-oracle/wiki/Spark_SQL_macros> you can write custom Spark UDFs that get translated and pushed as Oracle SQL expressions.

With DML pushdown <https://github.com/oracle/spark-oracle/wiki/DML-Support>, inserts in Spark SQL get pushed as transactionally consistent inserts/updates on Oracle tables.

See the Quick Start Guide <https://github.com/oracle/spark-oracle/wiki/Quick-Start-Guide> for how to set up an Oracle free tier ADW instance, load it with TPCDS data, and try out the Spark on Oracle Demo <https://github.com/oracle/spark-oracle/wiki/Demo> on your Spark cluster.

More details can be found in our blog <https://hbutani.github.io/blogs/blog/Spark_on_Oracle_Blog.html> and the project wiki <https://github.com/oracle/spark-oracle/wiki>.

Regards,
Harish Butani