Hi, all. Apache Spark has always been a fast and general engine, and since SPARK-2883, Spark has supported Apache ORC inside the `sql/hive` module via a Hive dependency.
With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and gain several benefits:

- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which means full vectorization support.
- Stability: Apache ORC 1.4.0 already includes many fixes, and we can rely on the ORC community's effort going forward.
- Usability: Users can use `ORC` data sources without the Hive module (`-Phive`).
- Maintainability: Reduce the Hive dependency and eventually remove some old legacy code from the `sql/hive` module.

As a first step, I made a PR adding a new ORC data source to the `sql/core` module:

https://github.com/apache/spark/pull/17924 (+3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.