I would support moving ORC from sql/hive -> sql/core because it brings me one step closer to eliminating Hive from my Spark distribution by removing -Phive at build time.
On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun <dh...@hortonworks.com> wrote:

> Thank you again for coming and reviewing this PR.
>
> So far, we have discussed the following.
>
> 1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
>    - The `sql/core` module gives more benefit than `sql/hive`.
>    - The Apache ORC library (`no-hive` version) is a general and reasonably small library designed for non-Hive apps.
>
> 2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
>    - The previous #17980, #17924, and #17943 are the complete examples containing this PR.
>    - This PR is focusing on dependency only.
>
> 3. `Why don't we then create a separate orc module? Just copy a few of the files over?` (@rxin)
>    - The Apache ORC library is the same as most of the other data sources (CSV, JDBC, JSON, PARQUET, TEXT), which live inside `sql/core`.
>    - It's better to use it as a library instead of copying ORC files because the Apache ORC shaded jar has many files. We had better depend on the Apache ORC community's effort until an unavoidable reason for copying occurs.
>
> 4. `I do worry in the future whether ORC would bring in a lot more jars` (@rxin)
>    - The ORC core library's dependency tree is aggressively kept as small as possible. I've gone through and excluded unnecessary jars from our dependencies. I also kick back pull requests that add unnecessary new dependencies. (@omalley)
>
> 5. `In the long term, Spark should move to using only the vectorized reader in ORC's core` (@omalley)
>    - Of course.
>
> I’ve been waiting for new comments and discussion since last week.
> Apparently, there have been no further comments this week except the last comment (5) from Owen.
>
> Please give your opinion if you think we need some change to the current PR (as-is).
> FYI, there is one LGTM on the PR (as-is) and no -1 so far.
>
> Thank you again for supporting the new ORC improvement in Apache Spark.
>
> Bests,
> Dongjoon.
>
> *From: *Dong Joon Hyun <dh...@hortonworks.com>
> *Date: *Friday, August 4, 2017 at 8:05 AM
> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Cc: *Apache Spark PMC <priv...@spark.apache.org>
> *Subject: *Use Apache ORC in Apache Spark 2.3
>
> Hi, All.
>
> Apache Spark has always been a fast and general engine, and it has supported Apache ORC inside the `sql/hive` module, with a Hive dependency, since Spark 1.4.X (SPARK-2883).
> However, there are many open issues about `Feature parity for ORC with Parquet (SPARK-20901)` as of today.
>
> With the new Apache ORC 1.4 (released 8th May), Apache Spark is able to get the following benefits.
>
>   - Usability:
>     * Users can use `ORC` data sources without the hive module (-Phive), as with the `Parquet` format.
>
>   - Stability & Maintainability:
>     * ORC 1.4 already has many fixes.
>     * In the future, Spark can upgrade the ORC library independently from Hive (similar to the Parquet library, too).
>     * Eventually, reduce the dependency on old Hive 1.2.1.
>
>   - Speed:
>     * Last but not least, Spark can use both Spark `ColumnarBatch` and ORC `RowBatch` together, which means full vectorization support.
>
> First of all, I'd love to improve Apache Spark in the following steps in the time frame of Spark 2.3.
>
>   - SPARK-21422: Depend on Apache ORC 1.4.0
>   - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>   - SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
>   - SPARK-16060: Vectorized Orc Reader
>
> I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release, but the PRs seem to need more attention from the PMC since this is an important change.
> Since the discussion on the Apache Spark 2.3 cadence already started this week, I thought it’s the best time to ask you about this.
>
> Could any of you help me to proceed with the ORC improvement in the Apache Spark community?
>
> Please visit the minimal PR and JIRA issue as a starter.
>
>   - https://github.com/apache/spark/pull/18640
>   - https://issues.apache.org/jira/browse/SPARK-21422
>
> Thank you in advance.
>
> Bests,
> Dongjoon Hyun.