Re: Use Apache ORC in Apache Spark 2.3

Andrew Ash Thu, 10 Aug 2017 15:24:06 -0700

I would support moving ORC from sql/hive -> sql/core because it brings me
one step closer to eliminating Hive from my Spark distribution by removing
-Phive at build time.


On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun <dh...@hortonworks.com>
wrote:

> Thank you again for coming and reviewing this PR.
>
>
>
> So far, we discussed the followings.
>
>
>
> 1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
>
>    - `sql/core` module gives more benefit than `sql/hive`.
>
>    - Apache ORC library (`no-hive` version) is a general and resonably
> small library designed for non-hive apps.
>
>
>
> 2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
>
>    - The previous #17980 , #17924, and #17943 are the complete examples
> containing this PR.
>
>    - This PR is focusing on dependency only.
>
>
>
> 3. `Why don't we then create a separate orc module? Just copy a few of the
> files over?` (@rxin)
>
>    -  Apache ORC library is the same with most of other data sources(CSV,
> JDBC, JSON, PARQUET, TEXT) which live inside `sql/core`
>
>    - It's better to use as a library instead of copying ORC files because
> Apache ORC shaded jar has many files. We had better depend on Apache ORC
> community's effort until an unavoidable reason for copying occurs.
>
>
>
> 4. `I do worry in the future whether ORC would bring in a lot more jars`
> (@rxin)
>
>    - The ORC core library's dependency tree is aggressively kept as small
> as possible. I've gone through and excluded unnecessary jars from our
> dependencies. I also kick back pull requests that add unnecessary new
> dependencies. (@omalley)
>
>
>
> 5. `In the long term, Spark should move to using only the vectorized
> reader in ORC's core” (@omalley)
>
> - Of course.
>
>
>
> I’ve been waiting for new comments and discussion since last week.
>
> Apparently, there is no further comments except the last comment(5) from
> Owen in this week.
>
>
>
> Please give your opinion if you think we need some change on the current
> PR (as-is).
>
> FYI, there is one LGTM on the PR (as-is) and no -1 so far.
>
>
>
> Thank you again for supporting new ORC improvement in Apache Spark.
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
>
>
> *From: *Dong Joon Hyun <dh...@hortonworks.com>
> *Date: *Friday, August 4, 2017 at 8:05 AM
> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Cc: *Apache Spark PMC <priv...@spark.apache.org>
> *Subject: *Use Apache ORC in Apache Spark 2.3
>
>
>
> Hi, All.
>
>
>
> Apache Spark always has been a fast and general engine, and
>
> supports Apache ORC inside `sql/hive` module with Hive dependency since
> Spark 1.4.X (SPARK-2883).
>
> However, there are many open issues about `Feature parity for ORC with
> Parquet (SPARK-20901)` as of today.
>
>
>
> With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get
> the following benefits.
>
>
>
>     - Usability:
>
>         * Users can use `ORC` data sources without hive module (-Phive)
> like `Parquet` format.
>
>
>
>     - Stability & Maintanability:
>
>         * ORC 1.4 already has many fixes.
>
>         * In the future, Spark can upgrade ORC library independently from
> Hive
>            (similar to Parquet library, too)
>
>         * Eventually, reduce the dependecy on old Hive 1.2.1.
>
>
>
>     - Speed:
>
>         * Last but not least, Spark can use both Spark `ColumnarBatch` and
> ORC `RowBatch` together
>
>           which means full vectorization support.
>
>
>
> First of all, I'd love to improve Apache Spark in the following steps in
> the time frame of Spark 2.3.
>
>
>
>     - SPARK-21422: Depend on Apache ORC 1.4.0
>
>     - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>
>     - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
> sql/core
>
>     - SPARK-16060: Vectorized Orc Reader
>
>
>
> I’ve made above PRs since 9th May, the day after Apache ORC 1.4 release,
>
> but the PRs seems to need more attention of PMC since this is an important
> change.
>
> Since the discussion on Apache Spark 2.3 cadence is already started this
> week,
>
> I thought it’s a best time to ask you about this.
>
>
>
> Could anyone of you help me to proceed ORC improvement in Apache Spark
> community?
>
>
>
> Please visit the minimal PR and JIRA issue as a starter.
>
>
>
>    - https://github.com/apache/spark/pull/18640
>    - https://issues.apache.org/jira/browse/SPARK-21422
>
>
>
> Thank you in advance.
>
>
>
> Bests,
>
> Dongjoon Hyun.
>

Re: Use Apache ORC in Apache Spark 2.3

Reply via email to