Re: Parquet support (HIVE-5783)

Brock Noland Sun, 16 Feb 2014 19:42:26 -0800

Hi Gunther,

Please find my response inline.

On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gunt...@apache.org> wrote:
> I read through the ticket, patch and documentation

Thank you very much for reading through these items!

> and would like to
> suggest some changes.

There was ample time to suggest these changes prior to commit. The
JIRA was created three months ago, and the title you object to and the
patch was up there over two months ago.

> As far as I can tell this basically adds parquet SerDes to hive, but the
> file format remains external to hive. There is no way for hive devs to
> makes changes, fix bugs add, change datatypes, add features to parquet
> itself.

As stated in many locations including the JIRA discussed here, we
shouldn't be picking winner/loser file formats. We use many external
libraries, none of which, all Hive developers have the ability to
modify. For example most Hive developers do not have the ability to
modify Sequence File. Tez is also an external library which few Hive
developers can change.

> So:
>
> - I suggest we document it as one of the built-in SerDes and not as a
> native format like here:
> https://cwiki.apache.org/confluence/display/Hive/Parquet (and here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
> - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native
> support"

The change provides the ability to create a parquet table with Hive,
natively. Therefore I don't see the issue you have with the word
native.

> - I think we should revert the change to the grammar to allow "STORED AS
> PARQUET" until we have a mechanism to do that for all SerDes, i.e.: someone
> picks up: HIVE-5976. (I also don't think this actually works properly
> unless we bundle parquet in hive-exec, which I don't think we want.)

Again, you could have provided this feedback many moons ago. I am
personally interested in HIVE-5976 but it's orthogonal to this issue.
That change just makes it easier and cleaner to add STORED AS
keywords. The contributors of the Parquet integration are not required
to fix Hive. That is our job.

> - We should revert the deprecated classes (At least I don't understand how
> a first drop needs to add deprecated stuff)

The deprecated classes are shells (no actual code) to support existing
users of Parquet, of which there are many. I see no justification for
impacting existing users when the workaround is trivial and
non-impacting to any other user.

> In general though, I'm also confused on why adding this SerDe to the hive
> code base is beneficial. Seems to me that that just makes upgrading
> Parquet, bug fixing, etc more difficult by tying a SerDe release to a Hive
> release. To me that outweighs the benefit of a slightly more involved setup
> of Hive + serde in the cluster.

The Hive APIs, which are not clearly defined, have changed often in
the past few releases making maintaining a file format extremely
difficult. For example, 0.12 and 0.13 break most if not all external
code bases.

However, beyond that, the community felt it was beneficial to make
Parquet easier to use. If you are not interested in Parquet then
ignore it as this change does not impact you. Tez integration is
something which does not interest myself and many other Hive
developers. Indeed other than a few cursory reviews and a few times
where I championed the refactoring you guys were doing in order to
support Tez, I have ignored the Tez work.

Sincerely,
Brock

Re: Parquet support (HIVE-5783)

Reply via email to