Hi Gunther, Please find my response inline.
On Sat, Feb 15, 2014 at 5:52 PM, Gunther Hagleitner <gunt...@apache.org> wrote: > I read through the ticket, patch and documentation Thank you very much for reading through these items! > and would like to > suggest some changes. There was ample time to suggest these changes prior to commit. The JIRA was created three months ago, and the title you object to and the patch was up there over two months ago. > As far as I can tell this basically adds parquet SerDes to hive, but the > file format remains external to hive. There is no way for hive devs to > makes changes, fix bugs add, change datatypes, add features to parquet > itself. As stated in many locations including the JIRA discussed here, we shouldn't be picking winner/loser file formats. We use many external libraries, none of which, all Hive developers have the ability to modify. For example most Hive developers do not have the ability to modify Sequence File. Tez is also an external library which few Hive developers can change. > So: > > - I suggest we document it as one of the built-in SerDes and not as a > native format like here: > https://cwiki.apache.org/confluence/display/Hive/Parquet (and here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual) > - I vote for the jira to say "Add parquet SerDes to Hive" and not "Native > support" The change provides the ability to create a parquet table with Hive, natively. Therefore I don't see the issue you have with the word native. > - I think we should revert the change to the grammar to allow "STORED AS > PARQUET" until we have a mechanism to do that for all SerDes, i.e.: someone > picks up: HIVE-5976. (I also don't think this actually works properly > unless we bundle parquet in hive-exec, which I don't think we want.) Again, you could have provided this feedback many moons ago. I am personally interested in HIVE-5976 but it's orthogonal to this issue. That change just makes it easier and cleaner to add STORED AS keywords. The contributors of the Parquet integration are not required to fix Hive. That is our job. > - We should revert the deprecated classes (At least I don't understand how > a first drop needs to add deprecated stuff) The deprecated classes are shells (no actual code) to support existing users of Parquet, of which there are many. I see no justification for impacting existing users when the workaround is trivial and non-impacting to any other user. > In general though, I'm also confused on why adding this SerDe to the hive > code base is beneficial. Seems to me that that just makes upgrading > Parquet, bug fixing, etc more difficult by tying a SerDe release to a Hive > release. To me that outweighs the benefit of a slightly more involved setup > of Hive + serde in the cluster. The Hive APIs, which are not clearly defined, have changed often in the past few releases making maintaining a file format extremely difficult. For example, 0.12 and 0.13 break most if not all external code bases. However, beyond that, the community felt it was beneficial to make Parquet easier to use. If you are not interested in Parquet then ignore it as this change does not impact you. Tez integration is something which does not interest myself and many other Hive developers. Indeed other than a few cursory reviews and a few times where I championed the refactoring you guys were doing in order to support Tez, I have ignored the Tez work. Sincerely, Brock