On Tue, May 28, 2013 at 9:27 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> The question we are diving into is how much of Hive is going to be
> designed around edge cases? Hive really was not made for columnar formats
> or self-describing data types. For the most part it handles them fairly
> well.
>

I don't view columnar formats or self-describing data types as edge cases. I
think that in a couple of years the various columnar stores (ORC, Parquet, or
new ones) and text will be the primary formats. Given the performance
advantage of binary formats, text should only be used for staging tables.


> I am not sure what I believe about refactoring all of Hive's guts. How
> much refactoring and complexity are we going to add to support special
> cases? I do not think we can justify sweeping API changes for the sake of
> one new input format, or something that can be done in some other way.
>

The problem is actually much bigger. We have a wide range of nested
abstractions for input/output that all interact in various ways:

org.apache.hadoop.mapred.InputFormat
org.apache.hadoop.hive.ql.io.HiveInputFormat
org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
org.apache.hadoop.hive.serde2.SerDe
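
To make the layering concrete, here is a minimal sketch of the bottom of that
stack: the InputFormat produces Writable records, and the SerDe turns each one
into a row object described by an ObjectInspector. The class name and body are
invented for illustration, and the method signatures are from memory, so treat
this as a sketch of the current interface rather than a reference.

  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.serde2.SerDe;
  import org.apache.hadoop.hive.serde2.SerDeException;
  import org.apache.hadoop.hive.serde2.SerDeStats;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  // Hypothetical example class; only the serde2 interfaces are real.
  public class ExampleSerDe implements SerDe {
    private ObjectInspector inspector;

    public void initialize(Configuration conf, Properties tbl)
        throws SerDeException {
      // Table (and merged partition) properties arrive here, e.g. the
      // declared column names under the "columns" property.
      String columns = tbl.getProperty("columns");
      // ... build an ObjectInspector describing that schema ...
    }

    public Object deserialize(Writable blob) throws SerDeException {
      // Called for each record the InputFormat's RecordReader produces.
      return null; // should return a row matching the ObjectInspector
    }

    public ObjectInspector getObjectInspector() throws SerDeException {
      return inspector;
    }

    public Class<? extends Writable> getSerializedClass() {
      return Text.class;
    }

    public Writable serialize(Object row, ObjectInspector oi)
        throws SerDeException {
      return new Text(); // write path: turn the row back into a Writable
    }

    public SerDeStats getSerDeStats() {
      return null;
    }
  }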

I would suggest that there is a lot of confusion about the current state of
what is allowed and what will break things. Furthermore, because critical
functionality like accessing table properties, partition properties,
columnar projection, and predicate pushdown has been added incrementally,
it isn't at all clear to users what is available or how to take advantage
of it.
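
To make that last point concrete: as far as I can tell, the projection and
pushdown hints reach the input format through the JobConf under properties
like the ones below. The helper class and the property names are from memory,
so this is a sketch of where to look, not documentation of a stable API.

  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical helper; the property names are as I understand them today.
  public class ScanHints {
    public static void dump(JobConf job) {
      // Comma separated ids of the columns the query actually reads
      // (columnar projection), set by the planner.
      String readColumnIds = job.get("hive.io.file.readcolumn.ids");
      // Serialized filter expression (predicate pushdown), only set when
      // the input format / storage handler advertises it can use it.
      String filterExpr = job.get("hive.io.filter.expr.serialized");
      System.out.println("projected column ids: " + readColumnIds);
      System.out.println("filter pushed down: " + (filterExpr != null));
    }
  }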

-- Owen
