ORC, Parquet, and "new ones" are ... new. They do not constitute a huge portion of the user base if they constitute any at all.
I do see a case for what you are describing: currently there are input formats that pass properties to the task via the configuration. I also feel that some of the confusion you are describing centers around people building code with only one input format or one use case in mind. So when I see "ORC", "could benefit", and "refactor" in the same email, alarms go off. They may be false alarms, but if the feature does not immediately benefit two input formats, it's creating fragmentation and I am not behind it.

On Tue, May 28, 2013 at 1:31 PM, Owen O'Malley <omal...@apache.org> wrote:

> On Tue, May 28, 2013 at 9:27 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> The question we are diving into is how much of hive is going to be
>> designed around edge cases? Hive really was not made for columnar formats,
>> or self describing data-types. For the most part it handles them fairly
>> well.
>
> I don't view columnar or self describing data-types as an edge case. I
> think in a couple years, the various columnar stores (ORC, Parquet, or new
> ones) and text will be the primary formats. Given the performance advantage
> of binary formats, text should only be used for staging tables.
>
>> I am not sure what I believe about refactoring all of hive's guts. How
>> much refactoring and complexity are we going to add to support special
>> cases? I do not think we can justify sweeping API changes for the sake of
>> one new input format, or something that can be done in some other way.
>
> The problem is actually much bigger. We have a wide range of nested
> abstractions for input/output that all interact in various ways.
>
> org.apache.hadoop.mapred.InputFormat
> org.apache.hadoop.hive.ql.io.HiveInputFormat
> org.apache.hadoop.hive.ql.meta.HiveStorageHandler
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> org.apache.hadoop.hive.serde2.SerDe
>
> I would suggest that there is a lot of confusion about the current state
> of what is allowed and what will break things. Furthermore, because
> critical functionality like accessing table properties, partition
> properties, columnar projection, and predicate pushdown has been added
> incrementally, it isn't clear at all to users what is available and how
> to take advantage of them.
>
> -- Owen