On Tue, May 28, 2013 at 8:45 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> That does not really make sense. You're breaking the layered approach.
> InputFormats read/write data, serdes interpret data based on the table
> definition. It's like asking "Why can't my input format run assembly code?"
>

The current model of:

SerDe
Input/OutputFormat
FileSystem

works well for text formats, but otherwise limits the input/output formats
to handling binary data. That creates problems if the Input/OutputFormat
has an integrated serialization mechanism. For example, ORC requires its
own SerDe, and OrcSerde just passes the values along through serialize and
deserialize.
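
To make the pass-through problem concrete, here is a rough sketch. The
interface and class names (SketchSerDe, DelimitedTextSerDe,
PassThroughSerDe) are hypothetical simplifications for illustration and are
not Hive's actual SerDe API. It only shows that a text-style SerDe has real
work to do, while a SerDe sitting above a format with integrated
serialization has none.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a SerDe; NOT Hive's real interface.
interface SketchSerDe {
    Object deserialize(Object recordFromInputFormat);
    Object serialize(Object row);
}

// Text-style SerDe: the format hands over an opaque line and the SerDe
// gives it structure based on a (here hard-coded) comma delimiter.
class DelimitedTextSerDe implements SketchSerDe {
    public Object deserialize(Object record) {
        return Arrays.asList(((String) record).split(",", -1));
    }

    public Object serialize(Object row) {
        @SuppressWarnings("unchecked")
        List<String> fields = (List<String>) row;
        return String.join(",", fields);
    }
}

// Pass-through SerDe: the format already produced a structured row, so the
// SerDe layer has nothing left to interpret and returns its input unchanged.
class PassThroughSerDe implements SketchSerDe {
    public Object deserialize(Object record) { return record; }
    public Object serialize(Object row) { return row; }
}

class LayeredModelSketch {
    public static void main(String[] args) {
        SketchSerDe text = new DelimitedTextSerDe();
        System.out.println(text.deserialize("1,owen"));   // prints [1, owen]

        // Assumption: the format with integrated serialization hands us a row.
        Object structuredRow = Arrays.asList(1, "owen");
        SketchSerDe passThrough = new PassThroughSerDe();
        System.out.println(passThrough.deserialize(structuredRow) == structuredRow); // prints true
    }
}
```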

Also note that other formats like SequenceFile are restricted because the
SerDe is placed above the FileFormat. Hive's SequenceFile input format
discards the key and requires the value to be Text or BytesWritable. That
covers many cases, but certainly not all. On the other hand, if it were
Hive's SequenceFile InputFormat that was creating the ObjectInspector, it
could actually handle more complex types and let Hive usefully read a wider
range of SequenceFiles.

I propose pushing SerDes down into the Input/OutputFormats that can be
parameterized by the serialization. Using SerDes with
TextInput/OutputFormat and HBaseTableInput/OutputFormat makes a lot of
sense, but for formats in general it does not.
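
A minimal sketch of what that inversion could look like follows. Everything
here (RowFormat, TextRowFormat, KeyValueRowFormat, and a column-name list
standing in for an ObjectInspector) is a hypothetical name chosen for
illustration, not an existing Hive or Hadoop API. The idea is only that the
format describes and produces its own rows, and takes a deserializer as a
parameter when it needs one.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical interface: the format returns structured rows and describes
// them itself. columnNames() stands in for an ObjectInspector.
interface RowFormat {
    List<String> columnNames();
    List<Object> readRow(Object stored);
}

// A format that is parameterized by a serialization, as a text format would be.
class TextRowFormat implements RowFormat {
    private final List<String> columns;
    private final Function<String, List<Object>> deserializer;

    TextRowFormat(List<String> columns, Function<String, List<Object>> deserializer) {
        this.columns = columns;
        this.deserializer = deserializer;
    }

    public List<String> columnNames() { return columns; }

    public List<Object> readRow(Object stored) {
        return deserializer.apply((String) stored);
    }
}

// A key/value format that needs no SerDe and keeps the key instead of
// discarding it. A Map.Entry stands in for a stored key/value record.
class KeyValueRowFormat implements RowFormat {
    public List<String> columnNames() { return Arrays.asList("key", "value"); }

    public List<Object> readRow(Object stored) {
        Map.Entry<?, ?> entry = (Map.Entry<?, ?>) stored;
        return Arrays.<Object>asList(entry.getKey(), entry.getValue());
    }
}

class PushedDownSerDeSketch {
    public static void main(String[] args) {
        RowFormat text = new TextRowFormat(
                Arrays.asList("id", "name"),
                line -> new ArrayList<Object>(Arrays.asList(line.split(","))));
        System.out.println(text.readRow("1,owen"));                         // prints [1, owen]

        RowFormat kv = new KeyValueRowFormat();
        System.out.println(kv.columnNames());                               // prints [key, value]
        System.out.println(kv.readRow(new SimpleEntry<>(7L, "payload")));   // prints [7, payload]
    }
}
```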

-- Owen
