[ https://issues.apache.org/jira/browse/HIVE-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966341#comment-13966341 ]
Justin Coffey commented on HIVE-6784:
-------------------------------------

You've cited a "lazy" serde. Parquet is not "lazy". It is similar to ORC. Have a look at ORC's deserialize() method (org.apache.hadoop.hive.ql.io.orc.OrcSerde):

{code}
  @Override
  public Object deserialize(Writable writable) throws SerDeException {
    return writable;
  }
{code}

A quick look through the ORC code indicates to me that they don't do any reparsing (though I might have missed something). Looking through other serdes, not a single one (that I checked) reparses values. Value parsing is handled in ObjectInspectors (poke around org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils).

In my opinion, the *substantial* performance penalty that you are introducing with this patch is going to be a much bigger negative for Parquet adoption than obliging people to rebuild their data set in the rare event that a type has to change. And if you do need to change a type, insert overwrite table is a good workaround.

-1

> parquet-hive should allow column type change
> ---------------------------------------------
>
>                 Key: HIVE-6784
>                 URL: https://issues.apache.org/jira/browse/HIVE-6784
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, Serializers/Deserializers
>    Affects Versions: 0.13.0
>            Reporter: Tongjie Chen
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6784.1.patch.txt, HIVE-6784.2.patch.txt
>
>
> See also the following Parquet issue:
> https://github.com/Parquet/parquet-mr/issues/323
>
> Currently, if we change a Parquet-format Hive table using "alter table parquet_table change c1 c1 bigint" (assuming the original type of c1 is int), queries fail at runtime with an exception thrown from the SerDe: "org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable". This differs from Hive's behavior with other file formats, where it tries to perform a cast (producing null for incompatible types).
>
> Parquet Hive's RecordReader returns an ArrayWritable (based on the schema stored in the footers of the Parquet files); ParquetHiveSerDe also creates a corresponding ArrayWritableObjectInspector (but using column type info from the metastore). Whenever there is a column type change, the object inspector will throw an exception, since WritableLongObjectInspector cannot inspect an IntWritable, etc.
>
> Conversion has to happen somewhere if we want to allow type changes, and the SerDe's deserialize method seems a natural place for it. Currently, the serialize method calls createStruct (and then createPrimitive) for every record, but it creates a new object regardless, which seems expensive. I think that could be optimized a bit by returning the object passed in if it is already of the right type. deserialize also reuses this method; if there is a type change, a new object has to be created, which I think is inevitable.
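As an illustration of the approach the comment argues for (conversion handled at inspection time rather than in deserialize()), here is a minimal sketch. It is not part of either attached patch; the class and method names are invented for the example, and only the Writable classes are real Hadoop types. In a real object inspector this logic would sit behind LongObjectInspector.get(Object).

{code}
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

// Hypothetical helper, not from HIVE-6784: widen int to bigint on access
// instead of rebuilding every record during deserialization.
public final class LenientLongAccess {

  // Returns the column value as a long regardless of whether the Parquet
  // reader produced an IntWritable or a LongWritable for this column.
  static long asLong(Object writable) {
    if (writable instanceof LongWritable) {
      return ((LongWritable) writable).get();
    }
    if (writable instanceof IntWritable) {
      // The metastore says bigint but the file still stores int32:
      // widen here, once, at inspection time, with no new Writable allocated.
      return ((IntWritable) writable).get();
    }
    throw new IllegalArgumentException(
        "Unexpected value class: " + writable.getClass().getName());
  }

  private LenientLongAccess() {
  }
}
{code}

Either way something has to reconcile the file schema with the metastore schema; the sketch only moves that cost from a per-record copy in deserialize() to the values actually read.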