Ratandeep Ratti created HIVE-17394:
--------------------------------------
Summary: AvroSerde is regenerating TypeInfo objects for each
nullable Avro field in a row
Key: HIVE-17394
URL: https://issues.apache.org/jira/browse/HIVE-17394
Project: Hive
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Ratandeep Ratti
Two methods in {{AvroDeserializer}} regenerate TypeInfo objects for every
nullable field in every row:
{code}
private Object deserializeNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema) throws AvroSerdeException {
  // elided
  // line 312:
  return worker(datum, fileSchema, newRecordSchema,
      SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
}
..
private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
    Schema recordSchema)
  // elided
  // line 357:
  return worker(datum, currentFileSchema, schema,
      SchemaToTypeInfo.generateTypeInfo(schema, null));
{code}
This is really bad for performance. I'm not sure why we don't use the
TypeInfo we already have instead of generating it again for each nullable field.
If you look at the {{worker}} method, which calls {{deserializeNullableUnion}},
the TypeInfo corresponding to the nullable field's column is already
determined. I'm not sure why we have to determine that information again.
Moreover, the cache in SchemaToTypeInfo does not help in the nullable Avro
record case, because checking whether an Avro record schema object already
exists in the cache requires traversing all the fields in the record schema.
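To illustrate why a cache hit can still be expensive, here is a minimal toy sketch. It does not reproduce Avro's or Hive's actual classes: {{ToyRecordSchema}} and {{SchemaCacheCostDemo}} are hypothetical names, and the structural {{hashCode}}/{{equals}} below is only an assumption about how a schema-keyed lookup might behave. It counts how many fields are touched when the same record schema is looked up repeatedly in a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record schema; NOT org.apache.avro.Schema.
final class ToyRecordSchema {
    static long fieldVisits = 0; // fields touched by hashCode()/equals()

    private final List<String> fieldNames;

    ToyRecordSchema(List<String> fieldNames) {
        this.fieldNames = new ArrayList<>(fieldNames);
    }

    @Override
    public int hashCode() {
        int h = 1;
        for (String name : fieldNames) { // structural hash walks every field
            fieldVisits++;
            h = 31 * h + name.hashCode();
        }
        return h;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ToyRecordSchema)) return false;
        List<String> theirs = ((ToyRecordSchema) other).fieldNames;
        if (theirs.size() != fieldNames.size()) return false;
        for (int i = 0; i < fieldNames.size(); i++) { // structural comparison
            fieldVisits++;
            if (!fieldNames.get(i).equals(theirs.get(i))) return false;
        }
        return true;
    }
}

final class SchemaCacheCostDemo {
    // Returns how many fields were touched while serving `lookups` cache hits.
    static long visitsForCacheHits(int fieldCount, int lookups) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            names.add("f" + i);
        }
        ToyRecordSchema schema = new ToyRecordSchema(names);

        Map<ToyRecordSchema, String> cache = new HashMap<>();
        cache.put(schema, "typeinfo-placeholder");

        ToyRecordSchema.fieldVisits = 0; // ignore the cost of the initial put
        for (int i = 0; i < lookups; i++) {
            cache.get(schema); // a hit, but the key is still hashed field by field
        }
        return ToyRecordSchema.fieldVisits;
    }

    public static void main(String[] args) {
        System.out.println("fields touched: " + visitsForCacheHits(100, 1000));
    }
}
```

In this toy model the work per lookup grows with the number of fields in the record, so a wide record that is looked up once per nullable field per row pays that cost repeatedly even though every lookup is a hit.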
I've attached a profiling snapshot, which shows that most of the time is being
spent in the cache.
One way of fixing this, IMO, is to make use of the column TypeInfo that is
already passed into the {{worker}} method.
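A rough sketch of that direction, using toy stand-ins rather than Hive's real classes: {{ReuseTypeInfoSketch}}, its string-based "schemas" and "type infos", and all method names below are hypothetical. It assumes the column TypeInfo held by the caller already describes the non-null branch of the union, so it can be forwarded unchanged. The counter shows how many times type information is derived per row under each approach.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: strings stand in for Avro schemas and Hive TypeInfo objects.
final class ReuseTypeInfoSketch {
    static int generateCalls = 0; // how often type info was (re)derived

    // Stand-in for the expensive schema-to-type-info step.
    static String generateTypeInfo(String schema) {
        generateCalls++;
        return "typeinfo(" + schema + ")";
    }

    // Stand-in for the conversion step; the real conversion is elided.
    static Object worker(Object datum, String typeInfo) {
        return datum;
    }

    // Behaviour described in the report: derive type info again per nullable field.
    static Object nullableRegenerating(Object datum, String branchSchema) {
        return worker(datum, generateTypeInfo(branchSchema));
    }

    // Suggested direction: forward the column type info the caller already holds.
    static Object nullableReusing(Object datum, String columnTypeInfo) {
        return worker(datum, columnTypeInfo);
    }

    // Returns the number of generateTypeInfo calls made while processing rows.
    static int callsForRows(int rows, boolean reuse) {
        List<String> columnSchemas = Arrays.asList("string", "int", "long");

        // Column type info is determined once, before any row is read.
        String[] columnTypeInfos = new String[columnSchemas.size()];
        for (int c = 0; c < columnTypeInfos.length; c++) {
            columnTypeInfos[c] = generateTypeInfo(columnSchemas.get(c));
        }

        generateCalls = 0; // count only the per-row work
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnTypeInfos.length; c++) {
                Object datum = "row" + r;
                if (reuse) {
                    nullableReusing(datum, columnTypeInfos[c]);
                } else {
                    nullableRegenerating(datum, columnSchemas.get(c));
                }
            }
        }
        return generateCalls;
    }

    public static void main(String[] args) {
        System.out.println("regenerating: " + callsForRows(1000, false));
        System.out.println("reusing:      " + callsForRows(1000, true));
    }
}
```

In this sketch the regenerating path derives type information once per nullable field per row, while the reusing path derives none during row processing. Whether the same substitution is safe in {{AvroDeserializer}} depends on the column TypeInfo matching the non-null branch, which would need to be checked against the actual code.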