[ https://issues.apache.org/jira/browse/HIVE-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163121#comment-16163121 ]
Carl Steinbach commented on HIVE-17394:
---------------------------------------

Nice catch! +1. Will commit if tests pass.

> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for every row
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-17394
>                 URL: https://issues.apache.org/jira/browse/HIVE-17394
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0, 3.0.0
>            Reporter: Ratandeep Ratti
>            Assignee: Anthony Hsu
>         Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png, HIVE-17394.1.patch
>
>
> The following methods in {{AvroDeserializer}} regenerate {{TypeInfo}}
> objects for every nullable field in every row:
> {code}
> private Object deserializeNullableUnion(Object datum, Schema fileSchema, Schema recordSchema) throws AvroSerdeException {
>   // elided
>   line 312: return worker(datum, fileSchema, newRecordSchema, SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
> }
> ..
> private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema, Schema recordSchema)
>   // elided
>   line 357: return worker(datum, currentFileSchema, schema, SchemaToTypeInfo.generateTypeInfo(schema, null));
> {code}
> This is really bad in terms of performance. I'm not sure why we didn't use
> the TypeInfo we already have instead of generating it again for each nullable
> field. If you look at the {{worker}} method, which calls
> {{deserializeNullableUnion}}, the TypeInfo for the nullable field's column is
> already determined there.
> Moreover, the cache in the {{SchemaToTypeInfo}} class does not help in the
> nullable Avro record case, because checking whether an Avro record schema
> object already exists in the cache requires traversing all the fields in the
> record schema. I've attached a profiling snapshot which shows that most of the
> time is being spent in the cache.
>
> One way of fixing this, IMO, might be to use the column TypeInfo that is
> already passed into the {{worker}} method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
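
To illustrate the fix direction suggested in the quoted description, here is a minimal, self-contained sketch. It is not the content of HIVE-17394.1.patch and it does not use Hive's or Avro's real classes: `ColumnType`, `generateColumnType`, `readRegenerating`, and `readReusing` are hypothetical stand-ins, and a simple call counter stands in for the cost of the schema-to-type conversion. It only contrasts converting a type for every field of every row with resolving each column's type once and passing it down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only: none of these are Hive's or Avro's real classes.
class TypeInfoReuseSketch {

    /** Toy column type, standing in for an already resolved TypeInfo. */
    static final class ColumnType {
        final String name;
        ColumnType(String name) { this.name = name; }
    }

    /** Counts how often the (assumed expensive) schema-to-type conversion runs. */
    static int conversions = 0;

    /** Toy conversion from an Avro-style type name to a column type. */
    static ColumnType generateColumnType(String avroType) {
        conversions++;
        return new ColumnType("long".equals(avroType) ? "bigint" : avroType);
    }

    /** Pattern described in the issue: convert again for every field of every row. */
    static int readRegenerating(List<String> fieldTypes, int rows) {
        conversions = 0;
        for (int r = 0; r < rows; r++) {
            for (String avroType : fieldTypes) {
                consume(generateColumnType(avroType));
            }
        }
        return conversions;
    }

    /** Suggested direction: resolve each column type once, then pass it down. */
    static int readReusing(List<String> fieldTypes, int rows) {
        conversions = 0;
        List<ColumnType> resolved = new ArrayList<>();
        for (String avroType : fieldTypes) {
            resolved.add(generateColumnType(avroType));
        }
        for (int r = 0; r < rows; r++) {
            for (ColumnType type : resolved) {
                consume(type);
            }
        }
        return conversions;
    }

    /** Placeholder for the per-field deserialization work that needs the type. */
    static void consume(ColumnType type) {
        if (type.name.isEmpty()) {
            throw new IllegalStateException("unresolved type");
        }
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("long", "string", "double");
        System.out.println("regenerating: " + readRegenerating(fields, 1000));
        System.out.println("reusing: " + readReusing(fields, 1000));
    }
}
```

With three fields and 1000 rows, the first variant performs 3000 conversions and the second performs 3. The sketch deliberately leaves out the cache lookup cost mentioned in the description; it only shows why passing an already resolved type avoids the per-row work.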