Zhu Li created HIVE-14130:
-----------------------------

             Summary: Performance 
                 Key: HIVE-14130
                 URL: https://issues.apache.org/jira/browse/HIVE-14130
             Project: Hive
          Issue Type: Improvement
          Components: HCatalog
            Reporter: Zhu Li
            Assignee: Zhu Li


1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls 
getPosition(fieldName) to get the index of a field in a row. Each call also 
invokes toLowerCase() on the String fieldName. The cost is negligible when 
the data size is small, but when it is huge, the repeated toLowerCase() calls 
on the same set of field names waste time. Storing the indices of the column 
names in the HCatRecordReader class, or storing lower-case field names in 
outputSchema, would improve efficiency. 
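
The first idea in item 1 (resolving each column index once instead of on 
every row) might look roughly like the sketch below. SketchSchema and 
CachedPositions are hypothetical stand-ins written for illustration, not the 
real HCatalog classes; only the lower-casing lookup behaviour is modelled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a schema; NOT the real HCatalog schema class.
final class SketchSchema {
    private final Map<String, Integer> positions = new HashMap<>();

    SketchSchema(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
        }
    }

    // Models the cost described above: the name is lower-cased on every call.
    int getPosition(String fieldName) {
        Integer pos = positions.get(fieldName.toLowerCase(Locale.ROOT));
        if (pos == null) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return pos;
    }
}

// Hypothetical helper: resolves every output column once, up front.
final class CachedPositions {
    private final int[] indices;

    CachedPositions(SketchSchema schema, List<String> outputFields) {
        indices = new int[outputFields.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = schema.getPosition(outputFields.get(i));
        }
    }

    // Per-row lookup is now a plain array read, with no String work.
    int positionOf(int outputColumn) {
        return indices[outputColumn];
    }
}
```

With this shape, the per-row loop iterates over output column numbers and 
reads positionOf(i), so toLowerCase() runs once per column rather than once 
per column per row.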

2. HCatRecordReader.java creates a new DefaultHCatRecord instance for every 
incoming row of data, which wastes time. Adding a private DefaultHCatRecord 
field to this class and reusing it for each new row would reduce some 
overhead.
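
A minimal sketch of the record reuse proposed in item 2, assuming every input 
row has exactly the expected number of columns. SketchRecord and 
ReusingReader are hypothetical stand-ins, not DefaultHCatRecord or 
HCatRecordReader themselves.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a mutable record; NOT the real DefaultHCatRecord.
final class SketchRecord {
    private final Object[] fields;

    SketchRecord(int width) {
        this.fields = new Object[width];
    }

    void set(int index, Object value) {
        fields[index] = value;
    }

    Object get(int index) {
        return fields[index];
    }
}

// Hypothetical reader that allocates one record and overwrites it per row.
final class ReusingReader {
    private final Iterator<Object[]> rows;
    private final int width;
    private final SketchRecord current; // allocated once, reused for all rows

    ReusingReader(List<Object[]> input, int width) {
        this.rows = input.iterator();
        this.width = width;
        this.current = new SketchRecord(width);
    }

    // Advances to the next row by overwriting the fields of the shared record.
    boolean nextKeyValue() {
        if (!rows.hasNext()) {
            return false;
        }
        Object[] raw = rows.next();
        for (int i = 0; i < width; i++) {
            current.set(i, raw[i]);
        }
        return true;
    }

    SketchRecord getCurrentValue() {
        return current;
    }
}
```

One design caveat: reuse is only safe if callers finish with (or copy) the 
current record before advancing, because the same object is overwritten on 
the next row.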

3. The method serializePrimitiveField in HCatRecordSerDe.java invokes 
HCatContext.INSTANCE.getConf() repeatedly. According to JProfiler results, 
this also causes some overhead. Adding a static boolean field to 
HCatRecordSerDe.java that stores HCatContext.INSTANCE.getConf().isPresent(), 
and a static Configuration field that stores the result of 
HCatContext.INSTANCE.getConf(), would also reduce overhead.
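
The caching in item 3 could be sketched as follows. Everything here is 
hypothetical: SketchContext stands in for HCatContext, java.util.Properties 
stands in for Configuration, java.util.Optional stands in for whatever 
getConf() really returns, and the "sketch.promote" key and the 
short-to-int conversion are invented purely to give the method something to 
decide.

```java
import java.util.Optional;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a context object; NOT the real HCatContext.
final class SketchContext {
    static final AtomicInteger LOOKUPS = new AtomicInteger();
    private static final Properties PROPS = new Properties();

    static {
        PROPS.setProperty("sketch.promote", "true"); // invented key
    }

    // Counts calls so the effect of caching can be observed.
    static Optional<Properties> getConf() {
        LOOKUPS.incrementAndGet();
        return Optional.of(PROPS);
    }
}

// Hypothetical SerDe that reads the configuration once, at class load time.
final class SketchSerDe {
    private static final Optional<Properties> CONF = SketchContext.getConf();
    private static final boolean CONF_PRESENT = CONF.isPresent();
    private static final boolean PROMOTE = CONF_PRESENT
            && Boolean.parseBoolean(CONF.get().getProperty("sketch.promote", "false"));

    // Per-field path: consults only the cached flags, never the context.
    static Object serializePrimitiveField(Object value) {
        if (PROMOTE && value instanceof Short) {
            return ((Short) value).intValue();
        }
        return value;
    }
}
```

A trade-off worth noting: static fields are populated once per class load, 
so this approach would not pick up a configuration that is set or changed 
afterwards.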

In my test on a cluster, these modifications saved about 80 seconds when 
HCatalog was used to load a table of 1 billion rows by 40 columns with 
various data types. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
