> On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
> > Do you have after-optimization performance numbers? Can you add a test to
> > verify that the reencoder cache is working correctly? Feed in a record
> > with one uuid, then another with a different and verify that the cache has
> > two elements. Adding a third record with the original UUID shouldn't
> > increase the size of the cache. Also, that adding n records all with the
> > same schema creates only one reencoder...

Yes, we have numbers after the optimization. For example, each record used to take nearly 50 microseconds; after this patch, it takes nearly 31 microseconds. I added the test case as proposed.


> On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
> > serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, line 66
> > <https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line66>
> >
> > verifiedRecordReaders -> noReencodingNeeded ?

Done


> On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
> > serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java, line 155
> > <https://reviews.apache.org/r/12480/diff/1/?file=320688#file320688line155>
> >
> > readability: pull out getRecordReaderID into its own var

Done


> On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
> > serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java, line 78
> > <https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line78>
> >
> > Need to write out the uuid too

Done


> On July 12, 2013, 10:44 p.m., Jakob Homan wrote:
> > serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java, line 92
> > <https://reviews.apache.org/r/12480/diff/1/?file=320689#file320689line92>
> >
> > Need to read in the uuid too

Done


- Mohammad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review23113
-----------------------------------------------------------


On July 11, 2013, 10:31 p.m., Mohammad Islam wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
>
> (Updated July 11, 2013, 10:31 p.m.)
>
>
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
>
>
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> From our performance analysis, we found that AvroSerde's schema.equals()
> call consumed a substantial amount (nearly 40%) of the time. This patch
> intends to minimize the number of schema.equals() calls by deferring the
> check as late as possible and performing it as rarely as possible.
>
> First, we added a unique ID for each record reader, which is then included
> in every AvroGenericRecordWritable. Then, we introduced two new data
> structures (one HashSet and one HashMap) to store intermediate data and
> avoid duplicate checks. The HashSet contains the IDs of all record readers
> that don't need any re-encoding. The HashMap, on the other hand, contains
> the re-encoders that have already been used. It works as a cache and allows
> re-encoders to be reused. With this change, our test shows nearly a 40%
> reduction in Avro record reading time.
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb
>
> Diff: https://reviews.apache.org/r/12480/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Mohammad Islam
>
>
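
For anyone following the thread without the diff open, the caching scheme from the description can be sketched roughly as below. This is only an illustration, not the code in the patch: the class ReencoderCacheSketch, its Reencoder stand-in, and the lookup() method are hypothetical names, schemas are modeled as plain strings instead of Avro Schema objects, and keying the HashMap by the record reader's UUID is an assumption based on the test described in the review comment.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of the HashSet + HashMap scheme; not the actual patch.
final class ReencoderCacheSketch {

    // Stand-in for a schema re-encoder (illustrative only).
    static final class Reencoder {
        final String writerSchema;
        final String readerSchema;

        Reencoder(String writerSchema, String readerSchema) {
            this.writerSchema = writerSchema;
            this.readerSchema = readerSchema;
        }
    }

    // IDs of record readers whose records need no re-encoding.
    private final Set<UUID> noReencodingNeeded = new HashSet<>();
    // Re-encoders already built, keyed by record reader ID (assumed key).
    private final Map<UUID, Reencoder> reencoderCache = new HashMap<>();
    // Counts how often the expensive schema comparison actually runs.
    private int schemaComparisons = 0;

    // Returns the re-encoder for this reader, or null if none is needed.
    Reencoder lookup(UUID recordReaderId, String writerSchema, String readerSchema) {
        if (noReencodingNeeded.contains(recordReaderId)) {
            return null; // already verified, skip the comparison
        }
        Reencoder cached = reencoderCache.get(recordReaderId);
        if (cached != null) {
            return cached; // reuse the existing re-encoder
        }
        schemaComparisons++; // first record from this reader: compare once
        if (writerSchema.equals(readerSchema)) {
            noReencodingNeeded.add(recordReaderId);
            return null;
        }
        Reencoder created = new Reencoder(writerSchema, readerSchema);
        reencoderCache.put(recordReaderId, created);
        return created;
    }

    int cacheSize() {
        return reencoderCache.size();
    }

    int schemaComparisons() {
        return schemaComparisons;
    }

    public static void main(String[] args) {
        ReencoderCacheSketch cache = new ReencoderCacheSketch();
        UUID first = UUID.randomUUID();
        UUID second = UUID.randomUUID();
        cache.lookup(first, "v1", "v2");
        cache.lookup(second, "v1", "v2");
        cache.lookup(first, "v1", "v2"); // repeat of the first reader ID
        System.out.println("cache size: " + cache.cacheSize());
        System.out.println("comparisons: " + cache.schemaComparisons());
    }
}
```

The main method follows the test case requested in the review: two distinct reader IDs produce two cache entries, and a third record carrying the original ID adds nothing. The point of the two structures is that the schema comparison runs at most once per record reader rather than once per record.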