----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30281/#review72053 -----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java <https://reviews.apache.org/r/30281/#comment117985> Will the keyValue ever be null? I don't think map implementations ever return nulls here. ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java <https://reviews.apache.org/r/30281/#comment117995> Like the test for PARQUET-113 structure, this should test the old structure, using "bag" and "array_element" for the names. ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java <https://reviews.apache.org/r/30281/#comment117992> I think part of the reasoning behind this was to make sure if the Parquet code is passed a schema with required fields, it will correctly write out required fields rather than rejecting them. That might be the case in the future (if we add an underlying schema, for example), so I think this should continue to use required. ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java <https://reviews.apache.org/r/30281/#comment117994> Part of the purpose is to validate that the write code accepts schemas with the new structure mandated by PARQUET-113. This would be a good place, so I think this should actually be "array" => "list" and "element" for elements. That way we're testing the new structure exactly. - Ryan Blue On Feb. 11, 2015, 3:19 p.m., Sergio Pena wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30281/ > ----------------------------------------------------------- > > (Updated Feb. 11, 2015, 3:19 p.m.) > > > Review request for hive, Ryan Blue, cheng xu, and Dong Chen. > > > Bugs: HIVE-9333 > https://issues.apache.org/jira/browse/HIVE-9333 > > > Repository: hive-git > > > Description > ------- > > This patch moves the ParquetHiveSerDe.serialize() implementation to > DataWritableWriter class in order to save time in materializing data on > serialize(). > > > Diffs > ----- > > > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java > ea4109d358f7c48d1e2042e5da299475de4a0a29 > > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java > 9199127735533f9a324c5ef456786dda10766c46 > > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java > 060b1b722d32f3b2f88304a1a73eb249e150294b > > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java > 1d83bf31a3dbcbaa68b3e75a72cec2ec67e7faa5 > > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java > e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 > > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java > a693aff18516d133abf0aae4847d3fe00b9f1c96 > > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java > 667d3671547190d363107019cd9a2d105d26d336 > ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java > 007a665529857bcec612f638a157aa5043562a15 > serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetHiveRecord.java > PRE-CREATION > > Diff: https://reviews.apache.org/r/30281/diff/ > > > Testing > ------- > > The tests run were the following: > > 1. JMH (Java microbenchmark) > > This benchmark called parquet serialize/write methods using text writable > objects. > > Class.method Before Change (ops/s) After Change (ops/s) > > ------------------------------------------------------------------------------- > ParquetHiveSerDe.serialize: 19,113 249,528 -> > 19x speed increase > DataWritableWriter.write: 5,033 5,201 -> > 3.34% speed increase > > > 2. Write 20 million rows (~1GB file) from Text to Parquet > > I wrote a ~1Gb file in Textfile format, then convert it to a Parquet format > using the following > statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text; > > Time (s) it took to write the whole file BEFORE changes: 93.758 s > Time (s) it took to write the whole file AFTER changes: 83.903 s > > It got a 10% of speed inscrease. > > > Thanks, > > Sergio Pena > >