Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Ryan Blue Wed, 11 Feb 2015 15:41:15 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review72053
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117985>

    Will the keyValue ever be null? I don't think map implementations ever 
return nulls here.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117995>

    Like the test for PARQUET-113 structure, this should test the old 
structure, using "bag" and "array_element" for the names.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117992>

    I think part of the reasoning behind this was to make sure if the Parquet 
code is passed a schema with required fields, it will correctly write out 
required fields rather than rejecting them. That might be the case in the 
future (if we add an underlying schema, for example), so I think this should 
continue to use required.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117994>

    Part of the purpose is to validate that the write code accepts schemas with 
the new structure mandated by PARQUET-113. This would be a good place, so I 
think this should actually be "array" => "list" and "element" for elements. 
That way we're testing the new structure exactly.


- Ryan Blue


On Feb. 11, 2015, 3:19 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
> 
> (Updated Feb. 11, 2015, 3:19 p.m.)
> 
> 
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
> 
> 
> Bugs: HIVE-9333
>     https://issues.apache.org/jira/browse/HIVE-9333
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> This patch moves the ParquetHiveSerDe.serialize() implementation to 
> DataWritableWriter class in order to save time in materializing data on 
> serialize().
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
>  ea4109d358f7c48d1e2042e5da299475de4a0a29 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 
> 9199127735533f9a324c5ef456786dda10766c46 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
>  060b1b722d32f3b2f88304a1a73eb249e150294b 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
>  1d83bf31a3dbcbaa68b3e75a72cec2ec67e7faa5 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java
>  e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
> a693aff18516d133abf0aae4847d3fe00b9f1c96 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java
>  667d3671547190d363107019cd9a2d105d26d336 
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 
> 007a665529857bcec612f638a157aa5043562a15 
>   serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetHiveRecord.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/30281/diff/
> 
> 
> Testing
> -------
> 
> The tests run were the following:
> 
> 1. JMH (Java microbenchmark)
> 
> This benchmark called parquet serialize/write methods using text writable 
> objects. 
> 
> Class.method                  Before Change (ops/s)      After Change (ops/s) 
>       
> -------------------------------------------------------------------------------
> ParquetHiveSerDe.serialize:          19,113                   249,528   ->  
> 19x speed increase
> DataWritableWriter.write:             5,033                     5,201   ->  
> 3.34% speed increase
> 
> 
> 2. Write 20 million rows (~1GB file) from Text to Parquet
> 
> I wrote a ~1Gb file in Textfile format, then convert it to a Parquet format 
> using the following
> statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
> 
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes: 83.903 s
> 
> It got a 10% of speed inscrease.
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Reply via email to