[ 
https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634790#comment-14634790
 ] 

Dong Chen commented on HIVE-8128:
---------------------------------

Patch V6 updated. Review board: https://reviews.apache.org/r/36540/

The patch depends on the new Parquet vector API at 
https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector

In this POC, the general workflow is in place, two tests pass, and only the 
INT type is supported so far. The idea is to create a 
VectorizedParquetRecordReader that wraps the ParquetRecordReader provided by 
Parquet. In its next() method, we convert each Parquet RowBatch into a Hive 
VectorizedRowBatch.
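
To illustrate the wrapping idea only, here is a minimal, self-contained sketch. 
It does not use the real Parquet or Hive classes: SourceBatch, TargetBatch, 
WrappedReader and VectorizedReader are hypothetical stand-ins for the Parquet 
RowBatch, the Hive VectorizedRowBatch, the ParquetRecordReader and the 
VectorizedParquetRecordReader, and their fields and methods are assumptions 
made for this example rather than the API used in the patch. It handles a 
single INT column, widened to long on the assumption that the Hive side keeps 
integer values in a long-typed column.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins only; none of these are real Parquet or Hive classes.
final class VectorizedReaderSketch {

    /** Stand-in for a Parquet row batch holding a single INT column. */
    static final class SourceBatch {
        final int[] values;
        SourceBatch(int... values) { this.values = values; }
    }

    /** Stand-in for a Hive vectorized batch with one long-typed column. */
    static final class TargetBatch {
        final long[] column;
        int size; // number of valid rows currently in the batch
        TargetBatch(int capacity) { this.column = new long[capacity]; }
    }

    /** Stand-in for the wrapped reader: returns null once it is exhausted. */
    static final class WrappedReader {
        private final Iterator<SourceBatch> batches;
        WrappedReader(List<SourceBatch> batches) { this.batches = batches.iterator(); }
        SourceBatch nextBatch() { return batches.hasNext() ? batches.next() : null; }
    }

    /** The wrapper: each next() call converts one source batch into the target batch. */
    static final class VectorizedReader {
        private final WrappedReader inner;
        VectorizedReader(WrappedReader inner) { this.inner = inner; }

        boolean next(TargetBatch out) {
            SourceBatch in = inner.nextBatch();
            if (in == null) {
                out.size = 0; // no more data
                return false;
            }
            if (in.values.length > out.column.length) {
                throw new IllegalStateException("source batch exceeds target capacity");
            }
            // Copy the INT column, widening each value to long.
            for (int i = 0; i < in.values.length; i++) {
                out.column[i] = in.values[i];
            }
            out.size = in.values.length;
            return true;
        }
    }

    public static void main(String[] args) {
        WrappedReader inner = new WrappedReader(
                Arrays.asList(new SourceBatch(1, 2, 3), new SourceBatch(4, 5)));
        VectorizedReader reader = new VectorizedReader(inner);
        TargetBatch batch = new TargetBatch(1024);
        while (reader.next(batch)) {
            System.out.println(Arrays.toString(Arrays.copyOf(batch.column, batch.size)));
        }
    }
}
```

The target batch is allocated once and reused across next() calls, so the 
size field, not the array length, tells the caller how many rows are valid.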

This is the first patch. To complete the vectorization feature, the following 
work remains for follow-ups: 1) support all data types; 2) support partition 
columns; 3) add more test cases; 4) evaluate performance on a real cluster.

> Improve Parquet Vectorization
> -----------------------------
>
>                 Key: HIVE-8128
>                 URL: https://issues.apache.org/jira/browse/HIVE-8128
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
>             Fix For: parquet-branch
>
>         Attachments: HIVE-8128-parquet.patch.POC, HIVE-8128.1-parquet.patch
>
>
> NO PRECOMMIT TESTS
> What we'll want to do is finish the vectorization work (e.g. 
> VectorizedOrcSerde, VectorizedOrcSerde) which was partially done in HIVE-5998.
> As discussed in PARQUET-131, we will work out a Hive POC based on the new 
> Parquet vectorized API, and then finish the implementation once the API is 
> finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
