[ https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634790#comment-14634790 ]
Dong Chen commented on HIVE-8128:
---------------------------------

Patch V6 updated. Review board: https://reviews.apache.org/r/36540/

The patch depends on the new Parquet vector API at https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector

In this POC, the general workflow is in place, two tests pass, and the INT type is supported.

The idea is to create a VectorizedParquetRecordReader that wraps the ParquetRecordReader provided by Parquet. Its next() method then converts each Parquet RowBatch into a Hive VectorizedRowBatch.

This is the first patch. To complete the vectorization feature, the following follow-up work remains:
1) support all data types
2) support partition columns
3) add more test cases
4) evaluate performance on a real cluster.

> Improve Parquet Vectorization
> -----------------------------
>
>                 Key: HIVE-8128
>                 URL: https://issues.apache.org/jira/browse/HIVE-8128
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
>             Fix For: parquet-branch
>
>         Attachments: HIVE-8128-parquet.patch.POC, HIVE-8128.1-parquet.patch
>
>
> NO PRECOMMIT TESTS
> What we'll want to do is finish the vectorization work (e.g. VectorizedOrcSerde)
> which was partially done in HIVE-5998.
> As discussed in PARQUET-131, we will work out a Hive POC based on the new
> Parquet vectorized API, and then finish the implementation after the API is finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)