[ 
https://issues.apache.org/jira/browse/HIVE-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624378#comment-14624378
 ] 

Dong Chen commented on HIVE-8128:
---------------------------------

Hi [~nezihyigitbasi], I updated and ran the Hive POC against the latest changes in 
your repo: https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector
Everything looks good. Thanks. 

During development, I had some thoughts about the vector API. Could you take a 
look at them?

* In {{ColumnVector}}, how about adding two attributes? One is {{boolean 
noNulls}}, which indicates whether the whole column vector contains no null values. 
The other is {{boolean isRepeating}}, which indicates whether the same value 
repeats for the whole column vector. Both could be calculated while 
we read a vector. 
The reason we want them is that the Hive vector engine can check these attributes to 
skip some values. It might be better to calculate them once in Parquet 
instead of revisiting the vectors in Hive. (I am not sure whether other 
engines need this, but it should be OK for Parquet to support it.)
* In {{RowBatch}}, how about adding one attribute, {{int size}}, which indicates 
the number of rows in the batch? This is just for ease of use. Its value should 
be the same as {{RowBatch.columns\[0\].numValues}}.
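
To make the two suggestions concrete, here is a rough sketch of what I have in mind. It is not the actual Parquet API: {{IntColumnVector}}, {{read}}, {{batchOf}}, {{sum}}, and the {{values}}/{{isNull}} arrays are hypothetical names I made up for illustration. Only {{noNulls}}, {{isRepeating}}, {{size}}, {{columns}}, and {{numValues}} come from the proposal above. The flags are computed in the same pass that fills the vector, and a consumer then uses them to pick a cheaper loop.

```java
import java.util.Objects;

class VectorSketch {
    // Hypothetical stand-in for a Parquet ColumnVector holding ints.
    static class IntColumnVector {
        int[] values;
        boolean[] isNull;     // assumed per-entry null markers
        int numValues;
        boolean noNulls;      // proposed: true when no entry is null
        boolean isRepeating;  // proposed: true when every entry equals entry 0
    }

    // Hypothetical stand-in for RowBatch.
    static class RowBatch {
        IntColumnVector[] columns;
        int size;             // proposed: same as columns[0].numValues
    }

    // Fills a vector and derives both flags in the same single pass.
    static IntColumnVector read(Integer[] source) {
        IntColumnVector v = new IntColumnVector();
        v.numValues = source.length;
        v.values = new int[source.length];
        v.isNull = new boolean[source.length];
        v.noNulls = true;
        v.isRepeating = source.length > 0;
        for (int i = 0; i < source.length; i++) {
            if (source[i] == null) {
                v.isNull[i] = true;
                v.noNulls = false;
            } else {
                v.values[i] = source[i];
            }
            if (i > 0 && !Objects.equals(source[i], source[0])) {
                v.isRepeating = false;
            }
        }
        return v;
    }

    static RowBatch batchOf(IntColumnVector... columns) {
        RowBatch b = new RowBatch();
        b.columns = columns;
        b.size = columns.length == 0 ? 0 : columns[0].numValues;
        return b;
    }

    // Example consumer: sums the non-null values of one column,
    // taking a fast path whenever a flag allows it.
    static long sum(RowBatch batch, int col) {
        IntColumnVector v = batch.columns[col];
        if (v.isRepeating) {
            // One value stands for the whole vector.
            return v.isNull[0] ? 0L : (long) v.values[0] * batch.size;
        }
        long total = 0;
        if (v.noNulls) {
            // No per-entry null check needed.
            for (int i = 0; i < batch.size; i++) total += v.values[i];
        } else {
            for (int i = 0; i < batch.size; i++) {
                if (!v.isNull[i]) total += v.values[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        RowBatch batch = batchOf(
            read(new Integer[] {7, 7, 7}),
            read(new Integer[] {1, null, 3}));
        System.out.println(batch.size + " " + sum(batch, 0) + " " + sum(batch, 1));
    }
}
```

With the flags set by the reader, the engine side never has to rescan a vector just to learn whether it contains nulls or a single repeated value, and {{batch.size}} saves reaching into {{columns\[0\]}} in every loop.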

What do you think?

> Improve Parquet Vectorization
> -----------------------------
>
>                 Key: HIVE-8128
>                 URL: https://issues.apache.org/jira/browse/HIVE-8128
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Brock Noland
>            Assignee: Dong Chen
>             Fix For: parquet-branch
>
>         Attachments: HIVE-8128-parquet.patch.POC, HIVE-8128.1-parquet.patch
>
>
> NO PRECOMMIT TESTS
> What we'll want to do is finish the vectorization work (e.g. VectorizedOrcSerde, 
> VectorizedOrcSerde) which was partially done in HIVE-5998.
> As discussed in PARQUET-131, we will work out a Hive POC based on the new 
> Parquet vectorized API, and then finish the implementation once it is finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
