[ https://issues.apache.org/jira/browse/HIVE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374395#comment-15374395 ]

Ferdinand Xu commented on HIVE-13873:
-------------------------------------

Thanks [~xuefuz] for your review.

{quote}
1. nested column pruning should go beyond just the select op or groupby op. 
{quote}
Good catch. I will take this into consideration in my next patch.
{quote}
2. Secondly, there may need to be a consolidation/merging process in determining 
the final read schema. For instance,
{noformat}
select msg from t where msg.a='x';
{noformat}
In this case, the projected column should be just msg rather than msg + msg.a.
{quote}
OK, the logic will be to first check whether a sub-attribute will be filtered. 
If so, the other attributes within the same struct will not be filtered out. 
I will update the patch as well.
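Just to spell that consolidation out, here is a minimal sketch in plain Java (the class/method names and the dotted-path representation are only illustrative, not the actual patch): a child path such as msg.a is dropped whenever an ancestor path such as msg is already requested, so the query above ends up projecting only msg.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class NestedPathMerger {
  /**
   * Consolidates requested column paths: a child path (e.g. "msg.a") is
   * redundant when an ancestor path (e.g. "msg") is already requested,
   * because reading the ancestor already covers the child.
   */
  public static List<String> consolidate(Collection<String> requestedPaths) {
    List<String> result = new ArrayList<>();
    for (String path : requestedPaths) {
      boolean coveredByAncestor = false;
      for (String other : requestedPaths) {
        if (!other.equals(path) && path.startsWith(other + ".")) {
          coveredByAncestor = true;  // e.g. "msg.a" is covered by "msg"
          break;
        }
      }
      if (!coveredByAncestor && !result.contains(path)) {
        result.add(path);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // select msg from t where msg.a = 'x'  =>  only "msg" should be read
    System.out.println(consolidate(Arrays.asList("msg", "msg.a")));  // prints [msg]
  }
}
{code}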

{quote}
3. While it's fine to support just struct at first, we may need to consider how 
to find a more extensible way to pass the projected fields to the reader to 
support other types (array and map). I have no idea on this, so I'd love to 
hear your thoughts.
{quote}
Good suggestion; I have considered it before. Taking array as an example, it 
will generate a schema like this:
{noformat}
optional group max_nested_map (LIST) {
  repeated group bag {
    optional group array_element (LIST) {
      required binary key (UTF8);
    }
  }
}
{noformat}
The bag and array_element groups are generated on the Parquet side, so Hive will 
not be aware of that part of the column path. One approach I am considering is 
to serialize the typeinfo and types that were used to generate the full schema, 
and use them to generate the requested schema. More investigation is needed 
anyway; maybe we could make it happen in another ticket.
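
To illustrate the column-path problem, here is a toy sketch in plain Java (the Node class, method names, and hard-coded wrapper names are assumptions mirroring the schema above, not Parquet or Hive API): the bag and array_element levels exist only in the Parquet file schema, so a walk that matches a Hive column path against it has to skip those synthetic levels without consuming a path segment.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of a Parquet group tree, used only to illustrate the problem:
 * the "bag"/"array_element" wrapper groups are added on the Parquet side,
 * so a Hive column path such as max_nested_map.key has no segment for them
 * and the walk has to skip those synthetic levels.
 */
public class WrapperSkippingPruner {
  static class Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    Node(String name, Node... kids) {
      this.name = name;
      children.addAll(Arrays.asList(kids));
    }
  }

  // Wrapper names that Parquet inserts for LIST types and Hive never sees.
  private static boolean isSyntheticWrapper(Node n) {
    return n.name.equals("bag") || n.name.equals("array_element");
  }

  /** Returns true if the Hive path segments (wrapper-free) resolve in the tree. */
  static boolean resolves(Node node, List<String> path) {
    if (path.isEmpty()) {
      return true;
    }
    for (Node child : node.children) {
      if (isSyntheticWrapper(child)) {
        // Skip the Parquet-generated level without consuming a path segment.
        if (resolves(child, path)) {
          return true;
        }
      } else if (child.name.equals(path.get(0))
          && resolves(child, path.subList(1, path.size()))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Mirrors the schema shown above.
    Node schema = new Node("root",
        new Node("max_nested_map",
            new Node("bag",
                new Node("array_element",
                    new Node("key")))));
    // The Hive-side path has no "bag"/"array_element" segments.
    System.out.println(resolves(schema, Arrays.asList("max_nested_map", "key")));  // true
  }
}
{code}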

> Column pruning for nested fields
> --------------------------------
>
>                 Key: HIVE-13873
>                 URL: https://issues.apache.org/jira/browse/HIVE-13873
>             Project: Hive
>          Issue Type: New Feature
>          Components: Logical Optimizer
>            Reporter: Xuefu Zhang
>            Assignee: Ferdinand Xu
>         Attachments: HIVE-13873.wip.patch
>
>
> Some columnar file formats such as Parquet store fields of struct types 
> column by column as well, using the encoding described in the Google Dremel 
> paper. It's very common in big data for data to be stored in structs while 
> queries only need a subset of the fields in the structs. However, Hive 
> presently still needs to read the whole struct regardless of whether all 
> fields are selected. Therefore, pruning unwanted sub-fields in structs or 
> nested fields at file reading time would be a big performance boost for such 
> scenarios.


