[ 
https://issues.apache.org/jira/browse/HIVE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374184#comment-15374184
 ] 

Xuefu Zhang commented on HIVE-13873:
------------------------------------

[~Ferd], thanks for working on this. I went through the patch and it looks good 
for an initial cut. I have a few preliminary thoughts to share with you:

1. Nested column pruning should go beyond just the select and group-by 
operators. For instance, 
{code}
select msg.a from t where msg.b = 'x';
{code}
In this case, the Parquet reader should read only a and b from the msg field 
(msg.b is referenced only in the filter). Thus, I think we need to consider 
expressions from more operators.

2. Secondly, we may need a consolidation/merging step when determining the 
final read schema. For instance,
{code}
select msg from t where msg.a='x';
{code}
In this case, the projected column should be just msg rather than msg + msg.a.

3. While it's fine to support just struct at first, we should consider a more 
extensible way to pass the projected fields to the reader so that other types 
(array and map) can be supported later. I don't have a concrete proposal for 
this yet, so I'd love to hear your thoughts.
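
To make points 1 and 2 a bit more concrete, here is a rough Python sketch of 
the idea. It is not Hive code: the dotted-path representation, the operator 
labels, and the names parse_path, consolidate and projected_paths are all made 
up for illustration. It gathers nested field references from every operator 
and then drops any path whose ancestor is already projected.

```python
from typing import Dict, Iterable, List, Tuple

# A nested field reference such as "msg.a" is modeled as ("msg", "a").
# This representation is an assumption made for the sketch only.
Path = Tuple[str, ...]


def parse_path(ref: str) -> Path:
    """Split a dotted reference like 'msg.a' into its path components."""
    return tuple(ref.split("."))


def consolidate(paths: Iterable[Path]) -> List[Path]:
    """Keep only paths that have no ancestor (proper prefix) in the set."""
    unique = set(paths)
    kept = [
        p for p in unique
        if not any(p[:i] in unique for i in range(1, len(p)))
    ]
    return sorted(kept)


def projected_paths(refs_by_operator: Dict[str, List[str]]) -> List[str]:
    """Merge the references of all operators into one list of read paths."""
    all_paths = [
        parse_path(ref)
        for refs in refs_by_operator.values()
        for ref in refs
    ]
    return [".".join(p) for p in consolidate(all_paths)]


# select msg.a from t where msg.b = 'x';
query1 = projected_paths({"select": ["msg.a"], "filter": ["msg.b"]})

# select msg from t where msg.a = 'x';
query2 = projected_paths({"select": ["msg"], "filter": ["msg.a"]})
```

For the first query this yields msg.a and msg.b; for the second it yields only 
msg, because msg.a is already covered by its parent.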


> Column pruning for nested fields
> --------------------------------
>
>                 Key: HIVE-13873
>                 URL: https://issues.apache.org/jira/browse/HIVE-13873
>             Project: Hive
>          Issue Type: New Feature
>          Components: Logical Optimizer
>            Reporter: Xuefu Zhang
>            Assignee: Ferdinand Xu
>         Attachments: HIVE-13873.wip.patch
>
>
> Some columnar file formats such as Parquet also store the fields of a struct 
> column by column, using the encoding described in the Google Dremel paper. 
> It's very common in big data for data to be stored in structs while queries 
> need only a subset of the fields in those structs. However, Hive currently 
> still needs to read the whole struct regardless of whether all fields are 
> selected. Therefore, pruning unwanted sub-fields in struct or nested fields 
> at file reading time would be a big performance boost for such scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
