[ 
https://issues.apache.org/jira/browse/HIVE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated HIVE-13873:
----------------------------
    Description: 
This is the groundwork for nested column pruning in Hive for the Parquet 
format. In this patch, we address the case of struct types in select 
statements. In particular, for queries such as:
{code}
select s.a from tbl
{code}
where {{tbl}} has schema:
{code}
s:struct<a:int, b:boolean, c:array<int>>
{code}
then only the field {{a}} should be scanned by the Parquet reader, while 
fields {{b}} and {{c}} can be ignored.
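
As a rough illustration of the intended behavior (not the Hive or Parquet 
implementation), the Python sketch below prunes a nested schema down to the 
fields a query references. The dict-based schema representation and the helper 
name prune_schema are assumptions made only for this example.

```python
import copy


def prune_schema(schema, selected_paths):
    """Return a copy of `schema` that keeps only the fields named by the
    dotted paths in `selected_paths` (e.g. "s.a").

    `schema` is a hypothetical representation: a dict maps field names to
    either a type string (leaf) or another dict (struct).
    """
    pruned = {}
    for path in selected_paths:
        src, dst = schema, pruned
        parts = path.split(".")
        for i, name in enumerate(parts):
            if not isinstance(src, dict) or name not in src:
                raise KeyError(f"unknown field: {path}")
            node = src[name]
            if i == len(parts) - 1:
                # Last path component: keep the whole subtree under it.
                dst[name] = copy.deepcopy(node)
            else:
                if dst.get(name) == node:
                    # The enclosing struct is already fully selected.
                    break
                # Descend, creating the partial struct if needed.
                dst = dst.setdefault(name, {})
                src = node
    return pruned


# Schema of tbl from the description: s:struct<a:int, b:boolean, c:array<int>>
tbl_schema = {"s": {"a": "int", "b": "boolean", "c": "array<int>"}}

# "select s.a from tbl" references only s.a, so b and c are dropped.
requested = prune_schema(tbl_schema, ["s.a"])
print(requested)
```

A reader that is handed the pruned schema instead of the full table schema 
only needs to materialize the column for {{s.a}}.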

Future work includes supporting other types of statements, as well as more 
combinations of types (e.g., selecting fields of array type inside a struct 
type).

  was:Some columnar file formats, such as Parquet, also store struct fields 
column by column, using the encoding described in Google's Dremel paper. It is 
very common in big data for data to be stored in structs while queries need 
only a subset of the fields in those structs. However, Hive currently still 
needs to read the whole struct regardless of whether all fields are selected. 
Therefore, pruning unwanted sub-fields of structs or nested fields at file 
reading time would be a big performance boost in such scenarios.


> Support column pruning for struct fields in select statement
> ------------------------------------------------------------
>
>                 Key: HIVE-13873
>                 URL: https://issues.apache.org/jira/browse/HIVE-13873
>             Project: Hive
>          Issue Type: New Feature
>          Components: Logical Optimizer
>            Reporter: Xuefu Zhang
>            Assignee: Ferdinand Xu
>         Attachments: HIVE-13873.1.patch, HIVE-13873.2.patch, 
> HIVE-13873.3.patch, HIVE-13873.4.patch, HIVE-13873.5.patch, 
> HIVE-13873.6.patch, HIVE-13873.patch, HIVE-13873.wip.patch
>
>
> This is the groundwork for nested column pruning in Hive for the Parquet 
> format. In this patch, we address the case of struct types in select 
> statements. In particular, for queries such as:
> {code}
> select s.a from tbl
> {code}
> where {{tbl}} has schema:
> {code}
> s:struct<a:int, b:boolean, c:array<int>>
> {code}
> then only the field {{a}} should be scanned by the Parquet reader, 
> while fields {{b}} and {{c}} can be ignored.
> Future work includes supporting other types of statements, as well as more 
> combinations of types (e.g., selecting fields of array type inside a struct 
> type).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
