[ 
https://issues.apache.org/jira/browse/HIVE-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472993#comment-16472993
 ] 

ASF GitHub Bot commented on HIVE-12898:
---------------------------------------

GitHub user ashish-kumar-sharma opened a pull request:

    https://github.com/apache/hive/pull/346

    HIVE-12898: First commit

    1. Predicate Pushdown For Nested field
    
    1.1 Objective
    
    In ORC (Optimized Row Columnar), every primitive-type column has an index.
    A predicate refers to a column named in the WHERE clause, and pushdown
    means skipping row groups, stripes, and blocks during a read by comparing
    the predicate against the metadata stored in the stripes. The metadata
    consists of the min, max, and sum of the values present in the given column.
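
A minimal sketch of the skipping idea described above, assuming only min/max
statistics are kept per row group. The class and method names are hypothetical
illustrations and are not ORC reader APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of min/max based skipping; not ORC's actual classes.
final class RowGroupSkipSketch {

    /** Min/max statistics recorded for one row group of an integer column. */
    static final class Stats {
        final long min;
        final long max;

        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** True when at least one row in the group could satisfy `col > literal`. */
    static boolean mightMatchGreaterThan(Stats stats, long literal) {
        return stats.max > literal;
    }

    /** Returns the indexes of the row groups that still have to be read. */
    static List<Integer> groupsToRead(List<Stats> groups, long literal) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            if (mightMatchGreaterThan(groups.get(i), literal)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<Stats> groups = Arrays.asList(
                new Stats(0, 9), new Stats(5, 42), new Stats(-3, 10));
        // Only the second group can contain a value greater than 10.
        System.out.println(groupsToRead(groups, 10)); // prints [1]
    }
}
```

A group is skipped only when its statistics prove that no row can match, so
the filter still has to be evaluated on the rows of every group that is kept.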
    
    Currently, predicate pushdown works only for top-level columns of the
    schema. This change extends predicate pushdown to nested structures in Hive.
    
    
    1.2 Current state - 
     
    1.2.1 Schema
    struct<col1:int,col2:bigint,col3:struct<col4:int,col5:struct<col6:int>,col7:string>>
     
    1.2.2 Query 
    select col3.col5.col6 from table where col3.col5.col6 > 10;
     
    1.2.3 Conf 
    hive.io.filter.expr.serialized = "ASdni2enalfkncwjnlsdnfrnqwoglqernmgkqrg";
    hive.io.filter.text = "where c.e.f > 10";
     
    1.2.4 Predicate pushdown not supported for nested fields
     
    An ExprNodeGenericFuncDesc object wrapping an ExprNodeFieldDesc is
    generated, serialized, and stored in hive.io.filter.expr.serialized.
    
    However, when the ExprNodeGenericFuncDesc object is parsed to generate the
    searchArg in ConvertAstToSearchArg(), there is a strict
    instanceof ExprNodeColumnDesc check on the expression. Because of this
    check, SearchArgument creation is skipped entirely.
    
    
    1.2.5 Result - 
    
    builder.literal(SearchArgument.TruthValue.YES_NO_NULL);
    
    1.3 Expected state - 
    
    1.3.1 Schema
    struct<col1:int,col2:bigint,col3:struct<col4:int,col5:struct<col6:int>,col7:string>>
     
    1.3.2 Query
    select col3.col5.col6 from table where col3.col5.col6 > 10;
     
    1.3.3 Conf
    hive.io.filter.expr.serialized = "ASdni2enalfkncwjnlsdnfrnqwoglqernmgkqrg";
    hive.io.filter.text = "where c.e.f > 10";
     
    1.3.4 Predicate pushdown supported for nested fields
     
    An ExprNodeGenericFuncDesc object wrapping an ExprNodeFieldDesc is
    generated, serialized, and stored in hive.io.filter.expr.serialized.
    
    When the ExprNodeGenericFuncDesc object is parsed to generate the searchArg
    in ConvertAstToSearchArg(), there should also be a check for
    ExprNodeFieldDesc and a separate parsing path that converts the field name
    to a column ID and generates PredicateLeaf nodes.
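
The field-name-to-column-ID step can be sketched as follows. ORC numbers the
columns of a schema by a pre-order walk of the type tree, with the root struct
as column 0. Everything else here (the Node class, assignIds, resolve) is a
hypothetical stand-in and is not the code in this pull request or a Hive API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: resolve a dotted field path to a pre-order column ID.
final class NestedColumnIdSketch {

    /** One node of the schema tree; leaves have no children. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        int id = -1;

        Node(String name, Node... kids) {
            this.name = name;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Assigns IDs in pre-order and returns the next unused ID. */
    static int assignIds(Node node, int nextId) {
        node.id = nextId++;
        for (Node child : node.children) {
            nextId = assignIds(child, nextId);
        }
        return nextId;
    }

    /** Returns the column ID for a path such as "col3.col5.col6", or -1. */
    static int resolve(Node root, String dottedPath) {
        Node current = root;
        for (String part : dottedPath.split("\\.")) {
            Node next = null;
            for (Node child : current.children) {
                if (child.name.equals(part)) {
                    next = child;
                    break;
                }
            }
            if (next == null) {
                return -1; // unknown field: no pushdown possible
            }
            current = next;
        }
        return current.id;
    }

    /** The schema from section 1.3.1, with IDs already assigned. */
    static Node exampleSchema() {
        Node root = new Node("<root>",
                new Node("col1"),
                new Node("col2"),
                new Node("col3",
                        new Node("col4"),
                        new Node("col5", new Node("col6")),
                        new Node("col7")));
        assignIds(root, 0);
        return root;
    }

    public static void main(String[] args) {
        System.out.println(resolve(exampleSchema(), "col3.col5.col6")); // prints 6
    }
}
```

With the column ID in hand, a leaf for the nested field can be checked against
the same per-column statistics that are used for top-level columns.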
    
    1.3.5 Result
    
    leaf-0 = (LESS_THAN c.e.f 10), expr = (not leaf-0)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Flipkart/hive nestedppd

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #346
    
----
commit f3e46b62b4fab6877f2373c49d933ebe7119ec2f
Author: Aashish Kumar Sharma <aashish.s@...>
Date:   2018-05-12T08:47:39Z

    HIVE-12898: First commit

----


> Hive should support ORC block skipping on nested fields
> -------------------------------------------------------
>
>                 Key: HIVE-12898
>                 URL: https://issues.apache.org/jira/browse/HIVE-12898
>             Project: Hive
>          Issue Type: Improvement
>          Components: ORC
>    Affects Versions: 0.14.0, 1.2.1
>            Reporter: Michael Haeusler
>            Assignee: Ashish Sharma
>            Priority: Major
>              Labels: pull-request-available
>
> Hive supports predicate pushdown (block skipping) for ORC tables only on 
> top-level fields. Hive should also support block skipping on nested fields 
> (within structs).
> Example top-level: the following query selects 0 rows, using a predicate on 
> top-level column foo. We also see 0 INPUT_RECORDS in the summary:
> {code:sql}
> SET hive.tez.exec.print.summary=true;
> CREATE TABLE t_toplevel STORED AS ORC AS SELECT 23 AS foo;
> SELECT * FROM t_toplevel WHERE foo=42 ORDER BY foo;
> [...]
> VERTICES  TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
> Map 1               1                0             0              1.22            2,640             102              0               0
> {code}
> Example nested: the following query also selects 0 rows, but using a 
> predicate on nested column foo.bar. Unfortunately we see 1 INPUT_RECORDS in 
> the summary:
> {code:sql}
> SET hive.tez.exec.print.summary=true;
> CREATE TABLE t_nested STORED AS ORC AS SELECT NAMED_STRUCT('bar', 23) AS foo;
> SELECT * FROM t_nested WHERE foo.bar=42 ORDER BY foo;
> [...]
> VERTICES  TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
> Map 1               1                0             0              3.66            5,210              68              1               0
> {code}



