[ https://issues.apache.org/jira/browse/HIVE-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472993#comment-16472993 ]
ASF GitHub Bot commented on HIVE-12898:
---------------------------------------

GitHub user ashish-kumar-sharma opened a pull request:

    https://github.com/apache/hive/pull/346

    HIVE-12898: First commit

    1. Predicate Pushdown for Nested Fields

    1.1 Objective
    In ORC (Optimized Row Columnar), every primitive-type column carries an
    index. The predicate is the condition placed on a column in the WHERE
    clause, and pushdown means skipping row groups, stripes and blocks while
    reading by comparing the predicate against the metadata stored in the
    stripes. The metadata consists of the max, min and sum values present in
    the given column. Currently predicate pushdown only works for top-level
    columns of the schema. This change extends predicate pushdown to nested
    structures in Hive.

    1.2 Current state

    1.2.1 Schema
    struct<col1:int,col2:bigint,col3:struct<col4:int,col5:struct<col6:int>,col7:string>>

    1.2.2 Query
    select col3.col5.col6 from table where col3.col5.col6 > 10;

    1.2.3 Conf
    hive.io.filter.expr.serialized = "ASdni2enalfkncwjnlsdnfrnqwoglqernmgkqrg";
    hive.io.filter.text = "where c.e.f > 10";

    1.2.4 Pushdown
    Predicate pushdown is not supported for nested fields. Hive generates an
    ExprNodeGenericFuncDesc object whose column operand is of type
    ExprNodeFieldDesc; it is serialized and stored in
    hive.io.filter.expr.serialized. But while parsing the
    ExprNodeGenericFuncDesc object to generate the SearchArgument,
    ConvertAstToSearchArg() strictly checks that the operand is an instance
    of ExprNodeColumnDesc. Because a nested field fails that check,
    SearchArgument creation is skipped completely.
    1.2.5 Result
    builder.literal(SearchArgument.TruthValue.YES_NO_NULL);

    1.3 Expected state

    1.3.1 Schema
    struct<col1:int,col2:bigint,col3:struct<col4:int,col5:struct<col6:int>,col7:string>>

    1.3.2 Query
    select col3.col5.col6 from table where col3.col5.col6 > 10;

    1.3.3 Conf
    hive.io.filter.expr.serialized = "ASdni2enalfkncwjnlsdnfrnqwoglqernmgkqrg";
    hive.io.filter.text = "where c.e.f > 10";

    1.3.4 Pushdown
    Predicate pushdown is supported for nested fields. Hive generates an
    ExprNodeGenericFuncDesc object whose column operand is of type
    ExprNodeFieldDesc; it is serialized and stored in
    hive.io.filter.expr.serialized. While parsing the ExprNodeGenericFuncDesc
    object to generate the SearchArgument, ConvertAstToSearchArg() should
    also check for ExprNodeFieldDesc and follow a separate parsing path that
    converts the field name to a column ID and generates PredicateLeaf nodes.

    1.3.5 Result
    leaf-0 = (LESS_THAN c.e.f 10), expr = (not leaf-0)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Flipkart/hive nestedppd

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #346

----
commit f3e46b62b4fab6877f2373c49d933ebe7119ec2f
Author: Aashish Kumar Sharma <aashish.s@...>
Date:   2018-05-12T08:47:39Z

    HIVE-12898: First commit

----

> Hive should support ORC block skipping on nested fields
> -------------------------------------------------------
>
>                 Key: HIVE-12898
>                 URL: https://issues.apache.org/jira/browse/HIVE-12898
>             Project: Hive
>          Issue Type: Improvement
>          Components: ORC
>    Affects Versions: 0.14.0, 1.2.1
>            Reporter: Michael Haeusler
>            Assignee: Ashish Sharma
>            Priority: Major
>              Labels: pull-request-available
>
> Hive supports predicate pushdown (block skipping) for ORC tables only on
> top-level fields.
> Hive should also support block skipping on nested fields
> (within structs).
> Example top-level: the following query selects 0 rows, using a predicate on
> top-level column foo. We also see 0 INPUT_RECORDS in the summary:
> {code:sql}
> SET hive.tez.exec.print.summary=true;
> CREATE TABLE t_toplevel STORED AS ORC AS SELECT 23 AS foo;
> SELECT * FROM t_toplevel WHERE foo=42 ORDER BY foo;
> [...]
> VERTICES  TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
> Map 1               1                0             0              1.22            2,640             102              0               0
> {code}
> Example nested: the following query also selects 0 rows, but using a
> predicate on nested column foo.bar. Unfortunately we see 1 INPUT_RECORDS in
> the summary:
> {code:sql}
> SET hive.tez.exec.print.summary=true;
> CREATE TABLE t_nested STORED AS ORC AS SELECT NAMED_STRUCT('bar', 23) AS foo;
> SELECT * FROM t_nested WHERE foo.bar=42 ORDER BY foo;
> [...]
> VERTICES  TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
> Map 1               1                0             0              3.66            5,210              68              1               0
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
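
For readers following the pull request description (sections 1.2.4 and 1.3.4), here is a minimal, self-contained sketch of the idea: instead of accepting only a top-level column reference, walk a chain of struct field accesses down to its root column and build a dotted column path. It is not the code from the pull request. The types ColumnRef, FieldRef and Other are hypothetical stand-ins for Hive's ExprNodeColumnDesc, ExprNodeFieldDesc and other expression nodes, and resolveColumnPath and canSkipGreaterThan are illustrative names only. The second helper shows the min/max comparison from 1.1 for a "column > literal" predicate, assuming the only statistic consulted is the row group maximum.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

class NestedPpdSketch {
    /** Hypothetical expression node; a stand-in for Hive's expression descriptors. */
    interface Expr {}

    /** Top-level column reference (plays the role of ExprNodeColumnDesc). */
    static final class ColumnRef implements Expr {
        final String name;
        ColumnRef(String name) { this.name = name; }
    }

    /** Struct field access on a parent expression (plays the role of ExprNodeFieldDesc). */
    static final class FieldRef implements Expr {
        final Expr parent;
        final String fieldName;
        FieldRef(Expr parent, String fieldName) {
            this.parent = parent;
            this.fieldName = fieldName;
        }
    }

    /** Any other expression (for example a function call) that cannot be pushed down. */
    static final class Other implements Expr {}

    /**
     * Resolves an expression to a dotted column path such as "col3.col5.col6".
     * Returns empty when the chain does not end in a plain column reference,
     * in which case no predicate leaf should be built for it.
     */
    static Optional<String> resolveColumnPath(Expr expr) {
        Deque<String> parts = new ArrayDeque<>();
        Expr current = expr;
        // Peel off field accesses from the innermost field outwards.
        while (current instanceof FieldRef) {
            FieldRef field = (FieldRef) current;
            parts.addFirst(field.fieldName);
            current = field.parent;
        }
        if (current instanceof ColumnRef) {
            parts.addFirst(((ColumnRef) current).name);
            return Optional.of(String.join(".", parts));
        }
        return Optional.empty();
    }

    /**
     * For a predicate "column > literal": a row group whose maximum is not
     * greater than the literal cannot contain a matching row and may be skipped.
     */
    static boolean canSkipGreaterThan(long rowGroupMax, long literal) {
        return rowGroupMax <= literal;
    }

    public static void main(String[] args) {
        Expr nested = new FieldRef(new FieldRef(new ColumnRef("col3"), "col5"), "col6");
        System.out.println(resolveColumnPath(nested));
        System.out.println(canSkipGreaterThan(7, 10));
    }
}
```

In this sketch a top-level column is simply the degenerate case of a chain with no field accesses, so one code path covers both the existing and the proposed behaviour. How the resolved path is mapped to an ORC column ID is left out, since the description above does not spell that step out.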