Xuebin Su has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/22662 )

Change subject: IMPALA-3841: Enable late materialization for collections
......................................................................

IMPALA-3841: Enable late materialization for collections

This patch enables late materialization for collections to avoid the
cost of materializing collections that will never be accessed by the
query.

For a collection column, late materialization takes effect only when the
collection column is not used in any predicate, including the `!empty()`
predicate added by the planner. Otherwise we need to read every row to
evaluate the predicate and cannot skip any. Therefore, this patch skips
registering the `!empty()` predicates if the query contains zipping
unnests. This can hurt performance if the table contains many empty
collections, but the effect should be noticeable only in extreme cases.
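
The condition above can be illustrated with a minimal sketch. It is a
simplified model, not Impala's scanner code: the function name, its
parameters, and the row layout are all hypothetical. The predicate is
evaluated on a non-collection column first, and the collection is decoded
only for rows that survive. If the predicate referred to the collection
itself, `read_collection` would have to run for every row and nothing
could be skipped.

```python
from typing import Callable, List, Sequence, Tuple


def scan_with_late_materialization(
    filter_values: Sequence[int],
    read_collection: Callable[[int], List[int]],
    predicate: Callable[[int], bool],
) -> Tuple[List[Tuple[int, List[int]]], int]:
    """Hypothetical model: return (surviving rows, rows whose collection was skipped)."""
    rows: List[Tuple[int, List[int]]] = []
    num_rows_skipped = 0
    for row_idx, value in enumerate(filter_values):
        if predicate(value):
            # Only surviving rows pay the cost of decoding the collection.
            rows.append((value, read_collection(row_idx)))
        else:
            # The collection for this row is never decoded.
            num_rows_skipped += 1
    return rows, num_rows_skipped


# Usage with made-up data: the predicate touches only the scalar column.
collections = [[1, 2], [], [3], [4, 5, 6]]
decoded_rows: List[int] = []


def read_collection(row_idx: int) -> List[int]:
    decoded_rows.append(row_idx)  # Record which collections were decoded.
    return collections[row_idx]


rows, skipped = scan_with_late_materialization(
    [10, 20, 30, 40], read_collection, lambda v: v > 25
)
```

In this made-up input only the last two rows pass the predicate, so only
their collections are decoded and the first two rows are skipped.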

The late materialization threshold is set to 1 in HdfsParquetScanner
when there is any collection that can be skipped.
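
As a rough illustration of what a threshold of 1 means, the sketch below
assumes the threshold is the minimum number of consecutive filtered-out
rows required before the scanner skips them. That reading is an
assumption, and `skippable_runs` is a hypothetical helper, not part of
HdfsParquetScanner.

```python
from typing import List, Optional, Sequence, Tuple


def skippable_runs(selected: Sequence[bool], threshold: int) -> List[Tuple[int, int]]:
    """Return [start, end) ranges of filtered-out rows that are at least `threshold` long."""
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, keep in enumerate(selected):
        if not keep:
            if start is None:
                start = i  # A run of filtered-out rows begins here.
        elif start is not None:
            if i - start >= threshold:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the batch.
    if start is not None and len(selected) - start >= threshold:
        runs.append((start, len(selected)))
    return runs


# Made-up selection mask: True means the row passed the predicates.
mask = [True, False, True, False, False, False, True]
runs_at_1 = skippable_runs(mask, 1)
runs_at_3 = skippable_runs(mask, 3)
```

Under this model, a threshold of 1 makes even an isolated filtered-out
row skippable, while a larger threshold leaves short runs to be
materialized as usual.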

This patch also adds the detail of `HdfsScanner::parse_status_` to the
error message returned by the HdfsParquetScanner to help figure out the
root cause.

Performance:
- Tests with queries involving collection columns in table
  `tpch_nested_parquet.customer` show that when the selectivity is low,
  the single-threaded (1 impalad and MT_DOP=1) scanning time is reduced
  by about 50%, while when the selectivity is high, the scanning time
  is almost unchanged.
- For queries not involving collections, performance A/B testing
  shows no regression on TPC-H.

Testing:
- Added a runtime profile counter NumRowsSkipped to record the total
  number of top-level rows skipped for all columns. The counter counts
  only rows skipped individually, not rows skipped as a whole page.
- Added e2e test cases in test_parquet_late_materialization.py to ensure
  that late materialization works using the new counter.

Change-Id: Ia21bdfa6811408d66d74367e0a9520e20951105f
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-level-decoder.h
M be/src/exec/parquet/parquet-struct-column-reader.cc
M be/src/exec/scratch-tuple-batch.h
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization-unique-db.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
M tests/query_test/test_parquet_late_materialization.py
15 files changed, 178 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/22662/9
--
To view, visit http://gerrit.cloudera.org:8080/22662
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia21bdfa6811408d66d74367e0a9520e20951105f
Gerrit-Change-Number: 22662
Gerrit-PatchSet: 9
Gerrit-Owner: Xuebin Su <x...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Xuebin Su <x...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
