Xuebin Su has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/22662 )
Change subject: IMPALA-3841: Enable late materialization for collections
......................................................................

IMPALA-3841: Enable late materialization for collections

This patch enables late materialization for collections to avoid the
cost of materializing collections that will never be accessed by the
query.

For a collection column, late materialization takes effect only when the
collection column is not used in any predicate, including the `!empty()`
predicate added by the planner. Otherwise we need to read every row to
evaluate the predicate and cannot skip any. Therefore, this patch skips
registering the `!empty()` predicates if the query contains zipping
unnests. This can affect performance if the table contains many empty
collections, but should be noticeable only in very extreme cases.

The late materialization threshold is set to 1 in HdfsParquetScanner
when there is any collection that can be skipped.

This patch also adds the detail of `HdfsScanner::parse_status_` to the
error message returned by the HdfsParquetScanner to help figure out the
root cause.

Performance:
- Tests with the queries involving collection columns in table
  `tpch_nested_parquet.customer` show that when the selectivity is low,
  the single-threaded (1 impalad and MT_DOP=1) scanning time can be
  reduced by about 50%, while when the selectivity is high, the scanning
  time almost does not change.
- For queries not involving collections, performance A/B testing shows
  no regression on TPC-H.

Testing:
- Added a runtime profile counter NumRowsSkippedByLateMaterialization to
  record the total number of top-level rows skipped by late
  materialization for all columns. The counter only counts the rows that
  are not skipped as a page.
- Added e2e test cases in test_parquet_late_materialization.py to ensure
  that late materialization works using the new counter.
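
For readers unfamiliar with the technique, the idea described above can be sketched as a toy model. This is not the code in this patch: `scan_with_late_materialization`, `read_collection`, and the sample data are hypothetical names made up for illustration, and the sketch ignores Parquet pages entirely (the real counter excludes rows skipped as a whole page). It only shows why a collection that appears in no predicate can be left undecoded for rows that fail the filter.

```python
from typing import Callable, List, Tuple


def scan_with_late_materialization(
    filter_column: List[int],
    read_collection: Callable[[int], List[int]],
    predicate: Callable[[int], bool],
) -> Tuple[List[Tuple[int, List[int]]], int]:
    """Toy scan: evaluate the predicate on the filter column first and
    decode the collection only for rows that survive it."""
    rows = []
    num_rows_skipped = 0  # loosely analogous to the new profile counter
    for row_idx, value in enumerate(filter_column):
        if not predicate(value):
            # The collection for this row is never decoded.
            num_rows_skipped += 1
            continue
        rows.append((value, read_collection(row_idx)))
    return rows, num_rows_skipped


# Hypothetical data: a scalar key column and a collection column.
keys = [1, 2, 3, 4, 5, 6]
collections = [[10], [], [30, 31], [40], [], [60, 61, 62]]
decoded = []  # records which rows had their collection decoded


def read_collection(row_idx: int) -> List[int]:
    decoded.append(row_idx)
    return collections[row_idx]


rows, skipped = scan_with_late_materialization(
    keys, read_collection, lambda k: k > 4
)
```

In this sketch only rows 4 and 5 have their collections decoded. If the predicate referenced the collection itself (for example a `!empty()` check), `read_collection` would have to run for every row and nothing could be skipped, which is the reason the commit message gives for not registering that predicate.
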
Change-Id: Ia21bdfa6811408d66d74367e0a9520e20951105f
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-complex-column-reader.h
M be/src/exec/parquet/parquet-level-decoder.h
M be/src/exec/parquet/parquet-struct-column-reader.cc
M be/src/exec/scratch-tuple-batch.h
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M testdata/workloads/functional-planner/queries/PlannerTest/zipping-unnest.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization-unique-db.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
M tests/query_test/test_parquet_late_materialization.py
15 files changed, 183 insertions(+), 33 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/22662/8
--
To view, visit http://gerrit.cloudera.org:8080/22662
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia21bdfa6811408d66d74367e0a9520e20951105f
Gerrit-Change-Number: 22662
Gerrit-PatchSet: 8
Gerrit-Owner: Xuebin Su <x...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Xuebin Su <x...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>