[ https://issues.apache.org/jira/browse/IMPALA-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953330#comment-17953330 ]

ASF subversion and git services commented on IMPALA-9873:
---------------------------------------------------------

Commit a0b3ae4e028330d26893fe1baeb715f425444b75 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a0b3ae4e0 ]

IMPALA-11396: Deflake test_low_mem_limit_orderby_all

test_low_mem_limit_orderby_all is flaky when test_mem_limit is 100 or
120 in the test vector. The minimum mem_limit needed to run this test is
120MB + 30MB = 150MB. Thus, these test vectors expect one of the
MEM_LIMIT_ERROR_MSGS to be thrown, because mem_limit (test_mem_limit)
is not enough.

A Parquet scan under this low mem_limit sometimes throws a "Couldn't
skip rows in column" error instead. This possibly indicates that memory
exhaustion happens while reading the Parquet page index or during late
materialization (see IMPALA-5843, IMPALA-9873, IMPALA-11134). This patch
attempts to deflake the test by adding "Couldn't skip rows in column" to
MEM_LIMIT_ERROR_MSGS.

Change-Id: I43a953bc19b40256e3a8fe473b1498bbe477c54d
Reviewed-on: http://gerrit.cloudera.org:8080/22932
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
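The deflaking approach above can be sketched in Python. This is a minimal illustration, not the actual test code: the contents of the message list (other than the newly added string) and the helper function name are assumptions for illustration only.

```python
# Hypothetical sketch of the deflake approach: the test accepts any of
# several expected memory-related error messages rather than exactly one.
MEM_LIMIT_ERROR_MSGS = [
    "Memory limit exceeded",            # assumed pre-existing entry
    "Couldn't skip rows in column",     # entry added by this patch
]

def error_is_expected(error_text):
    """Return True if the query error matches any expected OOM message."""
    return any(msg in error_text for msg in MEM_LIMIT_ERROR_MSGS)
```

With this change, a run that fails during page-index reading or late materialization still counts as the expected out-of-memory outcome instead of flaking the test.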


> Skip decoding of non-materialised columns in Parquet
> ----------------------------------------------------
>
>                 Key: IMPALA-9873
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9873
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Amogh Margoor
>            Priority: Critical
>             Fix For: Impala 4.1.0
>
>
> This is a first milestone for lazy materialization in parquet, focusing on 
> avoiding decompression and decoding of columns.
> * Identify columns referenced by predicates and runtime row filters and 
> determine the order in which the columns need to be materialised. We probably 
> want to evaluate static predicates before runtime filters to match current 
> behaviour.
> * Rework this loop so that it alternates between materialising columns and 
> evaluating predicates: 
> https://github.com/apache/impala/blob/052129c/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1110
> * We probably need to keep track of filtered rows using a new data structure, 
> e.g. a bitmap.
> * We then need to check that bitmap at each step to see if we can skip 
> materialising part or all of the following columns. E.g. if the first N rows 
> were pruned, we can skip the remaining readers forward by N rows.
> * This part may be a little tricky - there is the risk of adding overhead 
> compared to the current code.
> * It is probably OK to just materialise the partition columns to start off 
> with - avoiding materialising those is not going to buy that much.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
