[
https://issues.apache.org/jira/browse/IMPALA-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904946#comment-17904946
]
ASF subversion and git services commented on IMPALA-13598:
----------------------------------------------------------
Commit d086babdbd249df0069900739f24da280b06a279 in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d086babdb ]
IMPALA-13598: OPTIMIZE redundantly accumulates memory in HDFS WRITER
When OptimizeStmt created the table sink, it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept an output writer open for every partition,
which resulted in high memory consumption and potentially an
OOM error when the number of partitions is high.
Since we actually sort the rows before the sink, we can set
'inputIsClustered' to true. This means HdfsTableSink can write
files one by one: whenever it gets a row that belongs
to a new partition, it knows it can close the current output
writer and open a new one.
Testing:
* added e2e test
Change-Id: I8d451c50c4b6dff9433ab105493051bee106bc63
Reviewed-on: http://gerrit.cloudera.org:8080/22192
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
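The clustered-write behavior described in the commit message can be illustrated with a
minimal sketch. The names below (TableSinkSketch, PartitionWriter, writeRow, finish) are
hypothetical and do not reflect Impala's actual code (the real HdfsTableSink lives in the
C++ backend, while OptimizeStmt is in the Java frontend); the sketch only shows why marking
the input as clustered lets the sink keep a single writer open instead of one per partition.
{noformat}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical names, not Impala's real HdfsTableSink.
public class TableSinkSketch {

  /** Stand-in for an open output file writer of one partition. */
  static class PartitionWriter {
    final String partitionKey;
    PartitionWriter(String partitionKey) { this.partitionKey = partitionKey; }
    void write(String row) { /* buffer row data; holds memory while open */ }
    void close() { /* flush buffers and release memory */ }
  }

  private final boolean inputIsClustered;
  // Used only for non-clustered input: one open writer per partition seen so far.
  private final Map<String, PartitionWriter> openWriters = new HashMap<>();
  // Used only for clustered input: at most one writer is open at a time.
  private PartitionWriter currentWriter = null;

  TableSinkSketch(boolean inputIsClustered) { this.inputIsClustered = inputIsClustered; }

  void writeRow(String partitionKey, String row) {
    if (inputIsClustered) {
      // Rows arrive grouped by partition, so a new key means the previous
      // partition is finished and its writer can be closed immediately.
      if (currentWriter == null || !currentWriter.partitionKey.equals(partitionKey)) {
        if (currentWriter != null) currentWriter.close();
        currentWriter = new PartitionWriter(partitionKey);
      }
      currentWriter.write(row);
    } else {
      // Random input: a partition may reappear later, so every writer stays
      // open until the end; memory grows with the number of partitions.
      openWriters.computeIfAbsent(partitionKey, PartitionWriter::new).write(row);
    }
  }

  void finish() {
    if (currentWriter != null) currentWriter.close();
    for (PartitionWriter w : openWriters.values()) w.close();
  }
}
{noformat}
With clustered input, peak sink memory is bounded by a single open writer regardless of how
many partitions exist; with random input it scales with the number of distinct partitions,
which matches the Memory Limit Exceeded behavior reported in this issue.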
> OPTIMIZE redundantly accumulates memory in HDFS WRITER
> ------------------------------------------------------
>
> Key: IMPALA-13598
> URL: https://issues.apache.org/jira/browse/IMPALA-13598
> Project: IMPALA
> Issue Type: Bug
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Labels: impala-iceberg
>
> When we have an Iceberg table that has lots of partitions and we want to
> compact the table via OPTIMIZE, the operation uses much more memory than needed.
> Repro steps:
> {noformat}
> create table tmp_ice_tpch
> partitioned by spec(truncate(500, l_orderkey))
> stored by iceberg as
> select * from tpch.lineitem;
> OPTIMIZE TABLE tmp_ice_tpch;
> # We likely get a Memory Limit Exceeded error here{noformat}
> Currently OPTIMIZE uses INSERT OVERWRITE under the hood:
> {noformat}
> INSERT OVERWRITE tmp_ice_tpch SELECT * FROM tmp_ice_tpch;{noformat}
> But running the equivalent INSERT OVERWRITE directly doesn't accumulate memory like this.