[ 
https://issues.apache.org/jira/browse/IMPALA-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956184#comment-17956184
 ] 

Riza Suminto commented on IMPALA-14075:
---------------------------------------

We can use ThreadPools.getWorkerPool() from the Iceberg library. I think that 
is OK, since CatalogD does not use that pool for planFiles.

> Parallelize delete operations of EXPIRE_SNAPSHOTS
> -------------------------------------------------
>
>                 Key: IMPALA-14075
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14075
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Riza Suminto
>            Priority: Major
>              Labels: impala-iceberg
>
> Currently Impala executes the EXPIRE_SNAPSHOTS operation on a single thread. 
> This can be very slow on cloud storage systems, especially if the operation 
> needs to remove many files.
> It is possible to run the delete operations in parallel by passing an 
> ExecutorService object to ExpireSnapshots:
> {noformat}
> ExpireSnapshots executeDeleteWith(ExecutorService executorService);{noformat}
> [https://github.com/apache/iceberg/blob/31c315f695aad544a096a5a2ffdde54a97b90b28/api/src/main/java/org/apache/iceberg/ExpireSnapshots.java#L100]
> For reference, Hive uses 4 threads to execute the deletes:
> [https://github.com/apache/hive/blob/08067725bc6e8810579324736a0aac453c06bf7b/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2239-L2241]
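
As a rough illustration of the idea in the description, the sketch below runs file deletes on a fixed pool of 4 threads (the figure quoted for Hive) using only the JDK. It is not Impala or Iceberg code: the class name, DELETE_THREADS, deleteAll and demo are hypothetical, and local temp files stand in for cloud storage objects. In Impala the executor would presumably be handed to ExpireSnapshots.executeDeleteWith(...) rather than driven by hand like this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parallel deletes driven by an ExecutorService.
class ParallelDeleteSketch {
  // Assumed pool size, mirroring the 4 threads mentioned for Hive.
  static final int DELETE_THREADS = 4;

  // Submits one delete task per path and waits for all of them.
  // Returns the number of files that were actually removed.
  static int deleteAll(List<Path> paths, ExecutorService pool) {
    List<Future<Boolean>> results = new ArrayList<>();
    for (Path p : paths) {
      Callable<Boolean> task = () -> Files.deleteIfExists(p);
      results.add(pool.submit(task));
    }
    int deleted = 0;
    for (Future<Boolean> f : results) {
      try {
        if (f.get()) {
          deleted++;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for deletes", e);
      } catch (ExecutionException e) {
        throw new IllegalStateException("delete failed", e.getCause());
      }
    }
    return deleted;
  }

  // Creates fileCount temp files, deletes them in parallel, returns the count.
  static int demo(int fileCount) {
    ExecutorService pool = Executors.newFixedThreadPool(DELETE_THREADS);
    try {
      Path dir = Files.createTempDirectory("expire-sketch");
      List<Path> files = new ArrayList<>();
      for (int i = 0; i < fileCount; i++) {
        files.add(Files.createTempFile(dir, "data-", ".tmp"));
      }
      int deleted = deleteAll(files, pool);
      Files.deleteIfExists(dir);
      return deleted;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      // All submitted tasks have completed or failed by this point.
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("deleted " + demo(20) + " files");
  }
}
```

A dedicated pool like this one has to be shut down by its owner. A shared pool, such as the one suggested in the comment above, avoids that lifecycle management, at the cost of competing with whatever else uses the same pool.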



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
