[ 
https://issues.apache.org/jira/browse/IMPALA-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959604#comment-17959604
 ] 

ASF subversion and git services commented on IMPALA-14075:
----------------------------------------------------------

Commit ccb8eac10a4ffdce61dd8fb1c359969b6ba2c77e in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ccb8eac10 ]

IMPALA-14075: Add CatalogOpExecutor.icebergExecutorService_

Before this patch, Impala executed the EXPIRE_SNAPSHOTS operation on a
single thread. This can be very slow on cloud storage systems,
especially when the operation needs to remove many files.

This patch adds CatalogOpExecutor.icebergExecutorService_ to parallelize
Iceberg API calls that accept an ExecutorService, such as
ExpireSnapshots.executeDeleteWith(). The number of threads for this
executor service is controlled by the CatalogD flag
--iceberg_catalog_num_threads. It defaults to 16, the same as the
--num_metadata_loading_threads default value.
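
As a rough illustration of the idea (not the code from this commit), the
sketch below builds a fixed-size pool with named threads and uses it to
delete files concurrently. The class name ParallelDeleteSketch, the helper
names, and the thread-name format are hypothetical; only the
'IcebergCatalogThread' prefix and the thread count of 16 come from the
commit message. It uses only the JDK so it runs without Iceberg on the
classpath. With Iceberg, such a pool would be handed to the API instead,
for example via table.expireSnapshots().executeDeleteWith(pool).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelDeleteSketch {
  /** Builds a fixed-size pool whose threads carry a recognizable name prefix. */
  static ExecutorService newNamedPool(int numThreads, String prefix) {
    AtomicInteger seq = new AtomicInteger();
    ThreadFactory factory = r -> {
      // Hypothetical naming scheme: "<prefix>-<n>".
      Thread t = new Thread(r, prefix + "-" + seq.getAndIncrement());
      t.setDaemon(true);
      return t;
    };
    return Executors.newFixedThreadPool(numThreads, factory);
  }

  /** Deletes every path on the pool; returns how many files were actually removed. */
  static int deleteAll(List<Path> paths, ExecutorService pool)
      throws InterruptedException, ExecutionException {
    List<Future<Boolean>> results = new ArrayList<>();
    for (Path p : paths) {
      // Each delete is an independent task, so slow storage calls overlap.
      Callable<Boolean> task = () -> Files.deleteIfExists(p);
      results.add(pool.submit(task));
    }
    int deleted = 0;
    for (Future<Boolean> f : results) {
      if (f.get()) deleted++;  // get() rethrows any failure from the task
    }
    return deleted;
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("expire-sketch");
    List<Path> files = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      files.add(Files.createTempFile(dir, "data-", ".tmp"));
    }
    // 16 mirrors the default mentioned above for --iceberg_catalog_num_threads.
    ExecutorService pool = newNamedPool(16, "IcebergCatalogThread");
    try {
      System.out.println("deleted " + deleteAll(files, pool) + " files");
    } finally {
      pool.shutdown();
      Files.deleteIfExists(dir);
    }
  }
}
```

Sharing one long-lived pool, rather than creating one per operation, keeps
the total number of concurrent storage requests bounded by a single flag.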

Rename ValidateMinProcessingPerThread to ValidatePositiveInt64 to match
the other validators in backend-gflag-util.cc.

Testing:
- Lower sleep time between insert queries from 5s to 1s in
  test_expire_snapshots and test_describe_history_params to speed up
  tests.
- Manually verify that 'IcebergCatalogThread' threads are visible in
  /jvm-threadz page of CatalogD.
- Pass test_iceberg.py.

Change-Id: I6dcbf1e406e1732ef8829eb0cd627d932291d485
Reviewed-on: http://gerrit.cloudera.org:8080/22980
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Parallelize delete operations of EXPIRE_SNAPSHOTS
> -------------------------------------------------
>
>                 Key: IMPALA-14075
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14075
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Riza Suminto
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 5.0.0
>
>
> Currently Impala executes the EXPIRE_SNAPSHOTS operation on a single
> thread. This can be very slow on cloud storage systems, especially when
> the operation needs to remove many files.
> It is possible to run the delete operations in parallel by passing an 
> ExecutorService object to ExpireSnapshots:
> {noformat}
> ExpireSnapshots executeDeleteWith(ExecutorService executorService);{noformat}
> [https://github.com/apache/iceberg/blob/31c315f695aad544a096a5a2ffdde54a97b90b28/api/src/main/java/org/apache/iceberg/ExpireSnapshots.java#L100]
> For reference, Hive uses 4 threads to execute the deletes:
> [https://github.com/apache/hive/blob/08067725bc6e8810579324736a0aac453c06bf7b/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2239-L2241]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

