[
https://issues.apache.org/jira/browse/IMPALA-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17959604#comment-17959604
]
ASF subversion and git services commented on IMPALA-14075:
----------------------------------------------------------
Commit ccb8eac10a4ffdce61dd8fb1c359969b6ba2c77e in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ccb8eac10 ]
IMPALA-14075: Add CatalogOpExecutor.icebergExecutorService_
Before this patch, Impala executed the EXPIRE_SNAPSHOTS operation on a
single thread. This can be very slow on cloud storage systems,
especially when the operation needs to remove many files.
This patch adds CatalogOpExecutor.icebergExecutorService_ to parallelize
Iceberg API calls that accept an ExecutorService, such as
ExpireSnapshots.executeDeleteWith(). The number of threads in this
executor service is controlled by the CatalogD flag
--iceberg_catalog_num_threads. It defaults to 16, the same as the default
value of --num_metadata_loading_threads.
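As a rough illustration of the idea (not the code from this patch), the sketch below builds a fixed-size pool of named threads and fans delete tasks out to it. The class name, the helper names, the "<prefix>-<n>" thread naming scheme, and the file paths are all hypothetical; only the 16-thread default and the 'IcebergCatalogThread' name come from the commit message. With Iceberg on the classpath, such a pool would instead be handed to ExpireSnapshots.executeDeleteWith().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch; it does not reproduce CatalogOpExecutor.
class ParallelDeleteSketch {

  // Builds a fixed-size pool whose daemon threads are named "<prefix>-<n>".
  // The naming scheme is an assumption made for this example.
  static ExecutorService newNamedPool(int numThreads, String prefix) {
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory factory = runnable -> {
      Thread t = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
      t.setDaemon(true);
      return t;
    };
    return Executors.newFixedThreadPool(numThreads, factory);
  }

  // Submits one delete task per path, then blocks until every task is done.
  // deleteFunc stands in for the storage-level delete call.
  static void deleteAll(List<String> paths, ExecutorService pool,
      Consumer<String> deleteFunc) throws Exception {
    List<Future<?>> pending = new ArrayList<>();
    for (String path : paths) {
      pending.add(pool.submit(() -> deleteFunc.accept(path)));
    }
    for (Future<?> f : pending) {
      f.get();  // Rethrows a failed delete as an ExecutionException.
    }
  }

  public static void main(String[] args) throws Exception {
    // 16 mirrors the default of --iceberg_catalog_num_threads.
    ExecutorService pool = newNamedPool(16, "IcebergCatalogThread");
    Set<String> deleted = ConcurrentHashMap.newKeySet();
    try {
      // Made-up paths; a real caller would delete files instead of recording them.
      List<String> paths = List.of("s3a://bucket/tbl/data/a.parquet",
          "s3a://bucket/tbl/data/b.parquet", "s3a://bucket/tbl/data/c.parquet");
      deleteAll(paths, pool, deleted::add);
      System.out.println("Deleted " + deleted.size() + " files");
    } finally {
      pool.shutdown();
    }
  }
}
```

Sharing one long-lived pool, rather than creating one per operation, keeps the thread count bounded by the flag value no matter how many operations run concurrently.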
Rename ValidateMinProcessingPerThread to ValidatePositiveInt64 to match
the other validators in backend-gflag-util.cc.
Testing:
- Lower sleep time between insert queries from 5s to 1s in
  test_expire_snapshots and test_describe_history_params to speed up
  tests.
- Manually verify that 'IcebergCatalogThread' threads are visible in
  /jvm-threadz page of CatalogD.
- Pass test_iceberg.py.
Change-Id: I6dcbf1e406e1732ef8829eb0cd627d932291d485
Reviewed-on: http://gerrit.cloudera.org:8080/22980
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Parallelize delete operations of EXPIRE_SNAPSHOTS
> -------------------------------------------------
>
> Key: IMPALA-14075
> URL: https://issues.apache.org/jira/browse/IMPALA-14075
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Zoltán Borók-Nagy
> Assignee: Riza Suminto
> Priority: Major
> Labels: impala-iceberg
> Fix For: Impala 5.0.0
>
>
> Currently, Impala executes the EXPIRE_SNAPSHOTS operation on a single
> thread. This can be very slow on cloud storage systems, especially when
> the operation needs to remove many files.
> It is possible to run the delete operations in parallel by passing an
> ExecutorService object to ExpireSnapshots:
> {noformat}
> ExpireSnapshots executeDeleteWith(ExecutorService executorService);
> {noformat}
> [https://github.com/apache/iceberg/blob/31c315f695aad544a096a5a2ffdde54a97b90b28/api/src/main/java/org/apache/iceberg/ExpireSnapshots.java#L100]
> For reference, Hive uses 4 threads to execute the deletes:
> [https://github.com/apache/hive/blob/08067725bc6e8810579324736a0aac453c06bf7b/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2239-L2241]