This is an automated email from the ASF dual-hosted git repository.
michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
The following commit(s) were added to refs/heads/master by this push:
new 2dded9209 IMPALA-13392: Document File Filtering in OPTIMIZE Statement
2dded9209 is described below
commit 2dded92093ae9b9db4deee2c6b5eb61b4c9133f2
Author: Noemi Pap-Takacs <[email protected]>
AuthorDate: Wed Sep 25 13:24:17 2024 +0200
IMPALA-13392: Document File Filtering in OPTIMIZE Statement
Document the feature added in 'IMPALA-12867: Filter files to
OPTIMIZE based on file size'.
Change-Id: I73f88adedaf48909784baaf42488cb96defddfc3
Reviewed-on: http://gerrit.cloudera.org:8080/21852
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Daniel Becker <[email protected]>
---
docs/topics/impala_iceberg.xml | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/docs/topics/impala_iceberg.xml b/docs/topics/impala_iceberg.xml
index 0c0ce344a..b333320b3 100644
--- a/docs/topics/impala_iceberg.xml
+++ b/docs/topics/impala_iceberg.xml
@@ -555,18 +555,22 @@ UPDATE ice_t SET ice_t.k = o.k, ice_t.j = o.j, FROM ice_t, other_table o where i
This causes read performance to degrade over time.
The following statement can be used to compact the table and optimize it for reading.
<codeblock>
-OPTIMIZE TABLE [<varname>db_name</varname>.]<varname>table_name</varname>;
+OPTIMIZE TABLE [<varname>db_name</varname>.]<varname>table_name</varname> [FILE_SIZE_THRESHOLD_MB=<varname>value</varname>];
</codeblock>
</p>
<p>
- The current implementation of the <codeph>OPTIMIZE TABLE</codeph> statement rewrites
- the entire table, executing the following tasks:
+ The <codeph>OPTIMIZE TABLE</codeph> statement rewrites the table, executing the
+ following tasks:
<ul>
- <li>compact small files</li>
- <li>merge delete and update deltas</li>
- <li>rewrite all files, converting them to the latest table schema</li>
- <li>rewrite all partitions according to the latest partition spec</li>
+ <li>Merges delete files with the corresponding data files.</li>
+ <li>Compacts data files that are smaller than the specified file size threshold in megabytes.</li>
+ </ul>
+ If no <codeph>FILE_SIZE_THRESHOLD_MB</codeph> was specified, the command compacts
+ ALL files and also
+ <ul>
+ <li>Converts data files to the latest table schema.</li>
+ <li>Rewrites all partitions according to the latest partition spec.</li>
</ul>
</p>
@@ -574,9 +578,9 @@ OPTIMIZE TABLE [<varname>db_name</varname>.]<varname>table_name</varname>;
To execute table optimization:
<ul>
<li>The user needs ALL privileges on the table.</li>
- <li>The table can conatin any file formats that Impala can read, but <codeph>write.format.default</codeph>
+ <li>The table can contain any file formats that Impala can read, but <codeph>write.format.default</codeph>
has to be <codeph>parquet</codeph>.</li>
- <li>The table cannot contain complex types.</li>
+ <li>General write limitations apply, e.g. the table cannot contain complex types.</li>
</ul>
</p>
@@ -584,11 +588,15 @@ OPTIMIZE TABLE [<varname>db_name</varname>.]<varname>table_name</varname>;
When a table is optimized, a new snapshot is created. The old table state is still
accessible by time travel to previous snapshots, because the rewritten data and
delete files are not removed physically.
+ Issue the <codeph>ALTER TABLE ... EXECUTE expire_snapshots(...)</codeph> command
+ to remove the old files from the file system.
</p>
<p>
- Note that the current implementation of <codeph>OPTIMIZE TABLE</codeph> rewrites
- the entire table, therefore this operation can take a long time to complete
+ Note that <codeph>OPTIMIZE TABLE</codeph> without a specified <codeph>FILE_SIZE_THRESHOLD_MB</codeph>
+ rewrites the entire table, therefore the operation can take a long time to complete
depending on the size of the table.
+ It is recommended to specify a file size threshold for recurring table maintenance
+ jobs to save resources.
</p>
</conbody>
</concept>
@@ -700,8 +708,9 @@ ALTER TABLE ice_tbl EXECUTE expire_snapshots(now() - interval 5 days);
<p>
Expire snapshots:
<ul>
- <li>does not remove old metadata files by default.</li>
+ <li>removes data files that are no longer referenced by non-expired snapshots.</li>
<li>does not remove orphaned data files.</li>
+ <li>does not remove old metadata files by default.</li>
<li>respects the minimum number of snapshots to keep:
<codeph>history.expire.min-snapshots-to-keep</codeph> table property.</li>
</ul>