This is an automated email from the ASF dual-hosted git repository. hope pushed a commit to branch release-1.4 in repository https://gitbox.apache.org/repos/asf/paimon.git
commit fcf405e54d6aef98b52c73b50d0a4c3a327fb7da Author: JingsongLi <[email protected]> AuthorDate: Wed Mar 25 23:10:11 2026 +0800 [doc] Update Documentations for Append table --- docs/content/append-table/bucketed.md | 89 ++++++------ docs/content/append-table/data-evolution.md | 2 +- .../content/append-table/incremental-clustering.md | 2 +- docs/content/append-table/overview.md | 157 ++++++++++++++++++++- docs/content/append-table/query-performance.md | 102 ------------- docs/content/append-table/streaming.md | 104 -------------- docs/content/append-table/update.md | 41 ------ docs/content/learn-paimon/understand-files.md | 2 +- 8 files changed, 198 insertions(+), 301 deletions(-) diff --git a/docs/content/append-table/bucketed.md b/docs/content/append-table/bucketed.md index 5643da0c05..04dc30699b 100644 --- a/docs/content/append-table/bucketed.md +++ b/docs/content/append-table/bucketed.md @@ -1,6 +1,6 @@ --- title: "Bucketed" -weight: 5 +weight: 3 type: docs aliases: - /append-table/bucketed.html @@ -46,7 +46,46 @@ CREATE TABLE my_table ( {{< /tab >}} {{< /tabs >}} -## Streaming +## Data Skipping + +The primary and most significant advantage of a bucketed append table is **data skipping**. When queries contain +equality (`=`) or `IN` filter conditions on the `bucket-key`, Paimon can efficiently push these predicates down to +skip irrelevant bucket files entirely. This means a large number of files that do not match the filter are pruned +before reading, drastically reducing I/O and accelerating queries. + +For example, if `bucket-key` is `product_id` and you query: + +```sql +SELECT * FROM my_table WHERE product_id = 12345; + +SELECT * FROM my_table WHERE product_id IN (1, 2, 3); +``` + +Paimon will only read the bucket that contains the matching `product_id` values, filtering out all other bucket files. +This is extremely effective when the table has many buckets and you are querying a small subset of bucket-key values. 
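Bucket pruning composes with ordinary partition pruning. A minimal sketch of this, assuming a hypothetical `orders` table (table name, column names, and values here are illustrative, not from the commit):

```sql
-- Hypothetical partitioned, bucketed append table.
CREATE TABLE orders (
    product_id BIGINT,
    price DOUBLE,
    dt STRING
) PARTITIONED BY (dt) TBLPROPERTIES (
    'bucket' = '16',
    'bucket-key' = 'product_id'
);

-- Partition pruning (dt) and bucket pruning (product_id) combine:
-- only the matching bucket inside the matching partition is read.
SELECT * FROM orders WHERE dt = '20240101' AND product_id = 12345;
```

Note that non-equality predicates on the bucket key (for example `product_id > 100`) cannot be mapped to specific buckets, so they do not benefit from bucket pruning.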
+
+## Bucketed Join
+
+Bucketed tables can also be used to accelerate join queries by avoiding costly shuffle operations in batch processing.
+For example, you can use the following Spark SQL to read a Paimon table:
+
+```sql
+SET spark.sql.sources.v2.bucketing.enabled = true;
+
+CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id');
+
+CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id');
+
+SELECT * FROM FACT_TABLE JOIN DIM_TABLE ON FACT_TABLE.order_id = DIM_TABLE.order_id;
+```
+
+The `spark.sql.sources.v2.bucketing.enabled` config is used to enable bucketing for V2 data sources. When turned on,
+Spark will recognize the specific distribution reported by a V2 data source through `SupportsReportPartitioning`, and
+will try to avoid shuffle if necessary.
+
+The costly join shuffle will be avoided if the two tables have the same bucketing strategy and the same number of buckets.
+
+## Bucketed Streaming
 
 An ordinary Append table has no strict ordering guarantees for its streaming writes and reads, but there are some cases
 where you need to define a key similar to Kafka's.
@@ -57,43 +96,7 @@ bucket as a queue.
 
 {{< img src="/img/for-queue.png">}}
 
-### Compaction in Bucket
-
-By default, the sink node will automatically perform compaction to control the number of files. The following options
-control the strategy of compaction:
-
-<table class="configuration table table-bordered">
-    <thead>
-    <tr>
-        <th class="text-left" style="width: 20%">Key</th>
-        <th class="text-left" style="width: 15%">Default</th>
-        <th class="text-left" style="width: 10%">Type</th>
-        <th class="text-left" style="width: 55%">Description</th>
-    </tr>
-    </thead>
-    <tbody>
-    <tr>
-        <td><h5>write-only</h5></td>
-        <td style="word-wrap: break-word;">false</td>
-        <td>Boolean</td>
-        <td>If set to true, compactions and snapshot expiration will be skipped.
This option is used along with dedicated compact jobs.</td>
-    </tr>
-    <tr>
-        <td><h5>compaction.min.file-num</h5></td>
-        <td style="word-wrap: break-word;">5</td>
-        <td>Integer</td>
-        <td>For file set [f_0,...,f_N], the minimum file number to trigger a compaction for append table.</td>
-    </tr>
-    <tr>
-        <td><h5>full-compaction.delta-commits</h5></td>
-        <td style="word-wrap: break-word;">(none)</td>
-        <td>Integer</td>
-        <td>Full compaction will be constantly triggered after delta commits.</td>
-    </tr>
-    </tbody>
-</table>
-
-### Streaming Read Order
+**Streaming Read Order**
 
 For streaming reads, records are produced in the following order:
 
@@ -103,7 +106,7 @@ For streaming reads, records are produced in the following order:
 * For any two records from the same partition and the same bucket, the first written record will be produced first.
 * For any two records from the same partition but two different buckets, different buckets are processed by different tasks, there is no order guarantee between them.
 
-### Watermark Definition
+**Watermark Definition**
 
 You can define watermark for reading Paimon tables:
 
@@ -148,7 +151,7 @@ which will make sure no sources/splits/shards/partitions increase their watermar
 </tbody>
 </table>
 
-### Bounded Stream
+**Bounded Stream**
 
 Streaming source can also be bounded: you can specify 'scan.bounded.watermark' to define the end condition for bounded
 streaming mode; stream reading will end when a snapshot with a larger watermark is encountered.
@@ -170,7 +173,3 @@ INSERT INTO paimon_table SELECT * FROM kafka_table;
 
 -- launch a bounded streaming job to read paimon_table
 SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
 ```
-
-## Bucketed Join
-
-Bucketed table can be used to avoid shuffle if necessary in batch query, see [Bucketed Join]({{< ref "append-table/query-performance#bucketed-join" >}}).
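As a concrete sketch of the bounded-stream pattern above (the watermark value below is an illustrative epoch-milliseconds timestamp, not taken from the commit):

```sql
-- Stop the streaming read once a snapshot whose watermark exceeds
-- this value (epoch millis, illustrative) is encountered.
SELECT * FROM paimon_table
/*+ OPTIONS('scan.bounded.watermark' = '1704067200000') */;
```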
diff --git a/docs/content/append-table/data-evolution.md b/docs/content/append-table/data-evolution.md
index fb3b91da5f..2356ef365d 100644
--- a/docs/content/append-table/data-evolution.md
+++ b/docs/content/append-table/data-evolution.md
@@ -84,7 +84,7 @@ This statement updates only the `b` column in the target table `target_table` ba
 `source_table`. The `id` column and `c` column remain unchanged, and new records are inserted with the specified values.
 The difference between this and tables that do not have data evolution enabled is that only the `b` column data is written to new files.
 Note that:
-* Data Evolution Table does not support 'Delete', 'Update', or 'Compact' statement yet.
+* Data Evolution Table does not support 'Delete' and 'Update' statements yet.
 * Merge Into for Data Evolution Table does not support 'WHEN NOT MATCHED BY SOURCE' clause.
 
 ### Flink
diff --git a/docs/content/append-table/incremental-clustering.md b/docs/content/append-table/incremental-clustering.md
index 5358c040d1..232ee13d7c 100644
--- a/docs/content/append-table/incremental-clustering.md
+++ b/docs/content/append-table/incremental-clustering.md
@@ -1,6 +1,6 @@
 ---
 title: "Incremental Clustering"
-weight: 4
+weight: 2
 type: docs
 aliases:
 - /append-table/incremental-clustering.html
diff --git a/docs/content/append-table/overview.md b/docs/content/append-table/overview.md
index 67d063c584..8644106e83 100644
--- a/docs/content/append-table/overview.md
+++ b/docs/content/append-table/overview.md
@@ -50,9 +50,154 @@ CREATE TABLE my_table (
 Batch write and batch read in typical application scenarios, similar to a regular Hive partition table, but compared
 to the Hive table, it can bring:
 
-1. Object storage (S3, OSS) friendly
-2. Time Travel and Rollback
-3. DELETE / UPDATE with low cost
-4. Automatic small file merging in streaming sink
-5. Streaming read & write like a queue
-6. High performance query with order and index
+1.
Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine
+   changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
+2. Scan planning is fast — data files are pruned with partition and column-level stats, using table metadata. File
+   Index (BloomFilter, Bitmap, Range Bitmap) and aggregate push-down further accelerate queries.
+3. Schema evolution supports adding, dropping, updating, and renaming columns, with no side-effects.
+4. Rich ecosystem — tables are accessible from compute engines including Flink, Spark, Hive, Trino, Presto, StarRocks,
+   and Doris, working just like a SQL table.
+5. Incremental Clustering with z-order/hilbert/order sorting to optimize data layout at low cost.
+6. Streaming read & write like a queue; DELETE / UPDATE / MERGE INTO provide low-cost row-level operations.
+
+## Append Streaming
+
+You can stream write to the Append table in a very flexible way through Flink, or read the Append table through
+Flink, using it like a queue. The main difference from a message queue is that its latency is in minutes.
Its advantages are very low cost
+and the ability to push down filters and projection.
+
+**Pre small files merging**
+
+"Pre" means that this compaction occurs before committing files to the snapshot.
+
+If Flink's checkpoint interval is short (for example, 30 seconds), each snapshot may produce lots of small changelog
+files. Too many files may put a burden on the distributed storage cluster.
+
+In order to compact small changelog files into large ones, you can set the table option `precommit-compact = true`.
+The default value of this option is false. If true, a compact coordinator and worker operator will be added after the
+writer operator, which compact small changelog files into large ones.
+
+**Post small files merging**
+
+"Post" means that this compaction occurs after committing files to the snapshot.
+
+In a streaming write job without a bucket definition, there is no compaction in the writer; instead, a
+`Compact Coordinator` scans for small files and passes compaction tasks to a `Compact Worker`. In streaming mode, if you
+run an insert SQL in Flink, the topology will be like this:
+
+{{< img src="/img/unaware-bucket-topo.png">}}
+
+Do not worry about backpressure; compaction never causes backpressure.
+
+If you set `write-only` to true, the `Compact Coordinator` and `Compact Worker` will be removed from the topology.
+
+Auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in Flink by
+Flink action in Paimon and disable all the other compactions by setting `write-only`.
+
+**Streaming Query**
+
+You can stream the Append table and use it like a Message Queue. As with primary key tables, there are two options
+for streaming reads:
+1. By default, streaming read produces the latest snapshot on the table upon first startup, and continues to read the
+   latest incremental records.
+2. You can specify `scan.mode`, `scan.snapshot-id`, `scan.timestamp-millis` and/or `scan.file-creation-time-millis` to
+   stream read increments only.
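The two read modes above can be sketched as follows (the snapshot id and timestamp values are illustrative; only the option names come from the list above):

```sql
-- Default: consume the latest snapshot first, then read new increments.
SELECT * FROM my_table /*+ OPTIONS('scan.mode' = 'latest-full') */;

-- Read increments only, starting from an illustrative snapshot id.
SELECT * FROM my_table /*+ OPTIONS('scan.snapshot-id' = '5') */;

-- Or start from a point in time (epoch millis, illustrative).
SELECT * FROM my_table /*+ OPTIONS('scan.timestamp-millis' = '1704067200000') */;
```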
+
+Similar to Flink's Kafka connector, order is not guaranteed by default. If your data has an ordering requirement, you
+also need to consider defining a `bucket-key`; see [Bucketed Append]({{< ref "append-table/bucketed" >}}).
+
+## Aggregate push down
+
+Append Table supports aggregate push down:
+
+```sql
+SELECT COUNT(*) FROM TABLE WHERE DT = '20230101';
+```
+
+This query can be accelerated during compilation and returns very quickly.
+
+For Spark SQL, a table with the default `metadata.stats-mode` can be accelerated:
+
+```sql
+SELECT MIN(a), MAX(b) FROM TABLE WHERE DT = '20230101';
+
+SELECT * FROM TABLE ORDER BY a LIMIT 1;
+```
+
+Min/max and topN queries can also be accelerated during compilation and return very quickly.
+
+## Data Skipping By Order
+
+Paimon by default records the maximum and minimum values of each field in the manifest file.
+
+At query time, according to the `WHERE` condition of the query, together with the statistics in the manifest, we can
+perform file filtering. If the filtering effect is good, a query that would have cost minutes will be accelerated to
+milliseconds.
+
+The data distribution is not always ideal for filtering, so can we sort the data by the fields in the `WHERE` condition?
+You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}),
+[Flink COMPACT Procedure]({{< ref "flink/procedures" >}}) or [Spark COMPACT Procedure]({{< ref "spark/procedures" >}}).
+
+## Data Skipping By File Index
+
+You can also use a file index; it filters files by indexing on the reading side.
+
+Define `file-index.bitmap.columns` to enable it. A data file index is an external index file, and Paimon will create a
+corresponding index file for each data file. If the index file is too small, it will be stored directly in the manifest,
+otherwise in the directory of the data file.
Each data file corresponds to an index file, which has a separate file
+definition and can contain different types of indexes with multiple columns.
+
+Different file indexes may be efficient in different scenarios. For example, a bloom filter may speed up queries in
+point-lookup scenarios. Using a bitmap may consume more space but can result in greater accuracy.
+
+* [BloomFilter]({{< ref "concepts/spec/fileindex#index-bloomfilter" >}}): `file-index.bloom-filter.columns`.
+* [Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}): `file-index.bitmap.columns`.
+* [Range Bitmap]({{< ref "concepts/spec/fileindex#index-range-bitmap" >}}): `file-index.range-bitmap.columns`.
+
+If you want to add a file index to an existing table without rewriting data, you can use the `rewrite_file_index`
+procedure. Before using the procedure, you should set the appropriate options on the target table: use an ALTER TABLE
+clause to configure `file-index.<filter-type>.columns` for the table.
+
+How to invoke: see [Flink procedures]({{< ref "flink/procedures#procedures" >}}).
+
+## Row Level Operations
+
+Currently, only Spark SQL supports DELETE & UPDATE & MERGE INTO; you can take a look at [Spark Write]({{< ref "spark/sql-write" >}}).
+
+Example:
+```sql
+DELETE FROM my_table WHERE currency = 'UNKNOWN';
+```
+
+Updating an append table has two modes:
+
+1. COW (Copy on Write): search for the hit files and then rewrite each file to remove the data that needs to be deleted
+   from the files. This operation is costly.
+2. MOW (Merge on Write): by specifying `'deletion-vectors.enabled' = 'true'`, the Deletion Vectors mode can be enabled.
+   It only marks certain records of the corresponding file for deletion and writes a deletion file, without rewriting the entire file.
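The MOW mode in item 2 can be sketched as follows. This is a hedged sketch, assuming `my_table` has a `currency` column as in the example above and uses Spark SQL's `SET TBLPROPERTIES` syntax; whether deletion vectors can be enabled on an already-existing table may depend on the engine and table state:

```sql
-- Enable Deletion Vectors (MOW) so row-level deletes write small
-- deletion files instead of rewriting whole data files.
ALTER TABLE my_table SET TBLPROPERTIES ('deletion-vectors.enabled' = 'true');

DELETE FROM my_table WHERE currency = 'UNKNOWN';
```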
diff --git a/docs/content/append-table/query-performance.md b/docs/content/append-table/query-performance.md deleted file mode 100644 index aad62636e5..0000000000 --- a/docs/content/append-table/query-performance.md +++ /dev/null @@ -1,102 +0,0 @@ ---- -title: "Query Performance" -weight: 3 -type: docs -aliases: -- /append-table/query-performance.html ---- -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# Query Performance - -## Aggregate push down - -Append Table supports aggregate push down: - -```sql -SELECT COUNT(*) FROM TABLE WHERE DT = '20230101'; -``` - -This query can be accelerated during compilation and returns very quickly. - -For Spark SQL, table with default `metadata.stats-mode` can be accelerated: - -```sql -SELECT MIN(a), MAX(b) FROM TABLE WHERE DT = '20230101'; - -SELECT * FROM TABLE ORDER BY a LIMIT 1; -``` - -Min max topN query can be also accelerated during compilation and returns very quickly. - -## Data Skipping By Order - -Paimon by default records the maximum and minimum values of each field in the manifest file. - -In the query, according to the `WHERE` condition of the query, together with the statistics in the manifest we can -perform file filtering. 
If the filtering effect is good, the query that would have cost minutes will be accelerated to -milliseconds to complete the execution. - -Often the data distribution is not always ideal for filtering, so can we sort the data by the field in `WHERE` condition? -You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}), -[Flink COMPACT Procedure]({{< ref "flink/procedures" >}}) or [Spark COMPACT Procedure]({{< ref "spark/procedures" >}}). - -## Data Skipping By File Index - -You can use file index too, it filters files by indexing on the reading side. - -Define `file-index.bitmap.columns`, Data file index is an external index file and Paimon will create its -corresponding index file for each file. If the index file is too small, it will be stored directly in the manifest, -otherwise in the directory of the data file. Each data file corresponds to an index file, which has a separate file -definition and can contain different types of indexes with multiple columns. - -Different file indexes may be efficient in different scenarios. For example bloom filter may speed up query in point lookup -scenario. Using a bitmap may consume more space but can result in greater accuracy. - -* [BloomFilter]({{< ref "concepts/spec/fileindex#index-bloomfilter" >}}): `file-index.bloom-filter.columns`. -* [Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}): `file-index.bitmap.columns`. -* [Range Bitmap]({{< ref "concepts/spec/fileindex#index-range-bitmap" >}}): `file-index.range-bitmap.columns`. - -If you want to add file index to existing table, without any rewrite, you can use `rewrite_file_index` procedure. Before -we use the procedure, you should config appropriate configurations in target table. You can use ALTER clause to config -`file-index.<filter-type>.columns` to the table. 
- -How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}}) - -## Bucketed Join - -Bucketed table can be used to avoid shuffle if necessary in batch query, for example, you can use the following Spark -SQL to read a Paimon table: - -```sql -SET spark.sql.sources.v2.bucketing.enabled = true; - -CREATE TABLE FACT_TABLE (order_id INT, f1 STRING) TBLPROPERTIES ('bucket'='10', 'bucket-key' = 'order_id'); - -CREATE TABLE DIM_TABLE (order_id INT, f2 STRING) TBLPROPERTIES ('bucket'='10', 'primary-key' = 'order_id'); - -SELECT * FROM FACT_TABLE JOIN DIM_TABLE on t1.order_id = t4.order_id; -``` - -The `spark.sql.sources.v2.bucketing.enabled` config is used to enable bucketing for V2 data sources. When turned on, -Spark will recognize the specific distribution reported by a V2 data source through SupportsReportPartitioning, and -will try to avoid shuffle if necessary. - -The costly join shuffle will be avoided if two tables have the same bucketing strategy and same number of buckets. diff --git a/docs/content/append-table/streaming.md b/docs/content/append-table/streaming.md deleted file mode 100644 index 49f44a7f13..0000000000 --- a/docs/content/append-table/streaming.md +++ /dev/null @@ -1,104 +0,0 @@ ---- -title: "Streaming" -weight: 2 -type: docs -aliases: -- /append-table/streaming.html ---- -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. 
You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# Streaming - -You can stream write to the Append table in a very flexible way through Flink, or read the Append table through -Flink, using it like a queue. The only difference is that its latency is in minutes. Its advantages are very low cost -and the ability to push down filters and projection. - -## Pre small files merging - -"Pre" means that this compact occurs before committing files to the snapshot. - -If Flink's checkpoint interval is short (for example, 30 seconds), each snapshot may produce lots of small changelog -files. Too many files may put a burden on the distributed storage cluster. - -In order to compact small changelog files into large ones, you can set the table option `precommit-compact = true`. -Default value of this option is false, if true, it will add a compact coordinator and worker operator after the writer -operator, which copies changelog files into large ones. - -## Post small files merging - -"Post" means that this compact occurs after committing files to the snapshot. - -In streaming write job, without bucket definition, there is no compaction in writer, instead, will use -`Compact Coordinator` to scan the small files and pass compaction task to `Compact Worker`. In streaming mode, if you -run insert sql in flink, the topology will be like this: - -{{< img src="/img/unaware-bucket-topo.png">}} - -Do not worry about backpressure, compaction never backpressure. - -If you set `write-only` to true, the `Compact Coordinator` and `Compact Worker` will be removed in the topology. 
- -The auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in Flink by -Flink action in Paimon and disable all the other compactions by setting `write-only`. - -The following options control the strategy of compaction: - -<table class="configuration table table-bordered"> - <thead> - <tr> - <th class="text-left" style="width: 20%">Key</th> - <th class="text-left" style="width: 15%">Default</th> - <th class="text-left" style="width: 10%">Type</th> - <th class="text-left" style="width: 55%">Description</th> - </tr> - </thead> - <tbody> - <tr> - <td><h5>write-only</h5></td> - <td style="word-wrap: break-word;">false</td> - <td>Boolean</td> - <td>If set to true, compactions and snapshot expiration will be skipped. This option is used along with dedicated compact jobs.</td> - </tr> - <tr> - <td><h5>compaction.min.file-num</h5></td> - <td style="word-wrap: break-word;">5</td> - <td>Integer</td> - <td>For file set [f_0,...,f_N], the minimum file number to trigger a compaction for append table.</td> - </tr> - <tr> - <td><h5>compaction.delete-ratio-threshold</h5></td> - <td style="word-wrap: break-word;">(none)</td> - <td>Double</td> - <td>Ratio of the deleted rows in a data file to be forced compacted.</td> - </tr> - </tbody> -</table> - -## Streaming Query - -You can stream the Append table and use it like a Message Queue. As with primary key tables, there are two options -for streaming reads: -1. By default, Streaming read produces the latest snapshot on the table upon first startup, and continue to read the - latest incremental records. -2. You can specify `scan.mode`, `scan.snapshot-id`, `scan.timestamp-millis` and/or `scan.file-creation-time-millis` to - stream read incremental only. 
- -Similar to flink-kafka, order is not guaranteed by default, if your data has some sort of order requirement, you also -need to consider defining a `bucket-key`, see [Bucketed Append]({{< ref "append-table/bucketed" >}}) diff --git a/docs/content/append-table/update.md b/docs/content/append-table/update.md deleted file mode 100644 index 5e373cbd69..0000000000 --- a/docs/content/append-table/update.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -title: "Update" -weight: 4 -type: docs -aliases: -- /append-table/update.html ---- -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - -# Update - -Now, only Spark SQL supports DELETE & UPDATE, you can take a look at [Spark Write]({{< ref "spark/sql-write" >}}). - -Example: -```sql -DELETE FROM my_table WHERE currency = 'UNKNOWN'; -``` - -Update append table has two modes: - -1. COW (Copy on Write): search for the hit files and then rewrite each file to remove the data that needs to be deleted - from the files. This operation is costly. -2. MOW (Merge on Write): By specifying `'deletion-vectors.enabled' = 'true'`, the Deletion Vectors mode can be enabled. - Only marks certain records of the corresponding file for deletion and writes the deletion file, without rewriting the entire file. 
diff --git a/docs/content/learn-paimon/understand-files.md b/docs/content/learn-paimon/understand-files.md index a8a9ac52ce..58fed51b84 100644 --- a/docs/content/learn-paimon/understand-files.md +++ b/docs/content/learn-paimon/understand-files.md @@ -488,7 +488,7 @@ this means that there are at least 5 files in a bucket. If you want to reduce th By default, Append also does automatic compaction to reduce the number of small files. However, for Bucketed Append table, it will only compact the files within the Bucket for sequential -purposes, which may keep more small files. See [Bucketed Append]({{< ref "append-table/streaming#bucketed-append" >}}). +purposes, which may keep more small files. See [Bucketed Append]({{< ref "append-table/bucketed" >}}). ### Understand Full-Compaction
