yihua commented on code in PR #8022:
URL: https://github.com/apache/hudi/pull/8022#discussion_r1117629977


##########
website/releases/release-0.13.0.md:
##########
@@ -0,0 +1,506 @@
+---
+title: "Release 0.13.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-02-22T13:00:00-08:00
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# [Release 0.13.0](https://github.com/apache/hudi/releases/tag/release-0.13.0) 
([docs](/docs/quick-start-guide))
+
+Apache Hudi 0.13.0 introduces a number of new features, including [Metaserver](#metaserver),
+[Change Data Capture](#change-data-capture), the [new Record Merge API](#optimizing-record-payload-handling),
+[new sources for Deltastreamer](#new-source-support-in-deltastreamer), and more.  While no table version upgrade is
+required for this release, users are expected to take action by following the [Migration Guide](#migration-guide-overview)
+below on the relevant [breaking changes](#migration-guide-breaking-changes) and
+[behavior changes](#migration-guide-behavior-changes) before using the 0.13.0 release.
+
+## Migration Guide: Overview
+
+This release keeps the same table version (`5`) as [0.12.0 
release](/releases/release-0.12.0), and there is no need for
+a table version upgrade if you are upgrading from 0.12.0.  There are a few
+[breaking changes](#migration-guide-breaking-changes) and [behavior 
changes](#migration-guide-behavior-changes) as
+described below, and users are expected to take action accordingly before using the 0.13.0 release.
+
+:::caution
+If migrating from an older release (pre 0.12.0), please also check the upgrade 
instructions from each older release in
+sequence.
+:::
+
+## Migration Guide: Breaking Changes
+
+### Bundle Updates
+
+#### Spark bundle Support
+
+From now on, 
[`hudi-spark3.2-bundle`](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.2-bundle)
 works
+with Apache Spark 3.2.1 and newer versions for Spark 3.2.x.  The support for 
Spark 3.2.0 with
+[`hudi-spark3.2-bundle`](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark3.2-bundle)
 is
+dropped because the implementation of the `getHive` method of `HiveClientImpl` changed in Spark and is incompatible
+between Spark versions 3.2.0 and 3.2.1.
+
+#### Utilities Bundle Change
+
+The AWS and GCP bundle jars are separated from
+[`hudi-utilities-bundle`](https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle).
 Users need
+to use 
[**`hudi-aws-bundle`**](https://mvnrepository.com/artifact/org.apache.hudi/hudi-aws-bundle)
 or
+[**`hudi-gcp-bundle`**](https://mvnrepository.com/artifact/org.apache.hudi/hudi-gcp-bundle)
 along with
+[`hudi-utilities-bundle`](https://mvnrepository.com/artifact/org.apache.hudi/hudi-utilities-bundle)
 when using the corresponding
+cloud services.
+
+#### New Flink Bundle
+
+Hudi is now supported on Flink 1.16.x with the new
+[`hudi-flink1.16-bundle`](https://mvnrepository.com/artifact/org.apache.hudi/hudi-flink1.16-bundle).
+
+### Lazy File Index in Spark
+
+Hudi's File Index in Spark now lists files lazily ***by default***: it **only** lists the partitions that are requested
+by the query (i.e., after partition pruning), as opposed to always listing the whole table as it did before this
+release. This is expected to bring considerable performance improvements for large tables.
+
+A new configuration property is added if the user wants to change the listing 
behavior:
+`hoodie.datasource.read.file.index.listing.mode` (now defaulting to **`lazy`**). 
There are two possible values that you can
+set:
+
+- **`eager`**: This lists all partition paths and corresponding file slices 
within them eagerly, during initialization. 
This is the default behavior prior to 0.13.0.
+  - If a Hudi table has 1000 partitions, the eager mode lists the files under 
all of them when constructing the file index.  
+
+- **`lazy`**: The partitions and file-slices within them will be listed 
lazily, allowing partition pruning predicates to
+be pushed down appropriately, therefore only listing partitions after these 
have already been pruned.
+  - The files are not listed under the partitions when the File Index is 
initialized. The files are listed only under
+    targeted partition(s) after partition pruning using predicates (e.g., 
`datestr=2023-02-19`) in queries.
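+
+The difference between the two modes can be sketched with a toy model (illustrative Python only; the class, method
+names, and partition layout below are invented for the sketch and are not Hudi's actual implementation):

```python
# Toy model of eager vs. lazy partition listing (illustrative only; class and
# method names are invented, this is not Hudi's implementation).
class ToyFileIndex:
    def __init__(self, storage, mode="lazy"):
        # storage: dict mapping partition path -> list of data files
        self._storage = storage
        self.listed = {}  # partitions whose files have actually been listed
        if mode == "eager":
            # eager: list every partition up front, during initialization
            self.listed = dict(storage)

    def files_for_query(self, keep_partition):
        # lazy: list files only under partitions that survive pruning
        for path, files in self._storage.items():
            if keep_partition(path) and path not in self.listed:
                self.listed[path] = files
        return [f for p, fs in self.listed.items() if keep_partition(p) for f in fs]

# A table with 28 daily partitions:
parts = {f"datestr=2023-02-{d:02d}": [f"file_{d}.parquet"] for d in range(1, 29)}

eager = ToyFileIndex(parts, mode="eager")
assert len(eager.listed) == 28  # everything listed at construction time

lazy = ToyFileIndex(parts, mode="lazy")
assert len(lazy.listed) == 0    # nothing listed yet
lazy.files_for_query(lambda p: p == "datestr=2023-02-19")
assert len(lazy.listed) == 1    # only the pruned-to partition was listed
```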
+
+:::tip
+To preserve the behavior pre 0.13.0, the user needs to set 
`hoodie.datasource.read.file.index.listing.mode=eager`.
+:::
+
+:::danger Breaking Change
+The **breaking change** occurs only in cases when the table has **BOTH**: 
multiple partition columns AND partition
+values contain slashes that are not URL-encoded.
+:::
+
+For example, let's assume we want to parse two partition columns, `month` (`2022/01`) and `day` (`03`), from the
+partition path `2022/01/03`. Since there is a mismatch between the number of partition columns (2 here, `month` and
+`day`) and the number of components in the partition path delimited by `/` (3 here, `2022`, `01` and `03`), the
+parsing is ambiguous. In such cases, it is not possible to recover the partition value corresponding to each partition
+column.
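+
+A small sketch of why this parsing is ambiguous, and why URL-encoding the slash resolves it (plain Python, independent
+of Hudi):

```python
# Why un-encoded slashes in partition values are ambiguous (plain Python,
# independent of Hudi).
from urllib.parse import quote, unquote

partition_columns = ["month", "day"]

# Un-encoded: the slash inside the `month` value (`2022/01`) cannot be told
# apart from the separator between partition levels.
raw_path = "2022/01/03"
assert len(raw_path.split("/")) == 3       # 3 components...
assert len(partition_columns) == 2         # ...but only 2 partition columns

# URL-encoding the value removes the ambiguity: one component per column.
encoded_path = "/".join([quote("2022/01", safe=""), "03"])
components = encoded_path.split("/")
assert components == ["2022%2F01", "03"]
assert unquote(components[0]) == "2022/01"  # the original value is recoverable
```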
+
+There are two ways to **avoid** the breaking change:
+
+- The first option is to change how partition values are constructed. A user can switch the partition value of the
+`month` column to avoid slashes, such as `202201`; then there is no problem parsing the partition path (`202201/03`).  
+
+- The second option is to switch the listing mode to `eager`.  The File Index "gracefully regresses" to treating the
+table as non-partitioned: partition pruning is sacrificed, but the query can still be processed as if the table were
+non-partitioned (therefore potentially incurring a performance penalty), instead of failing.
+
+### Checkpoint Management in Spark Structured Streaming
+
+If you are using [Spark streaming](https://spark.apache.org/docs/3.3.2/structured-streaming-programming-guide.html) to
+ingest into Hudi, Hudi self-manages the checkpoint internally. We are now adding support for multiple writers, each
+ingesting into the same Hudi table via streaming ingestion. In older versions of Hudi, you could not have multiple
+streaming ingestion writers ingesting into the same Hudi table (one streaming ingestion writer with a concurrent Spark
+datasource writer works with a lock provider; however, two Spark streaming ingestion writers were not supported). With
+0.13.0, multiple streaming ingestion writers can write to the same table. In the case of a single streaming ingestion,
+users don't have to do anything; the old pipeline works without needing any additional changes. However, if you have
+multiple streaming writers to the same Hudi table, each writer has to set a unique value for the config
+`hoodie.datasource.write.streaming.checkpoint.identifier`. Also, users are expected to set the usual multi-writer
+configs. More details can be found [here](/docs/concurrency_control).
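+
+As a sketch, the per-writer options might look like the following (plain dicts standing in for `.option(...)` calls on
+a Spark `writeStream`; the table name and identifier values are hypothetical, and the usual multi-writer
+concurrency/lock configs are omitted here):

```python
# Sketch: two streaming writers into the same table must use distinct
# checkpoint identifiers. Plain dicts stand in for Spark writeStream
# .option(...) calls; "my_table" is a hypothetical name, and the usual
# multi-writer concurrency/lock configs are omitted.
common = {"hoodie.table.name": "my_table"}

ident_key = "hoodie.datasource.write.streaming.checkpoint.identifier"
writer_a = {**common, ident_key: "ingest-writer-1"}
writer_b = {**common, ident_key: "ingest-writer-2"}

# Each writer keeps its own checkpoint under its own identifier.
assert writer_a[ident_key] != writer_b[ident_key]
```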
+
+### ORC Support in Spark
+
+The [ORC](https://orc.apache.org/) support for Spark 2.x is removed in this release, as the `orc-core:nohive`
+dependency in Hudi is now replaced by `orc-core` to be compatible with Spark 3.  ORC support, which was broken in
+previous releases, is now available for Spark 3.x.
+
+### Mandatory Record Key Field
+
+The configuration for setting the record key field, 
`hoodie.datasource.write.recordkey.field`, is now required to be set
+and has no default value. Previously, the default value was `uuid`.
+
+## Migration Guide: Behavior Changes
+
+### Schema Handling in Write Path
+
+Many users have requested using Hudi for CDC use cases in which they want schema auto-evolution, where existing
+columns might be dropped in a new schema. As of the 0.13.0 release, Hudi now supports this: you can permit schema
+auto-evolution in which existing columns are dropped in a new schema.
+
+Since dropping columns in the target table based on the source schema 
constitutes a considerable behavior change, this
+is disabled by default and is guarded by the following config: 
`hoodie.datasource.write.schema.allow.auto.evolution.column.drop`.
+To enable automatic dropping of the columns along with new evolved schema of 
the incoming batch, set this to **`true`**.
+
+:::tip
+This config is **NOT** required to evolve schema manually by using, for 
example, `ALTER TABLE … DROP COLUMN` in Spark.
+:::
+
+### Removal of Default Shuffle Parallelism
+
+This release changes how Hudi decides the shuffle parallelism of [write 
operations](/docs/write_operations) including
+`INSERT`, `BULK_INSERT`, `UPSERT` and `DELETE` 
(**`hoodie.insert|bulkinsert|upsert|delete.shuffle.parallelism`**), which
+can ultimately affect the write performance.
+
+Previously, if users did not configure it, Hudi would use `200` as the default 
shuffle parallelism. From 0.13.0 onwards,
+Hudi by default automatically deduces the shuffle parallelism by either using 
the number of output RDD partitions as
+determined by Spark when available or by using the `spark.default.parallelism` 
value.  If the above Hudi shuffle
+parallelisms are explicitly configured by the user, then the user-configured 
parallelism is still used in defining the
+actual parallelism.  This behavior change improves the out-of-the-box performance by 20% for workloads with
+reasonably sized input.
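+
+The deduction order described above can be summarized in a small sketch (illustrative Python, not Hudi's actual code):

```python
# Illustrative sketch of the deduction order described above (not Hudi's
# actual code): an explicitly configured value wins; otherwise use the output
# RDD's partition count when Spark provides one; otherwise fall back to
# spark.default.parallelism.
def deduce_parallelism(user_configured, output_rdd_partitions, default_parallelism):
    if user_configured is not None:
        return user_configured
    if output_rdd_partitions is not None:
        return output_rdd_partitions
    return default_parallelism

assert deduce_parallelism(300, 64, 8) == 300   # explicit config still wins
assert deduce_parallelism(None, 64, 8) == 64   # deduced from the output RDD
assert deduce_parallelism(None, None, 8) == 8  # spark.default.parallelism
```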
+
+:::caution
+If the input data files are small, e.g., smaller than 10MB, we suggest 
configuring the Hudi shuffle parallelism
+(`hoodie.insert|bulkinsert|upsert|delete.shuffle.parallelism`) explicitly, 
such that the parallelism is at least
+total_input_data_size/500MB, to avoid potential performance regression (see 
[Tuning Guide](/docs/tuning-guide) for more
+information).
+:::
+
+### Simple Write Executor as Default
+
+For the execution of insert/upsert operations, Hudi historically used the notion of an executor, relying on an
+in-memory queue to decouple ingestion operations (that were previously often bound by I/O operations fetching shuffled
+blocks) from writing operations. Since then, Spark architectures have evolved considerably, making such a writing
+architecture redundant. To evolve this writing pattern and leverage the changes in Spark, in 0.13.0 we introduce a
+new, simplified version of the executor named (creatively) **`SimpleExecutor`**, and also make it the out-of-the-box
+default.
+
+The **`SimpleExecutor`** does not have any internal buffering (i.e., it does not hold records in memory); internally,
+it simply iterates over the provided iterator (similar to the default Spark behavior).  It provides **~10%**
+out-of-the-box performance improvement on modern Spark versions (3.x), and even more when used with Spark's native
+**`SparkRecordMerger`**.
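+
+The contrast between the two execution styles can be sketched as follows (illustrative Python, not Hudi's Java
+implementation; both functions are invented for the sketch):

```python
# Toy contrast of the two execution styles (illustrative, not Hudi's code).
import queue
import threading

def queue_based_execute(records, write):
    # Old style: a bounded in-memory queue decouples a producer (ingestion)
    # thread from the consumer (writing) loop.
    buf = queue.Queue(maxsize=16)
    SENTINEL = object()

    def produce():
        for r in records:
            buf.put(r)
        buf.put(SENTINEL)

    producer = threading.Thread(target=produce)
    producer.start()
    out = []
    while True:
        r = buf.get()
        if r is SENTINEL:
            break
        out.append(write(r))
    producer.join()
    return out

def simple_execute(records, write):
    # SimpleExecutor style: no internal buffering, plain iteration over the
    # provided iterator.
    return [write(r) for r in records]

assert simple_execute(range(5), lambda r: r * 2) == [0, 2, 4, 6, 8]
assert queue_based_execute(range(5), lambda r: r * 2) == [0, 2, 4, 6, 8]
```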
+
+### `NONE` Sort Mode for Bulk Insert to Match Parquet Writes
+
+This release adjusts the parallelism of the `NONE` sort mode (the default sort mode) for the `BULK_INSERT` write
+operation. From
+now on, by default, the input parallelism is used instead of the shuffle 
parallelism (`hoodie.bulkinsert.shuffle.parallelism`)
+for writing data, to match the default parquet write behavior. This does not 
change the behavior of clustering using the
+`NONE` sort mode.
+
+This behavior change for the `BULK_INSERT` write operation improves the write performance out of the box.
+
+:::tip
+If you still observe small file issues with the default `NONE` sort mode, we 
suggest sorting the input data based on the
+partition path and record key before writing to the Hudi table. You can also 
use `GLOBAL_SORT` to ensure the best file

Review Comment:
   Good catch!  Fixed.


