(hudi) branch master updated: [MINOR] Add a brief abstract to RFC-66 (#10386)

danny0405 Wed, 20 Dec 2023 21:38:23 -0800

This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/master by this push:
     new b8211b00bce [MINOR] Add a brief abstract to RFC-66 (#10386)
b8211b00bce is described below

commit b8211b00bce92e189297224d2b0e2361620c5bde
Author: Lin Liu <[email protected]>
AuthorDate: Wed Dec 20 21:37:21 2023 -0800

    [MINOR] Add a brief abstract to RFC-66 (#10386)
---
 rfc/rfc-66/rfc-66.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/rfc/rfc-66/rfc-66.md b/rfc/rfc-66/rfc-66.md
index d3754ca15af..8ac57f01e4d 100644
--- a/rfc/rfc-66/rfc-66.md
+++ b/rfc/rfc-66/rfc-66.md
@@ -13,12 +13,18 @@ JIRA: [Lockless multi writer support](https://issues.apache.org/jira/browse/HUDI
 
 ## Abstract
 
+Apache Hudi community has seen instant time based Optimistic Concurrency Control (OCC) could cause serious write-write conflicts for important
+application scenarios, e.g., multi-writer ingestion. In this RFC, we propose a completion time based Non-Blocking Concurrency Control (NBCC) scheme,
+which not only brings more practical semantics to various Hudi operations, like `FileSlice` ordering, but also much higher performance
+for multi-writer operations, like frequent writes with non-blocked compactions.
+
 ## Background
-As you know, Hudi already supports basic OCC with abundant lock providers.
-However, for multi-writer streaming ingestion, the OCC does not work well because conflicts would happen very frequently.
-For hashing index, all the writers utilize a deterministic hashing algorithm on primary keys to distribute records.
-In normal cases, these keys are evenly distributed into all data buckets. That means, in a single data flushing, one writer could append to
-all the data buckets, and conflicts happen when there multiple such writers.
+Apache Hudi already supports basic OCC with a number of lock providers.
+However, in the scenario of multi-writer streaming ingestion, our OCC does not work well because
+write-write conflicts would happen frequently.
+For hashing index, writers utilize a deterministic hashing algorithm based on primary keys to distribute records.
+In normal cases, these keys are evenly distributed among all data buckets. That means, in a single data flushing,
+one writer could append to all data buckets, and conflicts could happen when there are multiple such writers.
 For bloom filter index, the situation is slightly different. We write into the **small** bucket in higher priority using a small-file-load-rebalancing strategy,
 such that multiple writers is prone to write into the same **small** buckets at the same time, which causes conflicts.
 Therefore, OCC does not work well for multiple streaming writers ingestion. In this RFC, we propose a non-blocking solution for streaming ingestion.
@@ -26,7 +32,7 @@ Therefore, OCC does not work well for multiple streaming writers ingestion. In t
 Streaming jobs are suitable for data ingestion since it does not need complex pipeline orchestration and has a smother write workload.
 Most of the raw data set we are handling today are generated constantly in streaming way.
 
-In multi-writer ingestion, several streaming events with the same schema sink into one Hudi table, such that the Hudi table becomes 
+In multi-writer streaming ingestion, several streaming events with the same schema sink into one Hudi table, such that the table becomes
 a UNION table view for all input data set. This is a common use case in reality since the data could come from various data sources.
 
 Another important use case we want to unlock is the real-time data set join. One of the serious pain points in streaming computation is the dataset join.
