bhasudha commented on code in PR #9372:
URL: https://github.com/apache/hudi/pull/9372#discussion_r1306240349
##########
website/docs/concurrency_control.md:
##########
@@ -2,105 +2,126 @@
 title: "Concurrency Control"
 summary: In this page, we will discuss how to perform concurrent writes to Hudi Tables.
 toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
 last_modified_at: 2021-03-19T15:59:57-04:00
 ---
+Concurrency control defines how different writers/readers coordinate access to the table. Hudi ensures atomic writes, by way of publishing commits atomically to the timeline, stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general purpose file version control, Hudi draws clear distinction between writer processes (that issue user’s upserts/deletes), table services (that write data/metadata to optimize/perform bookkeeping) and readers (that execute queries and read data). Hudi provides snapshot isolation between all three types of processes, meaning they all operate on a consistent snapshot of the table. Hudi provides optimistic concurrency control (OCC) between writers, while providing lock-free, non-blocking MVCC based concurrency control between writers and table-services and between different table services.
-In this section, we will cover Hudi's concurrency model and describe ways to ingest data into a Hudi Table from multiple writers; using the [Hudi Streamer](#hudi-streamer) tool as well as
-using the [Hudi datasource](#datasource-writer).
+In this section, we will discuss the different concurrency controls supported by Hudi and how they are leveraged to provide flexible deployment models; we will cover multi-writing, a popular deployment model; finally, we’ll describe ways to ingest data into a Hudi Table from multiple writers using different writers, like DeltaStreamer, Hudi datasource, Spark Structured Streaming and Spark SQL.

-## Supported Concurrency Controls
-- **MVCC** : Hudi table services such as compaction, cleaning, clustering leverage Multi Version Concurrency Control to provide snapshot isolation
-between multiple table service writers and readers. Additionally, using MVCC, Hudi provides snapshot isolation between an ingestion writer and multiple concurrent readers.
- With this model, Hudi supports running any number of table service jobs concurrently, without any concurrency conflict.
- This is made possible by ensuring that scheduling plans of such table services always happens in a single writer mode to ensure no conflict and avoids race conditions.
+## Deployment models with supported concurrency controls
-- **[NEW] OPTIMISTIC CONCURRENCY** : Write operations such as the ones described above (UPSERT, INSERT) etc, leverage optimistic concurrency control to enable multiple ingestion writers to
-the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits (or writers) happening to the same table, if they do not have writes to overlapping files being changed, both writers are allowed to succeed.
- This feature is currently *experimental* and requires either Zookeeper or HiveMetastore to acquire locks.
+### Model A: Single writer with inline table services

Review Comment:
   Yeah, these are addressed in https://github.com/apache/hudi/pull/9372/files#diff-9b6f64b1c2c1b2e6b9165165ee949233d9df1f6bd3ce0771b6523f76c16d988cR20


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
