yihua commented on a change in pull request #4107:
URL: https://github.com/apache/hudi/pull/4107#discussion_r757293491



##########
File path: website/docs/markers.md
##########
@@ -0,0 +1,90 @@
+---
+title: Write Markers
+toc: true
+---
+
+## Purpose of Markers
+A write operation can fail before it completes, leaving partial or corrupt data files on storage. Markers are used to track 
+and cleanup any partial or failed write operations. As a write operation begins, a marker is created indicating 
+that a file write is in progress. When the write commit succeeds, the marker is deleted. If a write operation fails part 
+way through, a marker is left behind which indicates that the file is incomplete. Two important operations that use markers include: 
+
+- **Removing duplicate/partial data files**: 
+  - in Spark, the Hudi write client delegates the data file writing to multiple executors. One executor can fail the task, 
+  leaving partial data files written, and Spark retries the task in this case until it succeeds. 
+  - When speculative execution is enabled, there can also be multiple successful attempts at writing out the same data 
+  into different files, only one of which is finally handed to the Spark driver process for committing. 
+  The markers help efficiently identify the partial data files written, which contain duplicate data compared to the data 
+  files written by the successful trial later, and these duplicate data files are cleaned up when the commit is finalized.
+- **Rolling back failed commits**: If a write operation fails, the next write client will roll back the failed commit before proceeding with the new write. The rollback is done with the help of markers to identify the data files written as part of the failed commit.
+
+If we did not have markers to track the per-commit data files, we would have to list all files in the file system, 
+correlate that with the files seen in timeline and then delete the ones that belong to partial write failures. 
+As you could imagine, this would be very costly in a very large installation of a datalake.
+
+## Marker structure
+Each marker entry is composed of three parts, the data file name,
+the marker extension (`.marker`), and the I/O operation created the file (`CREATE` - inserts, `MERGE` - updates/deletes, 
+or `APPEND` - either). For example, the marker `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet.marker.CREATE` indicates
+that the corresponding data file is `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet` and the I/O type is `CREATE`.
+
+## Marker Writing Options
+There are two ways to configure Marker write operations. 
+
+- Directly writing markers to storage, which is a legacy configuration.
+- Writing markers to the Timeline Server (Default), improves write performance of large files by batching marker requests.

Review comment:
       `Writing markers to the Timeline Server`
   -> This might be confusing since the markers are still written to storage, through the timeline server as a proxy. The executors delegate the marker creation to the timeline server, which acts as a single vantage point to coordinate the actual writing.
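
To make the marker layout from the quoted `markers.md` section concrete, here is a minimal, hypothetical Java sketch (not a Hudi API) that splits a marker entry into the three parts described there: the data file name, the `.marker` extension, and the I/O type:

```java
// Hypothetical helper for illustration only; Hudi's own marker handling lives elsewhere.
public class MarkerNameParser {
    public static void main(String[] args) {
        String marker =
            "91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet.marker.CREATE";
        int extIdx = marker.lastIndexOf(".marker.");
        // Everything before ".marker." is the in-progress data file name.
        String dataFile = marker.substring(0, extIdx);
        // Everything after ".marker." is the I/O type: CREATE, MERGE, or APPEND.
        String ioType = marker.substring(extIdx + ".marker.".length());
        System.out.println(dataFile + " -> " + ioType);
    }
}
```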

##########
File path: website/docs/configurations.md
##########
@@ -1151,8 +1151,8 @@ Configurations that control write behavior on Hudi tables. These can be directly
 ---
 
 > #### hoodie.write.markers.type
-> Marker type to use.  Two modes are supported: - DIRECT: individual marker file corresponding to each data file is directly created by the writer. - TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency.<br></br>
-> **Default Value**: DIRECT (Optional)<br></br>
+> Marker type to use.  Two modes are supported: - DIRECT: individual marker file corresponding to each data file is directly created by the writer. - TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service which serves as a proxy.  New marker entries are batch processed and stored in a limited number of underlying files for efficiency. Note: timeline based markers are not yet supported for HDFS <br></br>
+> **Default Value**: TIMELINE_SERVER_BASED (Optional)<br></br>

Review comment:
       I think this is automatically populated from the code.  @nsivabalan @bhasudha do you know how this can be updated automatically after #4112 is landed?
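
For reference, a hedged sketch of how a writer could set this config on a Spark write; the table name and base path below are placeholders, while the key `hoodie.write.markers.type` and its `DIRECT`/`TIMELINE_SERVER_BASED` values come from the config entry quoted above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MarkerTypeWriteExample {
    public static void write(Dataset<Row> df) {
        df.write().format("hudi")
          .option("hoodie.table.name", "my_table")       // placeholder table name
          // Overriding the TIMELINE_SERVER_BASED default, e.g. on HDFS,
          // where timeline-based markers are not yet supported per the note above.
          .option("hoodie.write.markers.type", "DIRECT")
          .mode(SaveMode.Append)
          .save("/tmp/hudi/my_table");                   // placeholder base path
    }
}
```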

##########
File path: website/docs/markers.md
##########
@@ -0,0 +1,90 @@
+---
+title: Write Markers
+toc: true
+---
+
+## Purpose of Markers
+A write operation can fail before it completes, leaving partial or corrupt data files on storage. Markers are used to track 
+and cleanup any partial or failed write operations. As a write operation begins, a marker is created indicating 
+that a file write is in progress. When the write commit succeeds, the marker is deleted. If a write operation fails part 
+way through, a marker is left behind which indicates that the file is incomplete. Two important operations that use markers include: 
+
+- **Removing duplicate/partial data files**: 
+  - in Spark, the Hudi write client delegates the data file writing to multiple executors. One executor can fail the task, 

Review comment:
       nit: capitalize `In`?

##########
File path: website/docs/markers.md
##########
@@ -0,0 +1,90 @@
+---
+title: Write Markers
+toc: true
+---
+
+## Purpose of Markers
+A write operation can fail before it completes, leaving partial or corrupt data files on storage. Markers are used to track 
+and cleanup any partial or failed write operations. As a write operation begins, a marker is created indicating 
+that a file write is in progress. When the write commit succeeds, the marker is deleted. If a write operation fails part 
+way through, a marker is left behind which indicates that the file is incomplete. Two important operations that use markers include: 
+
+- **Removing duplicate/partial data files**: 
+  - in Spark, the Hudi write client delegates the data file writing to multiple executors. One executor can fail the task, 
+  leaving partial data files written, and Spark retries the task in this case until it succeeds. 
+  - When speculative execution is enabled, there can also be multiple successful attempts at writing out the same data 
+  into different files, only one of which is finally handed to the Spark driver process for committing. 
+  The markers help efficiently identify the partial data files written, which contain duplicate data compared to the data 
+  files written by the successful trial later, and these duplicate data files are cleaned up when the commit is finalized.
+- **Rolling back failed commits**: If a write operation fails, the next write client will roll back the failed commit before proceeding with the new write. The rollback is done with the help of markers to identify the data files written as part of the failed commit.
+
+If we did not have markers to track the per-commit data files, we would have to list all files in the file system, 
+correlate that with the files seen in timeline and then delete the ones that belong to partial write failures. 
+As you could imagine, this would be very costly in a very large installation of a datalake.
+
+## Marker structure
+Each marker entry is composed of three parts, the data file name,
+the marker extension (`.marker`), and the I/O operation created the file (`CREATE` - inserts, `MERGE` - updates/deletes, 
+or `APPEND` - either). For example, the marker `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet.marker.CREATE` indicates
+that the corresponding data file is `91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet` and the I/O type is `CREATE`.
+
+## Marker Writing Options
+There are two ways to configure Marker write operations. 

Review comment:
       This is more like different options for the same marker type configuration (`hoodie.write.markers.type`), either direct or timeline-server-based, rather than `two ways to configure`.
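
A minimal illustration of this point, that there is a single marker-type knob with exactly two allowed values (the enum below is assumed for illustration and is not necessarily Hudi's own type):

```java
// Assumed enum for illustration: hoodie.write.markers.type selects one of two modes.
enum MarkerType {
    DIRECT,                // each writer creates its marker file on storage itself
    TIMELINE_SERVER_BASED  // the timeline server proxies and batches marker creation
}

class MarkerTypeDemo {
    public static void main(String[] args) {
        // The configured string value maps onto one of the two modes.
        MarkerType type = MarkerType.valueOf("TIMELINE_SERVER_BASED");
        System.out.println("Marker writing mode: " + type);
    }
}
```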




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
