shangxinli commented on code in PR #18405:
URL: https://github.com/apache/hudi/pull/18405#discussion_r3035735908


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -872,6 +873,10 @@ private Pair<Option<String>, JavaRDD<WriteStatus>> writeToSinkAndDoMetaSync(Hood
           totalSuccessfulRecords);
       String commitActionType = CommitUtils.getCommitActionType(cfg.operation, HoodieTableType.valueOf(cfg.tableType));
 
+      // Run pre-commit streaming offset validators (if configured) before commit
+      SparkStreamerValidatorUtils.runValidators(props, instantTime, writeStatusRDD,
+          checkpointCommitMetadata, metaClient);

Review Comment:
   Good point. Addressed with the "log the error count" approach:
   
     1. Added getTotalWriteErrors() as a default method on ValidationContext. It sums HoodieWriteStat.totalWriteErrors across all write stats.
     2. Updated StreamingOffsetValidator.validateOffsetConsistency() to include the write error count in the message, with different wording for the two cases:
        - writeErrors > 0 → "Non-zero write errors suggest records failed to write rather than silent data loss."
        - writeErrors == 0 → "This may indicate data loss or filtering."

   This way users can immediately distinguish a write-failure deviation (case b, which commitOnErrors was designed for) from silent data loss (case a) without needing to dig into logs.
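
For readers following along without the PR open, here is a minimal, self-contained sketch of the two changes described above. It is not the code from the PR: `WriteStatSketch`, `ValidationContextSketch`, `OffsetMessageSketch`, and `describeDeviation` are hypothetical stand-ins, and the leading part of the message is an assumed wording. Only the two hint sentences are quoted from the comment.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for HoodieWriteStat; only the error counter is modeled.
final class WriteStatSketch {
  private final long totalWriteErrors;

  WriteStatSketch(long totalWriteErrors) {
    this.totalWriteErrors = totalWriteErrors;
  }

  long getTotalWriteErrors() {
    return totalWriteErrors;
  }
}

// Hypothetical stand-in for ValidationContext.
interface ValidationContextSketch {
  List<WriteStatSketch> getWriteStats();

  // Default method: sums the per-file write error counts across all write stats.
  default long getTotalWriteErrors() {
    long total = 0L;
    for (WriteStatSketch stat : getWriteStats()) {
      total += stat.getTotalWriteErrors();
    }
    return total;
  }
}

final class OffsetMessageSketch {
  private OffsetMessageSketch() {
  }

  // Builds a deviation message whose hint depends on whether any writes failed.
  // The prefix wording is an assumption; the two hints are quoted from the review comment.
  static String describeDeviation(long offsetDiff, long recordsWritten, long writeErrors) {
    String prefix = String.format(
        "Offset difference %d does not match records written %d (write errors: %d). ",
        offsetDiff, recordsWritten, writeErrors);
    String hint = writeErrors > 0
        ? "Non-zero write errors suggest records failed to write rather than silent data loss."
        : "This may indicate data loss or filtering.";
    return prefix + hint;
  }
}
```

Putting the sum on the context as a default method means any validator can read the error count without each context implementation having to add it separately, which appears to be the motivation for change 1.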


