Rajesh Mahindra created HUDI-9649:
-------------------------------------

             Summary: Avoid DAG retriggers in StreamSync (DeltaStreamer)
                 Key: HUDI-9649
                 URL: https://issues.apache.org/jira/browse/HUDI-9649
             Project: Apache Hudi
          Issue Type: Improvement
          Components: deltastreamer
            Reporter: Rajesh Mahindra


In StreamSync, before calling commit, DAG gets triggered at 2 places:


long totalErrorRecords = 
writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
long totalRecords = 
writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();



Although we persist the write status, for large scale ingestion batches (that 
touch in the order of 10Ks of files), it can add 10-20mins to the commit time 
since the RDD of write status is large.

Implement callbacks or alike in the write client after it collects the write 
status (before commiting). Such callbacks can be used by consumers such as 
StreamSync to apply logic based on number of error records etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to