Rajesh Mahindra created HUDI-9649:
-------------------------------------
Summary: Avoid DAG retriggers in StreamSync (DeltaStreamer)
Key: HUDI-9649
URL: https://issues.apache.org/jira/browse/HUDI-9649
Project: Apache Hudi
Issue Type: Improvement
Components: deltastreamer
Reporter: Rajesh Mahindra
In StreamSync, before calling commit, DAG gets triggered at 2 places:
long totalErrorRecords =
writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
long totalRecords =
writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();
Although we persist the write status, for large scale ingestion batches (that
touch in the order of 10Ks of files), it can add 10-20mins to the commit time
since the RDD of write status is large.
Implement callbacks or alike in the write client after it collects the write
status (before commiting). Such callbacks can be used by consumers such as
StreamSync to apply logic based on number of error records etc.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)