Udit Mehrotra created HUDI-1054:
-----------------------------------
Summary: Address performance issues with finalizing writes on S3
Key: HUDI-1054
URL: https://issues.apache.org/jira/browse/HUDI-1054
Project: Apache Hudi
Issue Type: Sub-task
Components: bootstrap, Common Core, Performance
Reporter: Udit Mehrotra
Assignee: Udit Mehrotra
I have identified 3 performance bottlenecks in the
[finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
function that become more prominent with the new
bootstrap mechanism on S3:
*
[https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
is a serial operation performed at the driver, and it can take a long time
when there are many partitions and a large number of files.
* The invalid data paths are stored in a List instead of a Set. As a
result, the following operation becomes O(N^2) and takes significant time to
compute at the driver:
[https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
*
[https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
does a recursive delete of the marker directory at the driver. This is again
extremely expensive when there are a large number of partitions and files.
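
To illustrate the List vs Set point above, here is a minimal sketch. It assumes the costly step is a set difference (all data paths found via markers minus the valid data paths), which is my reading of the linked line and not something confirmed here. The class and method names are hypothetical and are not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class InvalidPathFilter {

    // Hypothetical helper: removeAll calls validPaths.contains(...) once per
    // element. When validPaths is a List, each call is a linear scan, so the
    // total cost is O(N * M).
    static List<String> invalidPathsWithList(List<String> allPaths, List<String> validPaths) {
        List<String> invalid = new ArrayList<>(allPaths);
        invalid.removeAll(validPaths);
        return invalid;
    }

    // Same result, but the lookup side is a HashSet, so each contains(...)
    // is expected O(1) and the total cost is roughly O(N + M).
    static List<String> invalidPathsWithSet(List<String> allPaths, List<String> validPaths) {
        List<String> invalid = new ArrayList<>(allPaths);
        invalid.removeAll(new HashSet<>(validPaths));
        return invalid;
    }

    public static void main(String[] args) {
        // Illustrative file names only.
        List<String> all = Arrays.asList("p1/a.parquet", "p1/b.parquet", "p2/c.parquet");
        List<String> valid = Arrays.asList("p1/a.parquet", "p2/c.parquet");
        System.out.println(invalidPathsWithList(all, valid));
        System.out.println(invalidPathsWithSet(all, valid));
    }
}
```

Both methods return the same paths in the same order; only the lookup cost differs, and the difference is what matters at the file counts described in this issue.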
In a test with a 1 TB data set of 8000 partitions and approximately
190000 files, this whole process took *35 minutes*. There is scope to
address these performance issues by parallelizing the work with Spark and
using appropriate data structures.
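
As a rough sketch of the parallelization idea, the example below deletes each partition's marker subtree concurrently instead of walking the whole marker directory in one serial pass. It is a stand-in only: it uses plain Java parallel streams on the local filesystem rather than Spark and S3, and the class name, method names, and directory layout are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelMarkerCleanup {

    // Deletes one subtree, visiting children before their parent directories.
    static void deleteTree(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the top-level (per-partition) entries once, then deletes each
    // subtree in parallel. In a Spark job, the list of partitions would be
    // distributed to executors instead of a local parallel stream.
    static void deleteMarkerDir(Path markerDir) throws IOException {
        List<Path> partitions;
        try (Stream<Path> children = Files.list(markerDir)) {
            partitions = children.collect(Collectors.toList());
        }
        partitions.parallelStream().forEach(ParallelMarkerCleanup::deleteTree);
        Files.delete(markerDir);
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway marker directory with four partitions.
        Path markerDir = Files.createTempDirectory("markers");
        for (int i = 0; i < 4; i++) {
            Path partition = Files.createDirectory(markerDir.resolve("partition-" + i));
            Files.createFile(partition.resolve("file-" + i + ".marker"));
        }
        deleteMarkerDir(markerDir);
        System.out.println("marker dir still exists: " + Files.exists(markerDir));
    }
}
```

The same split (list partitions once, then fan out the per-partition work) would also apply to the serial listing step in the first bottleneck.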
--
This message was sent by Atlassian Jira
(v8.3.4#803005)