umehrot2 commented on issue #1764: URL: https://github.com/apache/hudi/issues/1764#issuecomment-649914448
@vinothchandar @bvaradar Looking at the logic, we build the list of invalid data file paths to delete from the marker file paths. One possible cause, it seems to me, is that a marker file was created but the corresponding data file was never written by Spark, because the failure happened before the write. We would then be waiting for a file to appear that was never created in the first place. Do you think that's possible? I will also dig further into the marker file code to understand it better.

On a related note about marker file handling, I have narrowed down some performance issues with S3 in the marker file cleanup code: https://issues.apache.org/jira/browse/HUDI-1054

@zuyanton this might be of interest to you.
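
To make the hypothesis concrete, here is a minimal sketch of the scenario, not Hudi's actual implementation. The `.marker` suffix, the directory layout, and every function name below are assumptions made up for illustration. It derives candidate data file paths from marker paths, subtracts the committed files, and then separates paths that exist from paths that were never written:

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical suffix; the real marker naming scheme may differ.
MARKER_SUFFIX = ".marker"


def marker_to_data_path(marker_path: str, marker_root: str, base_path: str) -> str:
    """Map a marker path under marker_root to its data file path under base_path."""
    relative = marker_path[len(marker_root):].lstrip("/")
    if relative.endswith(MARKER_SUFFIX):
        relative = relative[: -len(MARKER_SUFFIX)]
    return base_path.rstrip("/") + "/" + relative


def invalid_data_paths(
    marker_paths: Iterable[str],
    committed_paths: Iterable[str],
    marker_root: str,
    base_path: str,
) -> List[str]:
    """Data paths implied by markers that are not part of the committed write."""
    candidates = {
        marker_to_data_path(m, marker_root, base_path) for m in marker_paths
    }
    return sorted(candidates - set(committed_paths))


def split_by_existence(
    paths: Iterable[str], existing: Set[str]
) -> Tuple[List[str], List[str]]:
    """Separate paths that exist in storage from those that were never written."""
    present: List[str] = []
    missing: List[str] = []
    for p in paths:
        (present if p in existing else missing).append(p)
    return present, missing
```

If a task fails after creating its marker but before writing its data file, that path ends up in `missing`. Under this sketch, only the `present` paths need to be deleted or waited on; expecting a `missing` path to appear would never succeed.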
