kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1099994068

   Upon further investigation, and after enabling additional logs on EMR, the deletion of the file during compaction happens in org.apache.hudi.table.HoodieTable#reconcileAgainstMarkers:
   
   ```
   if (!invalidDataPaths.isEmpty()) {
     LOG.info("Removing duplicate data files created due to spark retries before committing. Paths=" + invalidDataPaths);
   }
   ```
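   For context, the reconciliation idea is roughly: every data file a task attempt creates leaves a marker, and at commit time any file that has a marker but is not part of the final committed set is treated as a leftover from a retried task and deleted. A minimal, self-contained sketch of that idea (not Hudi's actual implementation; all names here are illustrative):

   ```java
   import java.io.IOException;
   import java.nio.file.*;
   import java.util.*;

   // Illustrative sketch of marker-based reconciliation: data files that were
   // created (have markers) but are absent from the committed set are deleted
   // as duplicates left behind by task retries.
   public class ReconcileSketch {
       public static Set<String> invalidDataPaths(Set<String> markedPaths,
                                                  Set<String> committedPaths) {
           Set<String> invalid = new HashSet<>(markedPaths);
           invalid.removeAll(committedPaths);   // created but never committed
           return invalid;
       }

       public static void main(String[] args) throws IOException {
           Path dir = Files.createTempDirectory("reconcile-demo");
           // Two attempts wrote files; only the second attempt's file was committed.
           Path retryFile = Files.createFile(dir.resolve("file_attempt0.parquet"));
           Path goodFile  = Files.createFile(dir.resolve("file_attempt1.parquet"));

           Set<String> marked    = Set.of(retryFile.toString(), goodFile.toString());
           Set<String> committed = Set.of(goodFile.toString());

           for (String p : invalidDataPaths(marked, committed)) {
               System.out.println("Removing duplicate data file: " + p);
               Files.deleteIfExists(Paths.get(p));
           }
           try (var files = Files.list(dir)) {
               System.out.println("remaining=" + files.count());
           }
       }
   }
   ```

   The bug report above amounts to this set difference being computed against the wrong (or concurrently changing) committed set, so a file that is later committed gets deleted first.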
   
   However, later in the logs the same file is written and committed in the instant:
   ```
   INFO SparkRDDWriteClient: Committing Compaction 20220414232316. Finished 
with result 
HoodieCommitMetadata{partitionToWriteStats={cluster=96/shard=14377=[HoodieWriteStat{fileId='9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0',
 
path='cluster=96/shard=14377/9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0_0-617-7132_20220414232316.parquet',
 prevCommit='20220414225217', numWrites=122886, numDeletes=0, 
numUpdateWrites=121939, totalWriteBytes=23331178, totalWriteErrors=0, 
tempPath='null', partitionPath='cluster=96/shard=14377', 
totalLogRecords=341027, totalLogFilesCompacted=3, 
totalLogSizeCompacted=285373803, totalUpdatedRecordsCompacted=121939, 
totalLogBlocks=9, totalCorruptLogBlock=0, totalRollbackBlocks=0}]}, 
compacted=true,
   ```
   So it leaves the system in an inconsistent state. It looks like a concurrency issue to me.
   
   I will try to submit multiple StreamingQuery instances in different threads by leveraging the Spark fair scheduler pool. Will update with the status.
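   For reference, Spark resolves the `spark.scheduler.pool` local property per thread, which is why each StreamingQuery has to be started from its own thread. A hedged sketch of that thread-per-query pattern (the actual Spark calls, `sparkContext().setLocalProperty(...)` and `writeStream()...start()`, are stubbed out here; the pool selection is simulated with a ThreadLocal, mirroring how SparkContext local properties behave):

   ```java
   import java.util.concurrent.*;

   // Thread-per-query skeleton: each streaming query is submitted from its own
   // thread after selecting a scheduler pool. Pool/query names are illustrative.
   public class MultiQuerySketch {
       // Stand-in for SparkContext's thread-local properties.
       static final ThreadLocal<String> SCHEDULER_POOL = new ThreadLocal<>();

       // Stand-in for:
       //   spark.sparkContext().setLocalProperty("spark.scheduler.pool", pool);
       //   df.writeStream()...queryName(queryName)...start();
       static String startQuery(String pool, String queryName) {
           SCHEDULER_POOL.set(pool);
           return queryName + " started in pool " + SCHEDULER_POOL.get();
       }

       public static void main(String[] args) throws Exception {
           ExecutorService exec = Executors.newFixedThreadPool(2);
           Future<String> q1 = exec.submit(() -> startQuery("pool1", "ingest"));
           Future<String> q2 = exec.submit(() -> startQuery("pool2", "compact"));
           System.out.println(q1.get());
           System.out.println(q2.get());
           exec.shutdown();
       }
   }
   ```

   With the fair scheduler enabled (`spark.scheduler.mode=FAIR`), queries in different pools then get independent shares of the cluster rather than queuing behind one another.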

