[ https://issues.apache.org/jira/browse/HUDI-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457244#comment-17457244 ]
Raymond Xu commented on HUDI-1258:
----------------------------------

[~vinoth] After going through the code, I have some ideas for the solution. Just to confirm my understanding, we probably need to do these two things:
* In org.apache.hudi.table.action.commit.UpsertPartitioner#assignInserts, record the count of updates before inserts are appended to the small files being updated; this count can be saved in the bucket info from the upsert partitioner.
* In org.apache.hudi.io.HoodieMergeHandle#init, when initializing keyToNewRecords, stop adding records from the iterator once that count is reached. Inserts should instead be saved to a separate "ExternalSpillableList" and processed afterwards.

> Small file handling Merges can be handled without actual merging
> ----------------------------------------------------------------
>
>                 Key: HUDI-1258
>                 URL: https://issues.apache.org/jira/browse/HUDI-1258
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>    Affects Versions: 0.9.0
>            Reporter: Vinoth Chandar
>            Assignee: Raymond Xu
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> If a file slice gets inserts into MergeHandle, for file sizing reasons, there
> is no reason to really build the hashmap and merge.
>
> This will also avoid the issue of insert with the same duplicate key
> overwriting the previous value
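
To make the two steps proposed in the comment concrete, here is a minimal, self-contained sketch of the splitting logic. It is not Hudi code: the class SmallFileSplitSketch, its split method, the Split holder, and the use of Map.Entry as a record type are all hypothetical stand-ins, and a plain ArrayList takes the place of the proposed "ExternalSpillableList". It also assumes that the iterator yields all updates before any inserts, which is what the recorded count relies on.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hudi.
final class SmallFileSplitSketch {

    /** Holds the two groups produced by the split. */
    static final class Split {
        // Updates, keyed by record key; these still need a real merge.
        final Map<String, String> keyToNewRecords;
        // Inserts routed to the file only for sizing; no merge needed.
        final List<Map.Entry<String, String>> pendingInserts;

        Split(Map<String, String> keyToNewRecords,
              List<Map.Entry<String, String>> pendingInserts) {
            this.keyToNewRecords = keyToNewRecords;
            this.pendingInserts = pendingInserts;
        }
    }

    /**
     * Puts the first updateCount records into the merge map and everything
     * after that into a separate list.
     * Assumption: the iterator yields all updates before any inserts.
     */
    static Split split(Iterator<Map.Entry<String, String>> records, long updateCount) {
        Map<String, String> keyToNewRecords = new LinkedHashMap<>();
        List<Map.Entry<String, String>> pendingInserts = new ArrayList<>();
        long seen = 0;
        while (records.hasNext()) {
            Map.Entry<String, String> record = records.next();
            if (seen < updateCount) {
                // Still within the update range: goes into the merge map.
                keyToNewRecords.put(record.getKey(), record.getValue());
            } else {
                // Past the recorded count: keep as an insert, never keyed,
                // so duplicate keys cannot overwrite each other.
                pendingInserts.add(record);
            }
            seen++;
        }
        return new Split(keyToNewRecords, pendingInserts);
    }
}
```

Because the inserts are kept in a list rather than a map, two inserts that share a record key both survive, which is the duplicate-key problem the issue description mentions.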