[ 
https://issues.apache.org/jira/browse/HUDI-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457244#comment-17457244
 ] 

Raymond Xu commented on HUDI-1258:
----------------------------------

[~vinoth] After going through the code, I have some ideas for the solution. Just 
to confirm my understanding, we probably need to do these two things:
 * in org.apache.hudi.table.action.commit.UpsertPartitioner#assignInserts, 
record the count of updates before inserts are appended to the small files 
being updated; this can be saved in the bucket info from the upsert partitioner
 * in org.apache.hudi.io.HoodieMergeHandle#init, when initializing keyToNewRecords, 
stop adding records from the iterator once that count is reached. Inserts should be 
saved to a separate "ExternalSpillableList" instead and processed after 
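
A rough sketch of the split described in the two points above, to make the idea concrete. This is not Hudi code: Rec, Split, split() and numUpdates are hypothetical names, a plain ArrayList stands in for the proposed "ExternalSpillableList", and it assumes the iterator yields all updates first, followed by the appended inserts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these names are Hudi APIs.
public class SplitUpdatesFromInserts {

  // Stand-in for a Hudi record: just a key and a payload.
  static final class Rec {
    final String key;
    final String value;

    Rec(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  // Holds the two groups produced by split().
  static final class Split {
    // Plays the role of keyToNewRecords: updates looked up by key during merge.
    final Map<String, Rec> updatesByKey = new HashMap<>();
    // Plays the role of the proposed spillable list of inserts.
    final List<Rec> inserts = new ArrayList<>();
  }

  // Assumption: the first numUpdates records are updates and the rest are
  // inserts, with numUpdates being the count recorded by the partitioner.
  static Split split(Iterator<Rec> records, long numUpdates) {
    Split out = new Split();
    long seen = 0;
    while (records.hasNext()) {
      Rec r = records.next();
      if (seen < numUpdates) {
        out.updatesByKey.put(r.key, r); // keyed, to be merged with the existing file
      } else {
        out.inserts.add(r); // kept in order, so duplicate keys are not overwritten
      }
      seen++;
    }
    return out;
  }

  public static void main(String[] args) {
    List<Rec> incoming = Arrays.asList(
        new Rec("k1", "update-1"),
        new Rec("k2", "update-2"),
        new Rec("k3", "insert-a"),
        new Rec("k3", "insert-b")); // same key as the previous insert
    Split s = split(incoming.iterator(), 2);
    // prints "2 updates, 2 inserts"
    System.out.println(s.updatesByKey.size() + " updates, " + s.inserts.size() + " inserts");
  }
}
```

Keeping the inserts in a list rather than the keyed map is what lets both "k3" records survive; if they went into the map, the second would replace the first.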

> Small file handling Merges can be handled without actual merging
> ----------------------------------------------------------------
>
>                 Key: HUDI-1258
>                 URL: https://issues.apache.org/jira/browse/HUDI-1258
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>    Affects Versions: 0.9.0
>            Reporter: Vinoth Chandar
>            Assignee: Raymond Xu
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> If a file slice receives inserts through MergeHandle purely for file sizing 
> reasons, there is no need to build the hashmap and merge. 
>  
> This will also avoid the issue of an insert with a duplicate key 
> overwriting the previous value 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
