[
https://issues.apache.org/jira/browse/HUDI-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457244#comment-17457244
]
Raymond Xu edited comment on HUDI-1258 at 12/10/21, 4:59 PM:
-------------------------------------------------------------
[~vinoth] After going through the code, I have some ideas for the solution. Just
to confirm my understanding, we probably need to do these two things:
* In org.apache.hudi.table.action.commit.UpsertPartitioner#assignInserts,
record the count of updates before inserts are appended to the small files
being updated; this count can be saved in the bucket info from the upsert
partitioner.
* In org.apache.hudi.io.HoodieMergeHandle#init, when initializing
keyToNewRecords, stop adding records from the iterator once the count is
reached. Inserts should be saved to a separate "ExternalSpillableList" instead
and processed afterwards.
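
A rough sketch of the second step, to make the idea concrete. It is plain Java with no Hudi dependencies; the class, field, and method names are hypothetical, a plain ArrayList stands in for the spillable list, and it assumes the iterator yields all updates before any inserts, so that the update count from the bucket info marks the boundary.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hudi code: split a merge handle's input by update count.
class UpdateInsertSplitSketch {

    // Minimal stand-in for an incoming record (hypothetical type).
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Updates are keyed so they can be merged against the existing file.
    final Map<String, Rec> keyToNewRecords = new LinkedHashMap<>();
    // Inserts are only appended, so duplicate keys cannot overwrite each other.
    final List<Rec> pendingInserts = new ArrayList<>();

    // numUpdates plays the role of the count saved in the bucket info.
    // Assumption: the iterator yields all updates before any inserts.
    void init(Iterator<Rec> incoming, long numUpdates) {
        long seen = 0;
        while (incoming.hasNext()) {
            Rec rec = incoming.next();
            if (seen < numUpdates) {
                keyToNewRecords.put(rec.key, rec);
            } else {
                pendingInserts.add(rec);
            }
            seen++;
        }
    }
}
```

With this split, only the updates pay for the hashmap, and the inserts can be written out after the merge pass without being looked up by key.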
Update: as discussed, it makes sense to error out during upsert when
duplicate-key records are seen in the merge handle.
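
A minimal sketch of what erroring out could look like, again in plain Java with hypothetical names. Records are modeled as {key, value} string pairs, and IllegalStateException is only a placeholder for whatever exception the merge handle would actually throw.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch, not Hudi code: fail fast on duplicate keys while building the map.
class DuplicateKeyCheckSketch {

    // Each incoming element is a {key, value} pair with a non-null value.
    static Map<String, String> buildKeyToNewRecords(Iterator<String[]> incoming) {
        Map<String, String> keyToNewRecords = new HashMap<>();
        while (incoming.hasNext()) {
            String[] kv = incoming.next();
            // putIfAbsent returns the existing value when the key is already present.
            String previous = keyToNewRecords.putIfAbsent(kv[0], kv[1]);
            if (previous != null) {
                // Placeholder exception type; the real one is an open choice.
                throw new IllegalStateException("Duplicate record key in upsert batch: " + kv[0]);
            }
        }
        return keyToNewRecords;
    }
}
```

Failing here makes the duplicate visible to the writer instead of letting the later record silently replace the earlier one.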
> Small file handling Merges can be handled without actual merging
> ----------------------------------------------------------------
>
> Key: HUDI-1258
> URL: https://issues.apache.org/jira/browse/HUDI-1258
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Affects Versions: 0.9.0
> Reporter: Vinoth Chandar
> Assignee: Raymond Xu
> Priority: Blocker
> Fix For: 0.11.0
>
>
> If a file slice gets inserts in MergeHandle for file-sizing reasons, there is
> no real need to build the hashmap and merge.
>
> This will also avoid the issue of an insert with a duplicate key overwriting
> the previous value.