alexeykudinkin commented on a change in pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#discussion_r815187048



##########
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -58,6 +58,31 @@ default T preCombine(T oldValue, Properties properties) {
     return preCombine(oldValue);
   }
 
+  /**
+   * When more than one HoodieRecord has the same HoodieKey in the incoming batch, this function combines them before attempting to insert/upsert, taking in a property map.
+   *
+   * @param oldValue instance of the old {@link HoodieRecordPayload} to be combined with.
+   * @param properties Payload-related properties. For example, pass the ordering field(s) name to extract from the value in storage.
+   * @param schema Schema of the record
+   * @return the combined value
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
+  default T preCombine(T oldValue, Properties properties, Schema schema) {

Review comment:
    Let me try to clarify a few things: 

   `preCombine` has a _very specific_ semantic: it de-duplicates by picking the "most recent" among records in the batch. The expectation has always been that, when handed 2 records, it **has to** return one of them; it cannot produce a new record. If we want to revisit this semantic, that is a far larger change that will surely require writing an RFC and a broader discussion regarding the merits of such a migration. Please also keep in mind that, as of RFC-46, there's an effort underway to abstract the whole "record combination/merging" semantic out of the `RecordPayload` hierarchy into a standalone Combination/Merge Engine API.
   
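   To make that contract concrete, here's a minimal standalone sketch. The `SimplePayload` class and its fields are hypothetical stand-ins for illustration only, not Hudi's actual `HoodieRecordPayload` API:

   ```java
   import java.util.Objects;

   // Hypothetical, simplified stand-in for a HoodieRecordPayload implementation.
   class SimplePayload {
       final String recordKey;   // stand-in for the HoodieKey
       final long orderingVal;   // e.g. an event timestamp / ordering field
       final String data;

       SimplePayload(String recordKey, long orderingVal, String data) {
           this.recordKey = Objects.requireNonNull(recordKey);
           this.orderingVal = orderingVal;
           this.data = data;
       }

       // De-duplication contract: given two payloads with the same key,
       // return the "most recent" ONE of them -- never a newly fabricated,
       // merged record.
       SimplePayload preCombine(SimplePayload oldValue) {
           return this.orderingVal >= oldValue.orderingVal ? this : oldValue;
       }
   }
   ```

   Note that the result is always one of the two input instances, which is why no `Schema` is needed for pure de-duplication.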
   > First, from the description of the preCombine method, it is used for combining multiple records with the same HoodieKey before attempting to insert/upsert to disk. "Combine multiple records" might not mean only choosing one of them; we could also combine and merge them into a new one, depending on how the subclass implements the preCombine logic (please correct me if my understanding is wrong :) ). Yeah, it might be a little confusing that we need the Schema if we are trying to merge them.
   
   Please see my comment regarding the `preCombine` semantic above. I certainly agree with you that the name is confusing, but I've tried to clear up that confusion. Let me know if you have more questions about it.
   
   > Second, I checked when we call the preCombine method: it is used to de-duplicate records with the same HoodieKey before insert/update to disk, especially in the Flink write case. Even though the de-duplication logic is to choose the latest record, we need to ensure that one HoodieKey maps to only one record before comparing against the existing record and writing to disk; otherwise, some records will be missed. For example, HoodieMergeHandle.init(fileId, newRecordsIter) converts the record iterator to a map keyed by recordKey. So we cannot drop the de-duping logic and merge records against what is on disk unless we change the logic here. And if we implement another class/method to handle the merge logic and switch the existing de-duping logic from calling preCombine to the new class/method, we have to add a condition to control whether we should call preCombine or not, which I think is not a good way. Instead, we should handle it in the preCombine method via different payload implementations.
   
   You're bringing up good points; let's dive into them one by one. Currently we have 2 mechanisms:
   
   1. `preCombine`, which selects the "most recent" record among those having the same key within the batch
   2. `combineAndGetUpdateValue`, which combines the previous or "historical" record (on disk) with the new incoming one (all partial-merging semantics are currently implemented in this method)
   
   You rightfully mention that one of the current invariants is that the batch is de-duped at a certain level (because we have to maintain PK uniqueness on disk), and we might need to shift that to accommodate the case you have. And that's exactly what my question was: if you can elaborate on the use-case you have at hand that you're trying to solve with this PR, I will be able to better understand where you're coming from and what the best path forward is for us here.
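   
   For clarity, the invariant you mention about keying a batch by record key can be sketched as follows (illustrative names only, not Hudi's actual classes): duplicate keys silently overwrite earlier entries, so records are lost unless the batch was de-duped (e.g. via `preCombine`) first.

   ```java
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   class KeyedBatch {
       // Collect a batch of (recordKey, value) pairs into a map keyed by
       // recordKey, analogous to what HoodieMergeHandle does with
       // newRecordsIter. With duplicate keys, the last write wins and the
       // earlier duplicate is dropped.
       static Map<String, String> keyByRecordKey(List<Map.Entry<String, String>> batch) {
           Map<String, String> keyed = new HashMap<>();
           for (Map.Entry<String, String> rec : batch) {
               keyed.put(rec.getKey(), rec.getValue()); // last write wins
           }
           return keyed;
       }
   }
   ```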
   
   The questions I'm looking for answers to are basically the following:
   
   1. What's the nature of your use-case? (domain, record types, frequency, size, etc.)
   2. Where are the requirements for partial updates coming from?
   
   I'm happy to set up 30 minutes to talk in person about this, or to connect on Slack and discuss it there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]