josuexcxc opened a new issue #4754:
URL: https://github.com/apache/hudi/issues/4754


   My initial Hudi table contains duplicates for several record keys. When I 
write updates to these duplicate records, Hudi keeps only a single record per 
key, but I need it to keep the same number of duplicate records.
    
   initial table
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |      500|   22/10/21|    22/10/21|   502|         MXN|
   |      500|   22/10/21|    22/10/21|   502|         MXN|
   |      501|   22/10/21|    22/10/21|  1969|         MXN|
   |      502|   22/10/21|    22/10/21|  1612|         MXN|
   |      503|   22/10/21|    22/10/21|  1559|         MXN|
   |      504|   22/10/21|    22/10/21|  1494|         MXN|
   |      505|   22/10/21|    22/10/21|  1448|         MXN|
   |      506|   22/10/21|    22/10/21|  1059|         USD|
   |      507|   22/10/21|    22/10/21|   795|         USD|
   |      508|   22/10/21|    22/10/21|   822|         USD|
   |      509|   22/10/21|    22/10/21|  1612|         MXN|
   |      510|   22/10/21|    22/10/21|  1578|         MXN|
   |      510|   22/10/21|    22/10/21|  1578|         MXN|
   |      511|   22/10/21|    22/10/21|   709|         USD|
   +---------+-----------+------------+------+------------+`
   upsertDF
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |520      |22/10/21   |22/10/21    |713   |USD         |
   |520      |22/10/21   |22/10/21    |713   |USD         |
   |510      |22/10/21   |22/10/21    |1578  |MXN         |
   |510      |22/10/21   |22/10/21    |1578  |MXN         |
   |500      |22/10/21   |22/10/21    |502   |MXN         |
   |500      |22/10/21   |22/10/21    |502   |MXN         |
   |515      |22/10/21   |22/10/21    |1803  |MXN         |
   +---------+-----------+------------+------+------------+`
   
   hudi table with applied upsert operation
   
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |501      |22/10/21   |22/10/21    |1969  |MXN         |
   |502      |22/10/21   |22/10/21    |1612  |MXN         |
   |503      |22/10/21   |22/10/21    |1559  |MXN         |
   |504      |22/10/21   |22/10/21    |1494  |MXN         |
   |505      |22/10/21   |22/10/21    |1448  |MXN         |
   |506      |22/10/21   |22/10/21    |1059  |USD         |
   |507      |22/10/21   |22/10/21    |795   |USD         |
   |508      |22/10/21   |22/10/21    |822   |USD         |
   |509      |22/10/21   |22/10/21    |1612  |MXN         |
   |510      |22/10/21   |23/10/21    |1600  |MXN         |
   |511      |22/10/21   |22/10/21    |709   |USD         |`
   
   As you can see, the Hudi table has only one record with AccountID 510, 
when there should be 2.
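
To make the duplication concrete, the sketch below counts how many rows share the same record key. It is plain Python rather than Spark so it can be run anywhere. The key fields are the ones listed in `key_fields` in the configuration further down, the rows are a small sample copied from the initial table, and `count_duplicate_keys` is a hypothetical helper name, not part of Hudi.

```python
from collections import Counter

# Record key fields, taken from key_fields in the configuration in this issue.
KEY_FIELDS = ("AccountID", "Amount", "ModifiedDate")


def count_duplicate_keys(rows, key_fields=KEY_FIELDS):
    """Return {key_tuple: count} for every record key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# A few rows copied from the initial table (not the full table).
sample_rows = [
    {"AccountID": 500, "CreatedDate": "22/10/21", "ModifiedDate": "22/10/21", "Amount": 502, "CurrencyCode": "MXN"},
    {"AccountID": 500, "CreatedDate": "22/10/21", "ModifiedDate": "22/10/21", "Amount": 502, "CurrencyCode": "MXN"},
    {"AccountID": 501, "CreatedDate": "22/10/21", "ModifiedDate": "22/10/21", "Amount": 1969, "CurrencyCode": "MXN"},
    {"AccountID": 510, "CreatedDate": "22/10/21", "ModifiedDate": "22/10/21", "Amount": 1578, "CurrencyCode": "MXN"},
    {"AccountID": 510, "CreatedDate": "22/10/21", "ModifiedDate": "22/10/21", "Amount": 1578, "CurrencyCode": "MXN"},
]

duplicates = count_duplicate_keys(sample_rows)
print(duplicates)
# {(500, 502, '22/10/21'): 2, (510, 1578, '22/10/21'): 2}
```

On a Spark DataFrame the same check could probably be written as `df.groupBy(*KEY_FIELDS).count().filter("count > 1")`, assuming the column names match the tables above.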
   
   **Hudi Configuration**
   table_name = "foto"
   localPath = f"s3://dev-vol-model-zone/mdl_com_ancillaries/public.bt_ancillaries_final/hudi_tables/{table_name}/"
   # Composite record key, built by ComplexKeyGenerator from these columns
   key_fields = "AccountID,Amount,ModifiedDate"
   precombine_fields = "ModifiedDate"
   # Empty string: no partition path field is configured
   partition_fields = ""
   
   hudiOptions = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.recordkey.field': key_fields,
       'hoodie.datasource.write.partitionpath.field': partition_fields,
       'hoodie.datasource.write.precombine.field': precombine_fields,
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       }
   
   # Initial load: explicit insert operation, overwriting any existing table
   foto1.write.format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'insert') \
       .options(**hudiOptions) \
       .mode('overwrite') \
       .save(localPath)
   
   # Updates: no operation is set, so the default write operation (upsert) applies
   upserts.write.format('org.apache.hudi') \
       .options(**hudiOptions) \
       .mode('append') \
       .save(localPath)
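
The single row per key reported above is what a keyed merge would produce if each record key can map to at most one row. The sketch below is a toy model in plain Python under that assumption. It is not Hudi's implementation: `toy_upsert` is a hypothetical helper, and the tie-breaking rule (an incoming row wins when its precombine value is greater than or equal to the stored one, compared as plain values) is an assumption made only for illustration.

```python
def toy_upsert(table_rows, incoming_rows, key_fields, precombine_field):
    """Toy keyed merge: keep exactly one row per record key.

    Assumption: a later row replaces the stored row for the same key when its
    precombine value compares greater than or equal to the stored one.
    """
    merged = {}
    for row in list(table_rows) + list(incoming_rows):
        key = tuple(row[f] for f in key_fields)
        current = merged.get(key)
        if current is None or row[precombine_field] >= current[precombine_field]:
            merged[key] = row
    return list(merged.values())


key_fields = ("AccountID", "Amount", "ModifiedDate")

# Two identical rows for account 510, as in the initial table.
table_rows = [
    {"AccountID": 510, "ModifiedDate": "22/10/21", "Amount": 1578},
    {"AccountID": 510, "ModifiedDate": "22/10/21", "Amount": 1578},
]
# The upsert batch also carries the 510 row twice, plus a new account.
incoming_rows = [
    {"AccountID": 510, "ModifiedDate": "22/10/21", "Amount": 1578},
    {"AccountID": 510, "ModifiedDate": "22/10/21", "Amount": 1578},
    {"AccountID": 515, "ModifiedDate": "22/10/21", "Amount": 1803},
]

result = toy_upsert(table_rows, incoming_rows, key_fields, "ModifiedDate")
for row in result:
    print(row)
```

In this model the four rows for account 510 collapse into one, because the dictionary can hold only one value per key. Keeping both copies would need something that distinguishes them, for example an extra column in the record key, which the data shown here does not have.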
   
   Spark: 2.4.7
   EMR: 5.33.0
   HUDI: 0.7.0-amzn-1
   


