haripriyarhp opened a new issue, #6166:
URL: https://github.com/apache/hudi/issues/6166

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I am using the Hudi Kafka Connect sink to write to S3. There is a mismatch between the number of messages present in the topic and the number of records showing up in Athena, for both MoR and CoW tables. For MoR, some records are still missing even after running compaction.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Initially, I sent 100 messages to the topic. They were reflected in Athena after 
compaction. 
   2. I then sent 100 new messages, plus some updates and some duplicates of the 
previous 100. The record count was not correct. 
   3. Later I sent around 1000 messages, and the record count was still not correct 
after compaction.
   4. The connector config file properties are:
   ```json
   {
       "name": "hudi-sink",
       "config": {
           "bootstrap.servers": "localhost:9092",
           "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
           "tasks.max": "4",
           "control.topic.name": "hudi-control-topic-mor",
           "topics": "sensor",
           "hoodie.table.name": "sensor-mor",
           "hoodie.table.type": "MERGE_ON_READ",
           "key.converter": "org.apache.kafka.connect.storage.StringConverter",
           "value.converter": "org.apache.kafka.connect.storage.StringConverter",
           "hoodie.base.path": "s3a://path/sensor_mor",
           "hoodie.datasource.write.recordkey.field": "oid,styp,sname,ts",
           "hoodie.datasource.write.partitionpath.field": "gid,datatype,origin,oid",
           "hoodie.datasource.write.keygenerator.type": "COMPLEX",
           "hoodie.datasource.write.hive_style_partitioning": "true",
           "hoodie.compact.inline.max.delta.commits": 2,
           "fs.s3a.fast.upload": "true",
           "fs.s3a.access.key": "myaccesskey",
           "fs.s3a.secret.key": "secretkey",
           "hoodie.schemaprovider.class": "org.apache.hudi.schema.SchemaRegistryProvider",
           "hoodie.deltastreamer.schemaprovider.registry.url": "http://localhost:8081/subjects/sensor/versions/latest",
           "hoodie.kafka.commit.interval.secs": 60
       }
   }
   ```
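
   Since the record key above is a composite of `oid,styp,sname,ts`, one thing worth ruling out is that some of the "missing" records are messages sharing the same key, which would collapse into a single row if the write deduplicates on the record key. Below is a minimal sketch (plain Python, no Kafka dependency; the field names mirror the `hoodie.datasource.write.recordkey.field` setting, and the sample messages are hypothetical) for comparing the message count against the distinct-key count:

   ```python
   import json

   # From hoodie.datasource.write.recordkey.field in the connector config
   RECORD_KEY_FIELDS = ["oid", "styp", "sname", "ts"]

   def distinct_key_count(messages):
       """Count distinct composite record keys among JSON-encoded messages.

       If the writer deduplicates on the record key, messages sharing the
       same (oid, styp, sname, ts) collapse into one row, so the table row
       count would equal the number of distinct keys, not of messages.
       """
       keys = set()
       for raw in messages:
           rec = json.loads(raw)
           keys.add(tuple(str(rec[f]) for f in RECORD_KEY_FIELDS))
       return len(keys)

   # Hypothetical sample: 3 messages, two of which share the same key
   msgs = [
       '{"oid": "1", "styp": "a", "sname": "s1", "ts": 100, "value": 1}',
       '{"oid": "1", "styp": "a", "sname": "s1", "ts": 100, "value": 2}',
       '{"oid": "2", "styp": "a", "sname": "s1", "ts": 101, "value": 3}',
   ]
   print(distinct_key_count(msgs))  # 2 distinct keys for 3 messages
   ```

   If the Athena count matches the distinct-key count rather than the message count, the gap is key collapse rather than data loss; if it matches neither, records are genuinely going missing.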
   
   **Expected behavior**
   
   Irrespective of the kind of messages sent to the topic (new messages, 
duplicates, or updates), the connector should write all of them to the table, 
and the record count in Athena should match.
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.1.3
   
   * Hive version :
   
   * Hadoop version : 3.2
     
   * Storage (HDFS/S3/GCS..) : S3
     
   * Running on Docker? (yes/no) : No
    
   
   

