Gatsby-Lee opened a new issue #4896:
URL: https://github.com/apache/hudi/issues/4896
**Describe the problem you faced**
Regardless of the table type (CoW, MoR), I notice missing data when the Metadata
Table is enabled.
For example, if I ingest 100,000 records (no duplicates) with a batch size of
10,000, the final record count in Hudi is not 100,000.
I checked the number of records through Amazon Athena and also
double-checked the count by running a Spark job.
**Full Configuration**
```
{
'className': 'org.apache.hudi'
'hoodie.datasource.hive_sync.database': 'hudi_exp'
'hoodie.datasource.hive_sync.enable': 'true'
'hoodie.datasource.hive_sync.support_timestamp': 'true'
'hoodie.datasource.hive_sync.table': 'hudi_etl_exp'
'hoodie.datasource.hive_sync.use_jdbc': 'false'
'hoodie.datasource.write.hive_style_partitioning': 'true'
'hoodie.datasource.write.partitionpath.field': 'org_id'
'hoodie.datasource.write.recordkey.field': 'obj_id'
'hoodie.table.name': 'hudi_etl_exp'
'hoodie.bulkinsert.shuffle.parallelism': '24'
'hoodie.delete.shuffle.parallelism': '24'
'hoodie.insert.shuffle.parallelism': '24'
'hoodie.upsert.shuffle.parallelism': '24'
'hoodie.index.type': 'BLOOM'
'hoodie.bloom.index.prune.by.ranges': 'true'
'hoodie.datasource.clustering.async.enable': 'false'
'hoodie.datasource.clustering.inline.enable': 'false'
'hoodie.datasource.compaction.async.enable': 'false'
'hoodie.clean.automatic': 'true'
'hoodie.clean.async': 'true'
'hoodie.keep.max.commits': 40
'hoodie.keep.min.commits': 30
'hoodie.cleaner.commits.retained': 20
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'
'hoodie.compact.inline': 'false'
'hoodie.clustering.async.enabled': 'false'
'hoodie.clustering.async.max.commits': 4
'hoodie.clustering.inline': 'false'
'hoodie.metadata.clean.async': 'true'
'hoodie.cleaner.policy.failed.writes': 'LAZY'
'hoodie.write.concurrency.mode': 'OPTIMISTIC_CONCURRENCY_CONTROL'
'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider'
'hoodie.write.lock.zookeeper.port': '2181'
'hoodie.write.lock.zookeeper.url': 'zookeeper_url'
'hoodie.write.lock.zookeeper.base_path': 'zookeeper_base_path'
'hoodie.write.lock.zookeeper.lock_key': 'hudi_etl_exp'
'path': 's3://hello-hudi/hudi_exp/hudi_etl_exp'
'hoodie.datasource.write.precombine.field': '_etl_cluster_ts'
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator'
'hoodie.datasource.hive_sync.partition_fields': 'org_id'
'hoodie.combine.before.upsert': 'true'
'hoodie.datasource.write.operation': 'upsert'
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.table.type': 'COPY_ON_WRITE'
'hoodie.metadata.enable': 'true'
}
```
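One thing worth double-checking in the configuration above is the retention ordering: Hudi expects `hoodie.cleaner.commits.retained` < `hoodie.keep.min.commits` <= `hoodie.keep.max.commits` so the cleaner never removes commits the archival process still needs. The values above (20 < 30 <= 40) satisfy this. A minimal sanity-check sketch (plain Python over the config dict, not a Hudi API call; the helper name is hypothetical):

```python
# Subset of the write options above that govern cleaning and archival.
config = {
    "hoodie.cleaner.commits.retained": 20,
    "hoodie.keep.min.commits": 30,
    "hoodie.keep.max.commits": 40,
}

def retention_ok(cfg):
    # Hudi's documented constraint: retained < keep.min <= keep.max,
    # so cleaning and archival do not step on each other.
    retained = cfg["hoodie.cleaner.commits.retained"]
    keep_min = cfg["hoodie.keep.min.commits"]
    keep_max = cfg["hoodie.keep.max.commits"]
    return retained < keep_min <= keep_max

print(retention_ok(config))  # True for the config in this issue
```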
**To Reproduce**
Steps to reproduce the behavior:
1. generate 100 random records
2. ingest 10 records per batch
3. count the number of ingested records after each batch (10, 20, 30, ...)
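The bookkeeping behind these steps can be sketched in plain Python (the field names `obj_id`, `org_id`, and `_etl_cluster_ts` come from the config above; the dict stands in for the Hudi table under upsert semantics, where each record key appears at most once):

```python
# Generate 100 unique records, ingest them in batches of 10, and verify that
# the cumulative count after each batch matches the running total.
records = [{"obj_id": i, "org_id": i % 4, "_etl_cluster_ts": i} for i in range(100)]
batch_size = 10
table = {}  # stands in for the Hudi table, keyed by the record key (obj_id)

for start in range(0, len(records), batch_size):
    batch = records[start:start + batch_size]
    for rec in batch:
        table[rec["obj_id"]] = rec  # upsert: last write wins per key
    expected = start + batch_size
    # This is the count check from step 3; in the actual repro it fails.
    assert len(table) == expected, f"missing data: {len(table)} != {expected}"

print(len(table))  # 100
```

With unique keys and no failures, every batch adds exactly `batch_size` rows, which is why the observed shortfall points at the write path (here, the Metadata Table) rather than at the repro itself.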
**Expected behavior**
All 100 records should be present in the Hudi table.
**Environment Description**
* Hudi version : 0.9.0
* Spark version : 3.1.1-amzn-0
* Hive version : 2.3.7-amzn-4
* Hadoop version : 3.2.1-amzn-3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no