ksrihari93 opened a new issue, #5822:
URL: https://github.com/apache/hudi/issues/5822

   
   **Describe the problem you faced**
   
   Hudi clustering is not working.
   
   I'm using the Hudi DeltaStreamer in continuous mode with a Kafka source.
   
   We have 120 partitions in the Kafka topic, and the ingestion rate is ~200k records per minute (RPM).
   
   We are using BULK_INSERT mode to ingest data into the target location.
   
   However, a lot of small files are being generated. To overcome this small-file problem we enabled Hudi clustering, but the files are still not being merged.
   
   Configuration for the job:
   
   ```
   # base properties
   hoodie.insert.shuffle.parallelism=50
   hoodie.bulkinsert.shuffle.parallelism=200
   hoodie.embed.timeline.server=true
   hoodie.filesystem.view.type=EMBEDDED_KV_STORE
   hoodie.compact.inline=false
   hoodie.bulkinsert.sort.mode=none
   ```
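   As an aside on where the small files come from: this is just back-of-the-envelope arithmetic based on the numbers above, not Hudi code, and the average record size is a made-up assumption. With bulk insert in continuous mode, each sync can write up to one base file per shuffle partition:

   ```python
   # Rough illustration (not Hudi internals) of the small-file fan-out
   # from bulk insert in continuous mode with the settings above.
   records_per_minute = 200_000          # ingestion rate from this issue
   sync_interval_seconds = 60            # --min-sync-interval-seconds
   bulkinsert_parallelism = 200          # hoodie.bulkinsert.shuffle.parallelism
   avg_record_bytes = 1_000              # hypothetical payload size, purely assumed

   records_per_commit = records_per_minute * sync_interval_seconds // 60
   files_per_commit = bulkinsert_parallelism  # worst case: one file per shuffle partition
   avg_file_bytes = records_per_commit * avg_record_bytes // files_per_commit

   print(files_per_commit)   # 200 files per commit in the worst case
   print(avg_file_bytes)     # 1000000 -> ~1 MB each under these assumptions
   ```

   Under these assumptions every minute of ingestion produces a couple hundred ~1 MB files, far below the configured small-file limit, which is exactly the situation clustering is supposed to clean up.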
   
   
   ```
   # cleaner properties
   hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
   hoodie.cleaner.fileversions.retained=60
   hoodie.clean.async=true
   ```
   
   ```
   # archival
   hoodie.keep.min.commits=12
   hoodie.keep.max.commits=15
   ```
   
   ```
   # datasource properties
   hoodie.deltastreamer.schemaprovider.registry.url=
   hoodie.datasource.write.recordkey.field=
   hoodie.deltastreamer.source.kafka.topic=
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   hoodie.datasource.write.partitionpath.field=timestamp:TIMESTAMP
   hoodie.deltastreamer.kafka.source.maxEvents=600000000
   hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
   hoodie.deltastreamer.keygen.timebased.input.timezone=UTC
   hoodie.deltastreamer.keygen.timebased.output.timezone=UTC
   hoodie.deltastreamer.keygen.timebased.output.dateformat='dt='yyyy-MM-dd
   hoodie.clustering.async.enabled=true
   hoodie.clustering.plan.strategy.target.file.max.bytes=3000000000
   hoodie.clustering.plan.strategy.small.file.limit=200000001
   hoodie.clustering.async.max.commits=1
   hoodie.clustering.plan.strategy.max.num.groups=10
   ```
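   For reference, this is how we understand the plan-strategy knobs above to interact. The sketch below is our reading of the documented semantics, not Hudi source code: files smaller than `small.file.limit` are clustering candidates, and candidates are packed into at most `max.num.groups` groups of up to `target.file.max.bytes` each.

   ```python
   # Sketch of the assumed selection/packing semantics (not the actual
   # Hudi plan strategy), using the limits configured in this issue.
   SMALL_FILE_LIMIT = 200_000_001       # hoodie.clustering.plan.strategy.small.file.limit
   TARGET_FILE_MAX = 3_000_000_000      # hoodie.clustering.plan.strategy.target.file.max.bytes
   MAX_NUM_GROUPS = 10                  # hoodie.clustering.plan.strategy.max.num.groups

   def plan_candidates(file_sizes):
       """Files eligible for clustering: anything under the small-file limit."""
       return [s for s in file_sizes if s < SMALL_FILE_LIMIT]

   def pack_into_groups(candidates):
       """Greedily pack candidates into groups capped at TARGET_FILE_MAX bytes."""
       groups, current, current_bytes = [], [], 0
       for size in sorted(candidates):
           if current and current_bytes + size > TARGET_FILE_MAX:
               groups.append(current)
               current, current_bytes = [], 0
           current.append(size)
           current_bytes += size
       if current:
           groups.append(current)
       return groups[:MAX_NUM_GROUPS]   # the plan caps the number of groups

   # Example: fifty 1 MB files are all candidates and fit in a single group;
   # a 500 MB file is above the small-file limit and is skipped.
   groups = pack_into_groups(plan_candidates([1_000_000] * 50 + [500_000_000]))
   print(len(groups))        # 1
   ```

   So with the configuration above we would expect each scheduled plan to sweep up all the ~1 MB files, which makes the absence of any clustering activity in the logs the surprising part.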
   
   
   ```
   # kafka props
   bootstrap.servers=
   schema.registry.url=
   ```
   
   DeltaStreamer class arguments:
   
   ```
       - "--table-type"
       - "COPY_ON_WRITE"
       - "--props"
       - "/opt/spark/hudi/config/source.properties"
       - "--schemaprovider-class"
       - "org.apache.hudi.utilities.schema.SchemaRegistryProvider"
       - "--source-class"
       - "org.apache.hudi.utilities.sources.JsonKafkaSource"
       - "--target-base-path"
       - ""
       - "--target-table"
       - ""
       - "--op"
       - "BULK_INSERT"
       - "--source-ordering-field"
       - "timestamp"
       - "--continuous"
       - "--min-sync-interval-seconds"
       - "60"
   ```
   
   
   * Hudi version: 0.9
   
   * Spark version: 2.4.4
   
   * Storage (HDFS/S3/GCS..): BLOB
   
   * Running on Docker? (yes/no): no (Kubernetes)
   
   
   
   **Stacktrace**
   
   ```
   22/06/09 22:01:36 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/06/09 22:11:07 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/06/09 22:11:07 INFO RocksDbBasedFileSystemView: Resetting file groups in pending clustering to ROCKSDB based file-system view at /tmp/hoodie_timeline_rocksdb, Total file-groups=0
   ```
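   As we read this log, "Found 0 files in pending clustering operations" suggests that no clustering plan was ever scheduled, rather than a plan that was scheduled and failed. Since clustering is committed on the timeline as a `replacecommit` instant, one quick sanity check (sketch only; `BASE_PATH` is a placeholder for the real table base path):

   ```shell
   # Look for replacecommit instants on the Hudi timeline; if none exist,
   # no clustering plan was ever scheduled for this table.
   # BASE_PATH below is a placeholder -- point it at the real table base path.
   BASE_PATH=${BASE_PATH:-/tmp/hudi_table_demo}
   mkdir -p "$BASE_PATH/.hoodie"    # demo dir only, so the command runs anywhere
   ls "$BASE_PATH/.hoodie" | grep 'replacecommit' || echo "no clustering instants found"
   ```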
   
   

