CrazyBeeline opened a new issue #5105: URL: https://github.com/apache/hudi/issues/5105
**Steps to reproduce the behavior:**

1. Extract the source from Kafka with HoodieDeltaStreamer. Main configuration in `insert_cluster.properties`:

```properties
hoodie.upsert.shuffle.parallelism=100
hoodie.insert.shuffle.parallelism=100
hoodie.bulkinsert.shuffle.parallelism=100
hoodie.delete.shuffle.parallelism=100
hoodie.rollback.parallelism=100
hoodie.cleaner.parallelism=100
hoodie.datasource.write.recordkey.field=insert_time,id
hoodie.datasource.write.partitionpath.field=create_time:TIMESTAMP
hoodie.datasource.write.precombine.field=insert_time
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd

##### memory #####
# on heap
hoodie.memory.merge.fraction=0.6
hoodie.memory.merge.max.size=1073741824
hoodie.memory.compaction.fraction=0.6
#hoodie.memory.compaction.max.size=

##### storage #####
hoodie.logfile.data.block.max.size=268435456
hoodie.logfile.max.size=1073741824
hoodie.parquet.max.file.size=125829120
hoodie.parquet.small.file.limit=104857600
# for MOR
hoodie.logfile.to.parquet.compression.ratio=0.35

##### kafka source #####
hoodie.deltastreamer.source.kafka.topic=hive-kafka-hudi
bootstrap.servers=hadoop02:9092,hadoop01:9092,hadoop03:9092
auto.offset.reset=earliest

##### hudi table #####
hoodie.database.name=default

##### hive sink #####
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=hudi_person_insert_cluster
hoodie.datasource.hive_sync.username=root
hoodie.datasource.hive_sync.password=
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hadoop03:10000
hoodie.datasource.hive_sync.partition_fields=create_time
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.support_timestamp=false
hoodie.datasource.hive_sync.create_managed_table=false
hoodie.datasource.hive_sync.sync_as_datasource=true
hoodie.datasource.hive_sync.batch_num=10000
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.bucket_sync=false
hoodie.datasource.hive_sync.auto_create_database=true
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
#hoodie.datasource.hive_sync.skip_ro_suffix=true
#hoodie.datasource.hive_sync.create_managed_table=false

hoodie.embed.timeline.server=true
hoodie.deltastreamer.schemaprovider.source.schema.file=file:///opt/software/hudi/schame

##### compaction #####
hoodie.compact.inline=false
hoodie.compact.inline.max.delta.commits=10

##### clean #####
hoodie.clean.automatic=true
# KEEP_LATEST_FILE_VERSIONS or KEEP_LATEST_COMMITS
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.delete.bootstrap.base.file=true
hoodie.cleaner.commits.retained=3
# lazily for multi-writers
hoodie.cleaner.policy.failed.writes=EAGER
# for KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=3

##### clustering #####
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=10
hoodie.clustering.async.enabled=false
hoodie.clustering.async.max.commits=3
hoodie.clustering.preserve.commit.metadata=true
hoodie.clustering.plan.strategy.target.file.max.bytes=133169152
hoodie.clustering.plan.strategy.small.file.limit=52428800
hoodie.clustering.plan.strategy.sort.columns=insert_time
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
hoodie.clustering.plan.strategy.daybased.lookback.partitions=1
hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions=2
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.updates.strategy=org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy

##### multi writer #####
#single_writer
hoodie.write.concurrency.mode=optimistic_concurrency_control
#EAGER
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=hadoop01,hadoop02,hadoop03
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=hive_kafka_hudi
hoodie.write.lock.zookeeper.base_path=/hudi_lock
hoodie.write.lock.zookeeper.connection_timeout_ms=15000
hoodie.write.lock.zookeeper.session_timeout_ms=60000

##### commit callback #####
hoodie.write.commit.callback.on=true
hoodie.write.commit.callback.class=org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback
hoodie.write.commit.callback.kafka.bootstrap.servers=hadoop02:9092,hadoop01:9092,hadoop03:9092
hoodie.write.commit.callback.kafka.topic=hudi_commit_callback
#hoodie.write.commit.callback.kafka.partition=
hoodie.write.commit.callback.kafka.acks=all
hoodie.write.commit.callback.kafka.retries=3

##### metadata #####
hoodie.metadata.clean.async=false
hoodie.metadata.cleaner.commits.retained=3
hoodie.metadata.compact.max.delta.commits=10
hoodie.metadata.keep.max.commits=30
hoodie.metadata.keep.min.commits=20
hoodie.commits.archival.batch=10

##### archive #####
hoodie.archive.automatic=true
hoodie.archivelog.folder=archived
hoodie.archive.delete.parallelism=10
```

2. The table has three partitions.
3. Based on the configuration above, I expected only partition create_time=2021-03-08 to be clustered, but in fact all partitions get clustered.
4. I tested both inline settings (the sketch after this list shows the selection I expected):
   * With hoodie.clustering.inline=false: partition create_time=2021-03-08; other partitions are similar.
   * With hoodie.clustering.inline=true: partition create_time=2021-03-08 has a clustering op; partition create_time=2021-03-09 also has a clustering op; partition create_time=2021-03-10 also has a clustering op.
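To make the expectation concrete: with lookback=1 and skipfromlatest=2, I understood the day-based strategy to sort partitions newest-first, skip the two newest, and keep one. This is a minimal sketch of that expected selection, not Hudi's actual implementation (the class and method names are invented for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ExpectedDayBasedSelection {

  // Sketch of the expected selection: newest partitions first, skip
  // `skipfromlatest` of them, then keep `lookback` partitions.
  static List<String> selectPartitions(List<String> partitions, int skipFromLatest, int lookback) {
    return partitions.stream()
        .sorted(Comparator.reverseOrder())  // newest partition first
        .skip(skipFromLatest)               // hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions=2
        .limit(lookback)                    // hoodie.clustering.plan.strategy.daybased.lookback.partitions=1
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> partitions = Arrays.asList(
        "create_time=2021-03-08", "create_time=2021-03-09", "create_time=2021-03-10");
    // Prints [create_time=2021-03-08]: only the oldest partition should be clustered.
    System.out.println(selectPartitions(partitions, 2, 1));
  }
}
```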
**Environment Description**

* Hudi version : 0.10.1
* Spark version : 3.1.3
* Hive version : 3.1.2
* Hadoop version : 3.2.2
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Additional context**
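One way to verify which partitions a scheduled clustering plan actually targets is to read the pending plans off the timeline. A minimal sketch, assuming the Hudi 0.10.x `ClusteringUtils` API; the base path below is a placeholder for the real table path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.util.ClusteringUtils;

public class ShowPendingClusteringPartitions {
  public static void main(String[] args) {
    // Point the meta client at the Hudi table (placeholder path).
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath("hdfs:///path/to/hudi_person_insert_cluster")
        .build();

    // For each pending clustering plan, print the distinct partitions
    // covered by its input file groups.
    ClusteringUtils.getAllPendingClusteringPlans(metaClient).forEach(instantAndPlan -> {
      System.out.println("instant: " + instantAndPlan.getLeft().getTimestamp());
      instantAndPlan.getRight().getInputGroups().stream()
          .flatMap(group -> group.getSlices().stream())
          .map(slice -> String.valueOf(slice.getPartitionPath()))
          .distinct()
          .forEach(partition -> System.out.println("  partition: " + partition));
    });
  }
}
```

With the configuration above, I expected this to print only create_time=2021-03-08, but all three partitions show up.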