[ https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-2943:
---------------------------------
    Labels: core-flow-ds pull-request-available sev:high  (was: core-flow-ds sev:high)

> Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2943
>                 URL: https://issues.apache.org/jira/browse/HUDI-2943
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Harsha Teja Kanna
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: core-flow-ds, pull-request-available, sev:high
>         Attachments: image-2021-12-08-15-10-02-420.png
>
>
> Deltastreamer fails to restart when there is a pending clustering commit from a previous run with Upsert failed exception when inline clustering is on.
> {*}Note{*}: workaround of running Clustering job with --retry-last-failed-clustering-job works
>
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
>
> diagnostics: User class threw exception: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20211206081248919
>     at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
>     at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
>     at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
>     at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
>     at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not allowed to update the clustering file group HoodieFileGroupId\{partitionPath='', fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering operations, we are not going to support update for now.
>     at org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
>
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clustering.plan.strategy.sort.columns=projectid
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
> hoodie.clean.async=true
> hoodie.clean.automatic=true
> hoodie.cleaner.policy=KEEP_LATEST_COMMITS
> hoodie.cleaner.commits.retained=10
> hoodie.deltastreamer.transformer.sql=SELECT id, sid FROM <SRC> a



--
This message was sent by Atlassian Jira
(v8.20.1#820001)