[jira] [Updated] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering

ASF GitHub Bot (Jira) Wed, 12 Jan 2022 05:23:04 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HUDI-2943:
---------------------------------
    Labels: core-flow-ds pull-request-available sev:high  (was: core-flow-ds 
sev:high)

> Deltastreamer fails to continue with pending clustering after restart in 
> 0.10.0 and inline clustering
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2943
>                 URL: https://issues.apache.org/jira/browse/HUDI-2943
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Harsha Teja Kanna
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: core-flow-ds, pull-request-available, sev:high
>         Attachments: image-2021-12-08-15-10-02-420.png
>
>
> When inline clustering is on and a previous run left a pending clustering 
> commit, Deltastreamer fails to restart with a HoodieUpsertException.
> Note: as a workaround, running the clustering job with 
> --retry-last-failed-clustering-job works.
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
> diagnostics: User class threw exception: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20211206081248919
> at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
> at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
> at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
> at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
> at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
> at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not allowed to update the clustering file group HoodieFileGroupId{partitionPath='', fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering operations, we are not going to support update for now.
> at org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clustering.plan.strategy.sort.columns=projectid
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
> hoodie.clean.async=true
> hoodie.clean.automatic=true
> hoodie.cleaner.policy=KEEP_LATEST_COMMITS
> hoodie.cleaner.commits.retained=10
> hoodie.deltastreamer.transformer.sql=SELECT id, sid FROM <SRC> a
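
The failure above depends on a clustering instant from a previous run still being pending on the timeline. As a rough diagnostic sketch (not part of the report), the Python below scans a listing of file names from the table's .hoodie directory and reports instants that have a requested or inflight replacecommit but no completed one. It assumes clustering is recorded as a "replacecommit" action with ".requested" and ".inflight" suffixes for pending states; other operations such as insert_overwrite may also use replacecommit, so a hit is only a hint. The function name and the timestamps in the usage note are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: timeline files are named <instant>.replacecommit[.requested|.inflight],
# where a name with no state suffix marks the completed instant.
_REPLACE_INSTANT = re.compile(r"^(\d+)\.replacecommit(\.requested|\.inflight)?$")


def pending_replacecommits(timeline_files: Iterable[str]) -> List[str]:
    """Return instant times that have a pending replacecommit but no completed one."""
    seen = set()
    completed = set()
    for name in timeline_files:
        match = _REPLACE_INSTANT.match(name)
        if match is None:
            continue  # not a replacecommit file (e.g. .commit, hoodie.properties)
        instant, state = match.group(1), match.group(2)
        seen.add(instant)
        if state is None:
            completed.add(instant)
    return sorted(seen - completed)
```

For example, passing a listing that contains "20211206081000000.replacecommit.requested" and "20211206081000000.replacecommit.inflight" but no "20211206081000000.replacecommit" would return that instant, which may indicate that the clustering job should be retried before Deltastreamer is restarted.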



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
