[jira] [Updated] (SPARK-47842) Spark job relying over Hudi are blocked after one or zero commit

Dongjoon Hyun (Jira) Sat, 22 Feb 2025 22:46:31 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dongjoon Hyun updated SPARK-47842:
----------------------------------
    Priority: Major  (was: Blocker)

> Spark job relying over Hudi are blocked after one or zero commit
> ----------------------------------------------------------------
>
>                 Key: SPARK-47842
>                 URL: https://issues.apache.org/jira/browse/SPARK-47842
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 3.3.0
>         Environment: Hudi version : 0.12.1-amzn-0
> Spark version : 3.3.0
> Hive version : 3.1.3
> Hadoop version : 3.3.3 amz
> Storage (HDFS/S3/GCS..) : S3
> Running on Docker? (yes/no) : no (EMR 6.9.0)
> Additional context
>            Reporter: alessandro pontis
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: console_spark.png
>
>
> Hello, some of our PySpark jobs that rely on Hudi appear to be blocked. The 
> Spark console shows the situation captured in the attachment: it lists 71 
> completed jobs, but these are CDC processes that should keep reading from a 
> Kafka topic continuously. We have already verified that messages are queued 
> on the Kafka topic. If we kill the application and restart it, the job 
> sometimes runs normally and sometimes remains stuck.
> Our deployment is as follows:
> We read INSERT, UPDATE and DELETE operations from a Kafka topic and 
> replicate them to a target Hudi table stored in Hive via a PySpark job 
> running 24/7.
>  
> PYSPARK WRITE
> df_source.writeStream.foreachBatch(foreach_batch_write_function)
> FOR EACH BATCH FUNCTION:
>     # management of delete messages
>     batchDF_deletes.write.format('hudi') \
>         .option('hoodie.datasource.write.operation', 'delete') \
>         .options(**hudiOptions_table) \
>         .mode('append') \
>         .save(S3_OUTPUT_PATH)
>     # management of update and insert messages
>     batchDF_upserts.write.format('org.apache.hudi') \
>         .option('hoodie.datasource.write.operation', 'upsert') \
>         .options(**hudiOptions_table) \
>         .mode('append') \
>         .save(S3_OUTPUT_PATH)
>  
> SPARK SUBMIT
> spark-submit --master yarn --deploy-mode cluster \
>   --num-executors 1 --executor-memory 1G --executor-cores 2 \
>   --conf spark.dynamicAllocation.enabled=false \
>   --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf spark.sql.hive.convertMetastoreParquet=false \
>   --jars /usr/lib/hudi/hudi-spark-bundle.jar \
>   <path_to_script>
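
The FOR EACH BATCH FUNCTION fragment quoted above shows only the two writes, so the following is a minimal sketch of how a complete callback might look, not the reporter's actual code. The column name `op`, its values, the Hudi option values, and the output path are all assumptions made for illustration; the report does not say how `batchDF_deletes` and `batchDF_upserts` are derived from the micro-batch.

```python
# Sketch only: OP_COLUMN, the operation values, HUDI_OPTIONS and OUTPUT_PATH
# are hypothetical placeholders, not taken from the report.
OP_COLUMN = "op"
OUTPUT_PATH = "s3://example-bucket/hudi/target_table"
HUDI_OPTIONS = {
    "hoodie.table.name": "target_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
}


def write_hudi(df, operation, options=HUDI_OPTIONS, path=OUTPUT_PATH):
    """Append one DataFrame to the Hudi table using the given write operation."""
    (df.write.format("hudi")
       .option("hoodie.datasource.write.operation", operation)
       .options(**options)
       .mode("append")
       .save(path))


def foreach_batch_write_function(batch_df, batch_id):
    """Route one micro-batch: deletes first, then inserts and updates."""
    # DataFrame.filter accepts a SQL expression string, so no imports are needed.
    deletes = batch_df.filter(f"{OP_COLUMN} = 'DELETE'")
    upserts = batch_df.filter(f"{OP_COLUMN} IN ('INSERT', 'UPDATE')")
    write_hudi(deletes, "delete")
    write_hudi(upserts, "upsert")
```

A callback like this would be attached with `df_source.writeStream.foreachBatch(foreach_batch_write_function)`, followed by `.start()` to launch the query; the quoted snippet stops before that call, so whether the original job sets a `checkpointLocation` or other stream options is not known from the report.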



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
