[ https://issues.apache.org/jira/browse/SPARK-47842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-47842: ---------------------------------- Priority: Major (was: Blocker) > Spark job relying over Hudi are blocked after one or zero commit > ---------------------------------------------------------------- > > Key: SPARK-47842 > URL: https://issues.apache.org/jira/browse/SPARK-47842 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming > Affects Versions: 3.3.0 > Environment: Hudi version : 0.12.1-amzn-0 > Spark version : 3.3.0 > Hive version : 3.1.3 > Hadoop version : 3.3.3 amz > Storage (HDFS/S3/GCS..) : S3 > Running on Docker? (yes/no) : no (EMR 6.9.0) > Additional context > Reporter: alessandro pontis > Priority: Major > Labels: pull-request-available > Attachments: console_spark.png > > > Hello, we are facing the fact that some pyspark job that rely on Hudi seems > to be blocked, in fact if we go over the spark console we can see the > situation in the attachment > we can see that we have 71 completed jobs but those are CDC process that > should read from Kafka topic continuously. We verified yet that there are > messages queued over the kafka topic. If you kill the application and then > restart in some cases the job will act normally and other times the job still > remain stacked. > Our deploy condition are the following: > We read INSERT, UPDATE and DELETE operation from a Kafka topic and we > replicate them in a target hudi table stored on Hive via a pyspark job > running 24/7 > > PYSPARK WRITE > df_source.writeStream.foreachBatch(foreach_batch_write_function) > {{ FOR EACH BATCH FUNCTION: > #management of delete messages > batchDF_deletes.write.format('hudi') \ > .option('hoodie.datasource.write.operation', 'delete') \ > .options(**hudiOptions_table) \ > .mode('append') \ > .save(S3_OUTPUT_PATH) > #management of update and insert messages > batchDF_upserts.write.format('org.apache.hudi') \ > .option('hoodie.datasource.write.operation', 'upsert') \ > .options(**hudiOptions_table) \ > .mode('append') \ > .save(S3_OUTPUT_PATH)}} > > SPARK SUBMIT > spark-submit --master yarn --deploy-mode cluster --num-executors 1 > --executor-memory 1G --executor-cores 2 --conf > spark.dynamicAllocation.enabled=false --packages > org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 --conf > spark.serializer=org.apache.spark.serializer.KryoSerializer --conf > spark.sql.hive.convertMetastoreParquet=false --jars > /usr/lib/hudi/hudi-spark-bundle.jar <path_to_script> -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org