mithalee opened a new issue #3336: URL: https://github.com/apache/hudi/issues/3336
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Inserting data into a Hudi table on S3 with the DeltaStreamer utility works, but a follow-up DeltaStreamer job that flags records for deletion via the `_hoodie_is_deleted` column fails and leaves rollback files behind.

**To Reproduce**

Steps to reproduce the behavior:

1. Run a spark-submit job to insert data into a Hudi table stored in S3, using the Hudi DeltaStreamer utility with the `_hoodie_is_deleted` column set to `False` for all records. This works successfully. The source is a parquet file in S3, generated from a pandas DataFrame in a separate process (not part of the spark-submit).
2. Run a spark-submit job to delete data from the existing Hudi table in S3, using the Hudi DeltaStreamer utility with the `_hoodie_is_deleted` column set to `True` for all the records that need to be deleted. The delete file is also a parquet file in S3, generated from a pandas DataFrame in a separate process. This job throws an error, and I see rollback files in the `.hoodie` folder.
3. Both jobs run on Spark on Kubernetes: https://spark.apache.org/docs/3.1.1/configuration.html#kubernetes

**Expected behavior**

The rows with `_hoodie_is_deleted=True` should be deleted from the existing Hudi table in S3.

**Environment Description**

* Hudi version : 0.8.0
* Spark version : 3.1.1 (https://spark.apache.org/docs/3.1.1/configuration.html#kubernetes)
* Hive version :
* Hadoop version : 3.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : yes

**Additional context**

**SPARK SUBMIT COMMAND FOR INITIAL LOAD TO HUDI TABLE (THIS ONE WORKS SUCCESSFULLY):**

```
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field two \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://.../hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=two \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=two \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_2.parquet
```

**SPARK SUBMIT COMMAND FOR DELETE FROM HUDI TABLE (THIS ONE THROWS ERROR):**

```
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field two \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://.../hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=two \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=two \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert2.parquet
```

The two commands are identical except for the source path in `hoodie.deltastreamer.source.dfs.root`. Add any other context about the problem here.
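To be explicit about the outcome I expect from the two batches, here is a minimal sketch of the `_hoodie_is_deleted` delete semantics in plain pandas. This only illustrates the expected end state (latest record per key wins, and keys whose latest record is flagged deleted are dropped); it is not how Hudi applies deletes internally. The record key is `two`, as configured in the commands above.

```python
import pandas as pd

# Initial insert batch: all rows carry _hoodie_is_deleted=False.
inserts = pd.DataFrame({'one': [-1, 3, 2.5],
                        'two': [100, 101, 103],  # 'two' is the record key
                        '_hoodie_is_deleted': [False, False, False]})

# Delete batch: the row with key 103 is flagged for deletion.
deletes = pd.DataFrame({'one': [2.5],
                        'two': [103],
                        '_hoodie_is_deleted': [True]})

# Expected table after the delete job: keep the latest record per key,
# then drop records whose latest version is flagged deleted.
latest = pd.concat([inserts, deletes]).drop_duplicates('two', keep='last')
expected = latest[~latest['_hoodie_is_deleted']]
print(sorted(expected['two']))  # keys 100 and 101 remain; 103 is gone
```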
**Code to generate the parquet files that I am using:**

**New insert parquet file (example_2.parquet):**

```python
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [-1, 3, 2.5],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_2.parquet')
```

**Upsert/delete parquet file (example_upsert2.parquet):**

```python
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [-1, 3, 2.5],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_upsert2.parquet')
```

**Stacktrace**

Stack trace from the Kubernetes dashboard is attached.
