mithalee commented on issue #3336: URL: https://github.com/apache/hudi/issues/3336#issuecomment-887201790
I changed the input data set a bit so that I can provide 3 different fields for the 3 different configs mentioned.

Spark submit for the initial insert:

```sh
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://lightbox-sandbox-dev/hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=uuid \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=ts \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert2.parquet
```

My input parquet file has the below columns:

```python
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import uuid

df = pd.DataFrame({'uuid': [str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4())],
                   'ts': [1, 2, 3],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_5.parquet')
```

The above initial insert into the Hudi table works successfully. I then tried to perform an update, this time as you mentioned:

```python
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import uuid

df = pd.DataFrame({'uuid': [str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4())],
                   'ts': [1, 2, 3],
                   'two': [100, 101, 103],
                   'three': [True, True, False],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_upsert5.parquet')
```

Spark submit for the upsert:

```sh
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://lightbox-sandbox-dev/hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=uuid \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=ts \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert5.parquet
```

This one fails with an error. The stack trace from the Kubernetes dashboard is attached:
[logs-from-spark-kubernetes-driver-in-spark-hudi-7c40487ae63865bc-driver.txt](https://github.com/apache/hudi/files/6882689/logs-from-spark-kubernetes-driver-in-spark-hudi-7c40487ae63865bc-driver.txt)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
