mithalee commented on issue #3336: URL: https://github.com/apache/hudi/issues/3336#issuecomment-887201790
I changed the input data set a bit so that I can provide 3 different fields for the 3 different configs mentioned.

Spark submit for the initial insert:

```sh
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://lightbox-sandbox-dev/hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=uuid \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=ts \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert2.parquet
```

My input parquet file has the below columns:

```python
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import uuid

df = pd.DataFrame({'uuid': [str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4())],
                   'ts': [1, 2, 3],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_5.parquet')
```

The above initial insert into the Hudi table works successfully. I then tried to perform an update, this time as you mentioned:

```python
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import uuid

df = pd.DataFrame({'uuid': [str(uuid.uuid4()), str(uuid.uuid4()), str(uuid.uuid4())],
                   'ts': [1, 2, 3],
                   'two': [100, 101, 103],
                   'three': [True, True, False],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_upsert5.parquet')
```

Spark submit for the upsert:

```sh
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://lightbox-sandbox-dev/hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=uuid \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=ts \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert5.parquet
```

This one fails with an error. The stack trace from the Kubernetes dashboard is attached:
[logs-from-spark-kubernetes-driver-in-spark-hudi-7c40487ae63865bc-driver.txt](https://github.com/apache/hudi/files/6882689/logs-from-spark-kubernetes-driver-in-spark-hudi-7c40487ae63865bc-driver.txt)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
