mithalee opened a new issue #3336: URL: https://github.com/apache/hudi/issues/3336
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Inserting data into a Hudi table on S3 with the DeltaStreamer utility works, but a follow-up DeltaStreamer job that flags records for deletion via the `_hoodie_is_deleted` column fails and leaves rollback files behind.

**To Reproduce**

Steps to reproduce the behavior:

1. Run a spark-submit job to insert data into a Hudi table stored in S3, using the Hudi DeltaStreamer utility with the `_hoodie_is_deleted` column set to `False` for all records. This works successfully. The source is a parquet file in S3, generated from a pandas DataFrame in a separate process (not part of the spark-submit).
2. Run a spark-submit job to delete data from the existing Hudi table in S3, using the Hudi DeltaStreamer utility with the `_hoodie_is_deleted` column set to `True` for all the records that need to be deleted. The delete file is also a parquet file in S3, generated from a pandas DataFrame in a separate process. This job throws an error, and I see rollback files in the `.hoodie` folder.
3. Both jobs run on Spark on Kubernetes: https://spark.apache.org/docs/3.1.1/configuration.html#kubernetes

**Expected behavior**

The rows with `_hoodie_is_deleted=True` should be deleted from the existing Hudi table in S3.

**Environment Description**

* Hudi version : 0.8.0
* Spark version : 3.1.1 (https://spark.apache.org/docs/3.1.1/configuration.html#kubernetes)
* Hive version :
* Hadoop version : 3.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : yes

**Additional context**

**SPARK SUBMIT COMMAND FOR INITIAL LOAD TO HUDI TABLE (THIS ONE WORKS SUCCESSFULLY):**

```
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field two \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://.../hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=two \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=two \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_2.parquet
```

**SPARK SUBMIT COMMAND FOR DELETE FROM HUDI TABLE (THIS ONE THROWS ERROR):**

```
./spark-submit --master k8s://https://..sk1.us-west-1.eks.amazonaws.com \
  --deploy-mode cluster \
  --name spark-hudi \
  --jars https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=../spark:spark-hudi-0.2 \
  --conf spark.kubernetes.namespace=spark-k8 \
  --conf spark.kubernetes.container.image.pullSecrets=dockercloud-secret \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A...R' \
  --conf spark.hadoop.fs.s3a.secret.key='L..gG' \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  s3a://lightbox-sandbox-dev/hudi-root/spark-submit-jars/hudi-utilities-bundle_2.12-0.8.0.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field two \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3a://.../hudi-root/transformed-tables/hudi_writer_mm4/ \
  --target-table test_table \
  --base-file-format PARQUET \
  --hoodie-conf hoodie.datasource.write.recordkey.field=two \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=two \
  --hoodie-conf hoodie.datasource.write.precombine.field=two \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://../jen/example_upsert2.parquet
```

The two commands are identical except for the source path in `hoodie.deltastreamer.source.dfs.root`. Add any other context about the problem here.
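To be explicit about the outcome I expect from the two batches, here is a minimal sketch of the `_hoodie_is_deleted` delete semantics in plain pandas. This only illustrates the expected end state (latest record per key wins, and keys whose latest record is flagged deleted are dropped); it is not how Hudi applies deletes internally. The record key is `two`, as configured in the commands above.

```python
import pandas as pd

# Initial insert batch: all rows carry _hoodie_is_deleted=False.
inserts = pd.DataFrame({'one': [-1, 3, 2.5],
                        'two': [100, 101, 103],  # 'two' is the record key
                        '_hoodie_is_deleted': [False, False, False]})

# Delete batch: the row with key 103 is flagged for deletion.
deletes = pd.DataFrame({'one': [2.5],
                        'two': [103],
                        '_hoodie_is_deleted': [True]})

# Expected table after the delete job: keep the latest record per key,
# then drop records whose latest version is flagged deleted.
latest = pd.concat([inserts, deletes]).drop_duplicates('two', keep='last')
expected = latest[~latest['_hoodie_is_deleted']]
print(sorted(expected['two']))  # keys 100 and 101 remain; 103 is gone
```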
**Code to generate the parquet files that I am using:**

**New insert parquet file (example_2.parquet):**

```python
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [-1, 3, 2.5],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, False]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_2.parquet')
```

**Upsert/delete parquet file (example_upsert2.parquet):**

```python
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [-1, 3, 2.5],
                   'two': [100, 101, 103],
                   'three': [True, True, True],
                   '_hoodie_is_deleted': [False, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
print(table)
pq.write_table(table, 'example_upsert2.parquet')
```

**Stacktrace**

Stack trace from the Kubernetes dashboard is attached.
