murphp15 opened a new issue #2811:
URL: https://github.com/apache/hudi/issues/2811


   **Describe the problem you faced**
   
   I want to write to a gcs bucket from dataproc using hudi.
   
   To write to GCS with Hudi, the docs (https://hudi.apache.org/docs/gcs_hoodie) say to set the property fs.defaultFS to a gs:// value.
   
   However, when I set fs.defaultFS on Dataproc to a GCS bucket, I get errors at startup because the job cannot find my jar. It looks under a gs:/ prefix, presumably because I have overridden defaultFS, which it was previously using to find the jar. How can I fix this?
   
   
   ```
   org.apache.spark.SparkException: Application application_1617963833977_0009 
failed 2 times due to AM Container for appattempt_1617963833977_0009_000002 
exited with  exitCode: -1000
   Failing this attempt.Diagnostics: [2021-04-12 
15:36:05.142]java.io.FileNotFoundException: File not found : 
gs:/user/root/.sparkStaging/application_1617963833977_0009/myjar.jar
   ```
   
   If it is relevant, I am setting defaultFS from within the code:
   `sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://defaultFs")`
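
   A hedged workaround sketch (untested here; the bucket name and staging path are placeholders, not from this issue): Spark's `spark.yarn.stagingDir` property lets you point the YARN staging directory at an explicit bucket path, so the AM may still resolve the staged jar even with defaultFS overridden:

   ```scala
   // Hedged sketch (untested): keep fs.defaultFS on GCS, but also point
   // YARN's staging directory at an explicit, existing bucket path so the
   // AM can find the staged jar. "my-bucket" is a placeholder.
   sparkConfig.set("spark.hadoop.fs.defaultFS", "gs://my-bucket")
   sparkConfig.set("spark.yarn.stagingDir", "gs://my-bucket/spark-staging")
   ```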
   
   Is there any way to use Hudi without requiring the defaultFS property?

   **To Reproduce**
   Run a job on Dataproc with defaultFS set to gs://mybucket.
   
   **Expected behavior**
   
   Ideally, Hudi would not depend on the defaultFS property, since it seems to cause issues on Dataproc.
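
   To illustrate the behavior I would expect, here is a hedged sketch (table name, field names, and bucket paths are illustrative placeholders, not from this issue) of writing Hudi data to a fully qualified gs:// base path while leaving fs.defaultFS untouched:

   ```scala
   // Hedged sketch: pass a fully qualified gs:// base path to the Hudi
   // writer instead of relying on fs.defaultFS. All names are placeholders.
   import org.apache.spark.sql.{SaveMode, SparkSession}

   val spark = SparkSession.builder()
     .appName("hudi-gcs-sketch")
     .getOrCreate()

   val df = spark.read.parquet("gs://my-bucket/input")  // placeholder input

   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .mode(SaveMode.Append)
     .save("gs://my-bucket/hudi/my_table")  // fully qualified base path
   ```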
   
   **Environment Description**
   Dataproc writing to GCS
    
   * Hudi version : 0.7.0
   
   * Spark version : 2.4.7 
   
   * Hive version : Not using hive 
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : No
   
   
   
   

