kartik18 opened a new issue #5211: URL: https://github.com/apache/hudi/issues/5211
**Describe the problem you faced**

I have the following S3 directory structure (the `_$folder_` entries are files, not directories):

```
folder/
├── cluster=abc_$folder_        // a file, not a directory
├── cluster=abc/
│   └── dt=2022-01-01/
│       ├── A1.parquet
│       └── A2.parquet
├── cluster=efg_$folder_        // a file, not a directory
└── cluster=efg/
    └── dt=2022-01-01/
        ├── B1.parquet
        └── B2.parquet
```

I am trying to read only the subfolders that contain parquet files, so I load the data with a glob pattern:

```python
spark.read.format("org.apache.hudi").load("s3://bucket/folder/*[^_$folder_]/dt=2022-01-01/*.parquet")
```

But this fails with:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o127.load.
: java.lang.NullPointerException
	at org.apache.hudi.HoodieSparkUtils$$anonfun$globPath$1$$anonfun$1.apply(HoodieSparkUtils.scala:82)
```

However, if I provide the full path, the data is read successfully:

```python
spark.read.format("org.apache.hudi").load("s3://bucket/folder/cluster=abc/dt=2022-01-01/*.parquet")
```

**To Reproduce**

Steps to reproduce the behavior:

1. Create the folder structure above.
2. Use a glob pattern that selects only the parquet files.
3. Load the data as a DataFrame.

**Expected behavior**

The glob pattern should be applied so that only the subfolders containing parquet files are read, instead of the load failing with a NullPointerException.

**Environment Description**

* Hudi version : 0.10
* Spark version : 2.4
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

A possible client-side workaround is sketched after the stacktrace below.

**Stacktrace**

```
py4j.protocol.Py4JJavaError: An error occurred while calling o127.load.
: java.lang.NullPointerException
	at org.apache.hudi.HoodieSparkUtils$$anonfun$globPath$1$$anonfun$1.apply(HoodieSparkUtils.scala:82)
```
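One possible workaround until the glob handling is fixed: enumerate the `cluster=*` prefixes client-side, which naturally skips the zero-byte `_$folder_` marker objects, and then load each concrete partition path individually (full paths are confirmed to work above). This is an untested sketch, not the Hudi-recommended approach; it assumes `boto3` is available, and the bucket/prefix names are placeholders taken from the example.

```python
# Untested workaround sketch: list the "cluster=*" prefixes with boto3 and
# load each concrete partition path, since full paths work even though the
# combined glob triggers the NPE. Bucket/prefix names are placeholders.
from functools import reduce

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3 = boto3.client("s3")
# With Delimiter="/", real "directories" come back under CommonPrefixes,
# while marker objects like "folder/cluster=abc_$folder_" appear under
# Contents, so they are excluded without any explicit filtering.
# (Note: list_objects_v2 returns at most 1000 entries per call; paginate
# if the bucket has more prefixes than that.)
resp = s3.list_objects_v2(Bucket="bucket", Prefix="folder/", Delimiter="/")

partition_paths = [
    "s3://bucket/{}dt=2022-01-01/*.parquet".format(p["Prefix"])
    for p in resp.get("CommonPrefixes", [])
]

# Load each partition individually and union the results.
dfs = [spark.read.format("org.apache.hudi").load(p) for p in partition_paths]
df = reduce(lambda a, b: a.union(b), dfs)
```

Loading per-partition and unioning avoids relying on Hudi's glob resolution entirely, at the cost of one `load` call per partition.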