Why not just read from Spark as normal? Do these files have different or incompatible schemas?
    val df = spark.read.option("mergeSchema", "true").load(listOfPaths)

From: Eric Beabes <mailinglist...@gmail.com>
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user <user@spark.apache.org>
Subject: Reading parquet files in parallel on the cluster

I've a use case in which I need to read Parquet files in parallel from 1000+ directories. I am doing something like this:

    val df = list.toList.toDF()

    df.foreach(c => {
      val config = getConfigs()
      doSomething(spark, config)
    })

In the doSomething method, when I try to do this:

    val df1 = spark.read.parquet(pathToRead).collect()

I get a NullPointerException given below. It seems the 'spark.read' only works on the Driver, not on the cluster. How can I do what I want to do? Please let me know. Thank you.

    21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9, ip-10-0-5-3.us-west-2.compute.internal, executor 11): java.lang.NullPointerException
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
        at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
        at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
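To expand on the suggestion above: the NullPointerException happens because df.foreach runs its closure on the executors, and the SparkSession (and therefore spark.read) is only available on the driver. Instead of looping over directories inside an action, you can hand all the paths to a single reader call on the driver; Spark then distributes the actual file reads across the cluster. Below is a minimal sketch of that approach, assuming listOfPaths is a Seq[String] of the 1000+ directory paths (the app name and paths are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ReadManyParquetDirs").getOrCreate()

    // Illustrative paths; in practice this would be your 1000+ directories.
    val listOfPaths: Seq[String] = Seq("s3://bucket/dir1", "s3://bucket/dir2")

    // One reader call with all paths: Spark plans a single scan and reads the
    // files in parallel on the executors. mergeSchema reconciles differing
    // Parquet schemas across directories, at the cost of an extra footer pass.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet(listOfPaths: _*)

If the directories do need per-directory configuration (the getConfigs/doSomething part), that logic would have to stay on the driver, e.g. by looping over the paths there and unioning the resulting DataFrames, rather than calling spark.read inside an executor-side closure.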