I have a use case in which I need to read Parquet files in parallel from more than 1,000 directories. I am doing something like this:
```scala
val df = list.toList.toDF()

df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
```

In the `doSomething` method, when I try to do this:

```scala
val df1 = spark.read.parquet(pathToRead).collect()
```

I get the `NullPointerException` shown below. It seems `spark.read` only works on the driver, not on the executors. How can I achieve what I want to do? Please let me know. Thank you.

```
21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9, ip-10-0-5-3.us-west-2.compute.internal, executor 11): java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
```
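For context, the only driver-side workaround I can think of is sketched below. This is just a sketch, not my actual code: `paths` is a placeholder for my 1,000+ directory paths, and I am assuming the varargs overload of `spark.read.parquet` plus Scala 2.12 parallel collections for concurrency.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelParquetRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-parquet").getOrCreate()

    // Placeholder: in my case this would hold the 1,000+ directories.
    val paths: Seq[String] = Seq("s3://bucket/dir1", "s3://bucket/dir2")

    // Option 1: one read over all directories. spark.read.parquet is
    // varargs, so a single call covers every path and the file scan
    // itself is parallelized across the executors.
    val combined: DataFrame = spark.read.parquet(paths: _*)

    // Option 2: issue one read per directory concurrently from the
    // driver. Each map body runs on a driver thread; the resulting
    // Spark jobs still execute on the cluster.
    val frames: Seq[DataFrame] = paths.par.map(p => spark.read.parquet(p)).seq

    combined.show()
    frames.foreach(_.show())
  }
}
```

My understanding is that the NPE happens because the `SparkSession` is null when the closure is deserialized inside an executor task, so any `spark.read` call has to be driven from the driver side like this, but I would like confirmation.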