I have a use case in which I need to read Parquet files in parallel from more than 1,000 directories. I am doing something like this:
```scala
val df = list.toList.toDF()

df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
```

In the `doSomething` method, when I try to do this:

```scala
val df1 = spark.read.parquet(pathToRead).collect()
```

I get the `NullPointerException` shown below. It seems `spark.read` only works on the driver, not on the executors. How can I achieve what I want to do? Please let me know. Thank you.

```
21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9, ip-10-0-5-3.us-west-2.compute.internal, executor 11): java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
```
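For context, the only driver-side workaround I can think of is sketched below. This is just a sketch, not my actual code: `paths` is a placeholder for my 1,000+ directory paths, and I am assuming the varargs overload of `spark.read.parquet` plus Scala 2.12 parallel collections for concurrency.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParallelParquetRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-parquet").getOrCreate()

    // Placeholder: in my case this would hold the 1,000+ directories.
    val paths: Seq[String] = Seq("s3://bucket/dir1", "s3://bucket/dir2")

    // Option 1: one read over all directories. spark.read.parquet is
    // varargs, so a single call covers every path and the file scan
    // itself is parallelized across the executors.
    val combined: DataFrame = spark.read.parquet(paths: _*)

    // Option 2: issue one read per directory concurrently from the
    // driver. Each map body runs on a driver thread; the resulting
    // Spark jobs still execute on the cluster.
    val frames: Seq[DataFrame] = paths.par.map(p => spark.read.parquet(p)).seq

    combined.show()
    frames.foreach(_.show())
  }
}
```

My understanding is that the NPE happens because the `SparkSession` is null when the closure is deserialized inside an executor task, so any `spark.read` call has to be driven from the driver side like this, but I would like confirmation.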