Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this, rather than with spark.read.parquet?
That's also parallel, of course.
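For example, something like this on the driver should cover all the directories
in a single distributed read (just a sketch; I'm assuming your "list" holds the
1000+ directory paths as strings):

    val paths: Seq[String] = list.toList        // assumed: the directory path strings
    val df = spark.read.parquet(paths: _*)      // one distributed scan over all of them
    // Filter or aggregate df as usual; Spark parallelizes the read across the cluster.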
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
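If you do go that route, here is a rough sketch with the standalone Parquet/Avro
reader (this assumes parquet-avro is on the executor classpath; the helper below
is just illustrative, not something from your code):

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetReader
    import org.apache.parquet.hadoop.util.HadoopInputFile

    // Reads one Parquet file without touching the SparkSession, so it is
    // safe to call from code running on an executor.
    def readParquetFile(pathStr: String): Seq[GenericRecord] = {
      val inputFile = HadoopInputFile.fromPath(new Path(pathStr), new Configuration())
      val reader = AvroParquetReader.builder[GenericRecord](inputFile).build()
      try {
        Iterator.continually(reader.read()).takeWhile(_ != null).toVector
      } finally {
        reader.close()
      }
    }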

On Tue, May 25, 2021 at 12:24 PM Eric Beabes <mailinglist...@gmail.com>
wrote:

> I've a use case in which I need to read Parquet files in parallel from
> over 1000+ directories. I am doing something like this:
>
>    val df = list.toList.toDF()
>
>     df.foreach(c => {
>       val config = getConfigs()
>       doSomething(spark, config)
>     })
>
>
> In the doSomething method, when I try to do this:
>
> val df1 = spark.read.parquet(pathToRead).collect()
>
>
> I get a NullPointerException, shown below. It seems 'spark.read' only
> works on the Driver, not on the cluster. How can I do what I want to do?
> Please let me know. Thank you.
>
>
> 21/05/25 17:03:50 WARN TaskSetManager: Lost task 2.0 in stage 8.0 (TID 9,
> ip-10-0-5-3.us-west-2.compute.internal, executor 11): java.lang.NullPointerException
>         at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
>         at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
>         at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
>         at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
>
