Hi,
In general, Spark is fine with that many Parquet files:
// generate 100,000 small Parquet files; the fourth argument to range
// is the number of partitions, which becomes the number of output files
spark.range(0, 1000000, 1, 100000).write.parquet("too-many-files.parquet")

// read the 100,000 Parquet files back
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()
Reading the files takes a few seconds, so there is no problem with the
number of files.
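If you want to reproduce that timing on your side, spark-shell's spark.time helper can wrap the read; this is only a sketch that reuses the path from the example above:

// prints the wall-clock time of reading and counting the 100,000 files
spark.time { spark.read.parquet("too-many-files.parquet").count() }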
What exactly do you mean by "But after spark.read.parquet, it is not
able to proceed further"?
Does that mean that executing the line
val df = spark.read.parquet("too-many-files.parquet")
takes forever?
How long do individual tasks take? How many tasks are there for this line?
Where are the Parquet files stored? Where does the Spark job run?
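As a rough cross-check of the task count from spark-shell, you can print the number of partitions of the scan, which is roughly the number of read tasks; this is just a sketch and assumes df is the DataFrame returned by your spark.read.parquet call:

// roughly the number of tasks the read stage will run
println(df.rdd.getNumPartitions)

The per-task durations are easiest to see in the Spark UI on the stage that performs the scan.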
Enrico
On 03.10.22 at 18:22, Sachit Murarka wrote:
Hello,
I am reading too many files in Spark 3.2 (Parquet). It is not giving
any error in the logs. But after spark.read.parquet, it is not able
to proceed further.
Can anyone please suggest if there is any property to improve the
parallel reads? I am reading more than 25000 files.
Kind Regards,
Sachit Murarka