Hi,
In general, Spark is fine with that many Parquet files:
// generate 100,000 small Parquet files; the fourth argument to range
// is the number of partitions, which becomes the number of output files
spark.range(0, 1000000, 1, 100000).write.parquet("too-many-files.parquet")

// read the 100,000 Parquet files back
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()
Reading the files takes a few seconds, so there is no problem with the
number of files.
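If you want to reproduce that timing on your side, spark-shell's spark.time helper can wrap the read; this is only a sketch that reuses the path from the example above:

// prints the wall-clock time of reading and counting the 100,000 files
spark.time { spark.read.parquet("too-many-files.parquet").count() }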
What exactly do you mean by "But after spark.read.parquet, it is not
able to proceed further"?
Does that mean that executing the line
val df = spark.read.parquet("too-many-files.parquet")
takes forever?
How long do individual tasks take? How many tasks are there for this line?
Where are the Parquet files stored? Where does the Spark job run?
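As a rough cross-check of the task count from spark-shell, you can print the number of partitions of the scan, which is roughly the number of read tasks; this is just a sketch and assumes df is the DataFrame returned by your spark.read.parquet call:

// roughly the number of tasks the read stage will run
println(df.rdd.getNumPartitions)

The per-task durations are easiest to see in the Spark UI on the stage that performs the scan.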
Enrico
On 03.10.22 at 18:22, Sachit Murarka wrote:
Hello,
I am reading too many files in Spark 3.2 (Parquet). It is not giving
any error in the logs. But after spark.read.parquet, it is not able
to proceed further.
Can anyone please suggest if there is any property to improve the
parallel reads? I am reading more than 25000 files.
Kind Regards,
Sachit Murarka