Hi,
Spark is fine with that many Parquet files in general:
// generate 100,000 small Parquet files
spark.range(0, 100000, 1, 100000).write.parquet("too-many-files.parquet")
// read 100,000 Parquet files
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()
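
As a quick sanity check (just a sketch against the demo path above), you can also confirm how many files Spark discovered and how many read tasks it will schedule:

// Sanity check for the demo above (path assumed from the example)
val df = spark.read.parquet("too-many-files.parquet")
println(df.inputFiles.length)      // number of Parquet files discovered
println(df.rdd.getNumPartitions)   // number of read tasks (small files get coalesced)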
Reading the files:
Reads by default can't be parallelized any further within a Spark job, and
doing your own multi-threaded programming in a Spark program isn't a good
idea. Adding fast disk I/O and increasing RAM may speed things up, but won't
help with parallelization. You may have to be more creative here. One option
would be a large cluster with plenty of memory and fast disk I/O.
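
If the job stalls during file listing or schema inference rather than the
actual scan, a few standard settings are worth experimenting with. This is
only a rough sketch with placeholder values, not a recommendation:

// Sketch only: knobs commonly tuned when reading tens of thousands of small
// Parquet files; the values below are placeholders.
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// File listing switches from the driver to a distributed job above this many paths
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")

// Supplying the schema up front skips Parquet footer reads for schema inference
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
val df = spark.read.schema(schema).parquet("too-many-files.parquet")

// Control how many small files get packed into each read task
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")     // 4 MB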
Sachit Murarka wrote:
> Can anyone please suggest if there is any property to improve the
> parallel reads? I am reading more than 25,000 files.
Are you trying to run in the cloud?
On Mon, 3 Oct 2022, 21:55 Sachit Murarka wrote:
> Hello,
>
> I am reading too many files in Spark 3.2 (Parquet). It is not giving any
> error in the logs. But after spark.read.parquet, it is not able to proceed
> further.
> Can anyone please suggest if there is