Re: Reading too many files

2022-10-05 Thread Enrico Minack
Hi,

Spark is fine with that many Parquet files in general:

// generate 100,000 small Parquet files
spark.range(0, 100000, 1, 100000).write.parquet("too-many-files.parquet")

// read 100,000 Parquet files
val df = spark.read.parquet("too-many-files.parquet")
df.show()
df.count()

Reading the files
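A quick follow-up sketch (an addition, reusing the DataFrame from the snippet above): Spark packs many small files into far fewer read partitions, so 100,000 files does not mean 100,000 tasks.

// Inspect how many input partitions the scan produced; Spark coalesces
// small files up to spark.sql.files.maxPartitionBytes (128 MB by default).
println(df.rdd.getNumPartitions)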

Re: Reading too many files

2022-10-04 Thread Artemis User
Reads can't be parallelized by default in a Spark job, and doing your own multi-threaded programming in a Spark program isn't a good idea. Adding fast disk I/O and increasing RAM may speed things up, but won't help with parallelization. You may have to be more creative here. One option would be,
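The option itself is truncated above; one commonly suggested approach (an assumption, not necessarily what the author had in mind) is to supply the schema explicitly, so the driver does not have to open thousands of Parquet footers to infer it:

import org.apache.spark.sql.types._

// Hypothetical schema; adjust the fields to match your actual data.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("value", StringType, nullable = true)
))

// With an explicit schema, Spark skips footer reads for inference.
val df = spark.read.schema(schema).parquet("/path/to/many-files")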

Re: Reading too many files

2022-10-03 Thread Henrik Pang
You may need large cluster memory and fast disk I/O.

Sachit Murarka wrote: Can anyone please suggest if there is any property to improve the parallel reads? I am reading more than 25000 files.
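On the "any property" question, a tuning sketch using real Spark SQL options (the values shown are the defaults, as starting points; the right ones depend on file sizes and cluster resources):

// Bytes packed into a single read partition (default 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
// Estimated cost of opening a file, in bytes; raising it packs more
// small files into each partition (default 4 MB).
spark.conf.set("spark.sql.files.openCostInBytes", "4194304")
// List paths in parallel on the cluster once a source has more than
// this many of them (default 32).
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")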

Re: Reading too many files

2022-10-03 Thread Sid
Are you trying to run on cloud?

On Mon, 3 Oct 2022, 21:55 Sachit Murarka wrote:
> Hello,
>
> I am reading too many files in Spark 3.2 (Parquet). It is not giving any
> error in the logs. But after spark.read.parquet, it is not able to proceed
> further.
> Can anyone please suggest if there is
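A diagnostic sketch for the hang described above (an addition; the path is a placeholder): timing the read setup separately from the first action shows whether the stall is driver-side file listing rather than the scan itself.

// spark.read.parquet triggers file listing and schema inference on the
// driver; if this step alone takes minutes, listing is the bottleneck.
val t0 = System.nanoTime
val df = spark.read.parquet("/path/to/many-files")
println(s"listing + inference: ${(System.nanoTime - t0) / 1e9} s")

val t1 = System.nanoTime
println(s"count = ${df.count()}, scan: ${(System.nanoTime - t1) / 1e9} s")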