Pack the many small files into a Hadoop Archive (HAR) to improve the performance of reading them. Alternatively, run a batch job that concatenates them into fewer, larger files.
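As a rough sketch of the HAR route (the paths and archive name here are made up for illustration), you build the archive once with the `hadoop archive` tool and then point Spark at the `har://` URI, which the Hadoop FileSystem layer resolves like any other path:

```shell
# Archive all small files under /user/data/small into one HAR
# (assumed paths; -p sets the parent dir the sources are relative to)
hadoop archive -archiveName small.har -p /user/data small /user/archives

# The archive can then be read in Spark as a single source, e.g.:
#   sc.textFile("har:///user/archives/small.har/small")
```

Note that a HAR only reduces NameNode pressure and file-open overhead; the files are not merged, so if the per-record layout is the problem, the concatenation batch job is the better option.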

> On 11 Feb 2016, at 18:33, Junjie Qian <qian.jun...@outlook.com> wrote:
> 
> Hi all,
> 
> I am working with Spark 1.6, scala and have a big dataset divided into 
> several small files.
> 
> My question is: right now the read operation takes a really long time and 
> often produces RDD warnings. Is there a way I can read the files in parallel, 
> so that all nodes or workers read the files at the same time?
> 
> Many thanks
> Junjie
