Put the many small files into a Hadoop Archive (HAR) to improve the performance of reading them. Alternatively, run a batch job that concatenates them into fewer, larger files.
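For the archive route, something like `hadoop archive -archiveName files.har -p /input/parent /output` should do it (check the exact flags for your Hadoop version). For the concatenation route, here is a minimal local sketch in plain Scala; the object and directory names are hypothetical, and for real data you would write the merged output back to HDFS (e.g. via `hadoop fs -getmerge` or a small Spark job):

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.collection.JavaConverters._

// Hypothetical batch job: append every file in inputDir into one output file.
object ConcatSmallFiles {
  def concat(inputDir: String, outputFile: String): Unit = {
    val out = Paths.get(outputFile)
    Files.deleteIfExists(out)
    Files.createFile(out)
    // Sort for a deterministic concatenation order.
    val inputs = Files.list(Paths.get(inputDir)).iterator().asScala.toSeq.sortBy(_.toString)
    for (p <- inputs) {
      Files.write(out, Files.readAllBytes(p), StandardOpenOption.APPEND)
    }
  }
}
```

Either way, the point is to cut the per-file overhead (NameNode metadata, task scheduling) that makes reading thousands of tiny files slow.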
> On 11 Feb 2016, at 18:33, Junjie Qian <qian.jun...@outlook.com> wrote:
>
> Hi all,
>
> I am working with Spark 1.6, scala and have a big dataset divided into
> several small files.
>
> My question is: right now the read operation takes really long time and often
> has RDD warnings. Is there a way I can read the files in parallel, that all
> nodes or workers read the file at the same time?
>
> Many thanks
> Junjie