.gz files are not splittable, so the entire file is read by a single task,
which makes it harder to process in parallel. The easiest fix is to move to
a splittable compression such as LZO (with an index) so the file is broken
into multiple blocks that can be read and processed in parallel.
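If recompressing isn't an option, you can at least repartition after the
(serial) read so the downstream stages use all your cores. A rough sketch
in the Spark shell (Scala); the HDFS paths, the partition count of 48, and
the hadoop-lzo dependency are my own illustrative assumptions, not
something from this thread:

    import org.apache.hadoop.io.{LongWritable, Text}

    // gzip is not splittable, so this RDD starts out as a single partition
    val gz = sc.textFile("hdfs:///data/input.gz")

    // spread the records across the cluster so later stages run in parallel
    val spread = gz.repartition(48)
    spread.cache()

    // With hadoop-lzo on the classpath and an indexed .lzo file, the read
    // itself is split across tasks:
    // val lzo = sc.newAPIHadoopFile[LongWritable, Text,
    //   com.hadoop.mapreduce.LzoTextInputFormat]("hdfs:///data/input.lzo")
    //   .map(_._2.toString)

Note that the gzip read itself is still a single task either way;
repartition only parallelizes the work that comes after it.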
On 11 May 2014 09:01, "Soumya Simanta" <soumya.sima...@gmail.com> wrote:

>
>
> I've a Spark cluster with 3 worker nodes.
>
>
>    - *Workers:* 3
>    - *Cores:* 48 Total, 48 Used
>    - *Memory:* 469.8 GB Total, 72.0 GB Used
>
> I want to process a single compressed (*.gz) file on HDFS. The file is
> 1.5GB compressed and 11GB uncompressed.
> When I try to read the compressed file from HDFS, it takes a while (4-5
> minutes) to load it into an RDD. If I use the .cache operation it takes
> even longer. Is there a way to make loading of the RDD from HDFS faster?
>
> Thanks
>  -Soumya
>
>
>
