Hi

Can you kindly explain how Spark uses parallelism for a bigger (say 1GB) text
file? Does it use InputFormat to create multiple splits and create one
partition per split? Also, in the case of S3 or NFS, how does the input split
work? I understand that for HDFS, files are already pre-split, so Spark can
use dfs.blocksize to determine partitions. But how does it work for
filesystems other than HDFS?
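
For reference, here is a minimal sketch of what I mean (Scala in
spark-shell, assuming a SparkContext sc; the S3 path is hypothetical):

    // By default, Spark creates one partition per input split:
    val rdd = sc.textFile("s3a://my-bucket/big-file.txt")
    println(rdd.getNumPartitions)

    // The minPartitions hint asks for more splits (a smaller split size):
    val rdd2 = sc.textFile("s3a://my-bucket/big-file.txt", minPartitions = 64)
    println(rdd2.getNumPartitions)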

On Thu, Sep 28, 2017 at 11:26 PM, Daniel Siegmann <
dsiegm...@securityscorecard.io> wrote:

>
> no matter what you do and how many nodes you start, in case you have a
>> single text file, it will not use parallelism.
>>
>
> This is not true, unless the file is small or is gzipped (gzipped files
> cannot be split).
>
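
To illustrate Daniel's point about gzip: a single gzipped file is read by
one task regardless of the minPartitions hint (again just a sketch, with a
hypothetical path):

    // Gzip is not a splittable codec, so the whole file becomes one split:
    val gz = sc.textFile("s3a://my-bucket/big-file.txt.gz", minPartitions = 64)
    println(gz.getNumPartitions)  // prints 1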



-- 
Best Regards,
Ayan Guha
