Hi, can you kindly explain how Spark uses parallelism for a bigger (say 1 GB) text file? Does it use InputFormat to create multiple splits, and does it create one partition per split? Also, in the case of S3 or NFS, how does input splitting work? I understand that for HDFS, files are already pre-split, so Spark can use dfs.blocksize to determine partitions. But how does it work for sources other than HDFS?
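For reference, one way to observe the split-to-partition mapping directly is to check getNumPartitions. A minimal sketch below, assuming a spark-shell session (where sc is predefined) and a hypothetical path /data/big.txt:

    // In spark-shell; the path is hypothetical, substitute your own file.
    // Spark delegates splitting to Hadoop's TextInputFormat, and each
    // resulting input split becomes one RDD partition.
    val rdd = sc.textFile("/data/big.txt")
    println(rdd.getNumPartitions)  // one partition per input split

    // Passing a minPartitions hint asks TextInputFormat to compute
    // smaller splits, which raises the partition count:
    val rdd2 = sc.textFile("/data/big.txt", minPartitions = 64)
    println(rdd2.getNumPartitions)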
On Thu, Sep 28, 2017 at 11:26 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

>> no matter what you do and how many nodes you start, in case you have a
>> single text file, it will not use parallelism.
>
> This is not true, unless the file is small or is gzipped (gzipped files
> cannot be split).

--
Best Regards,
Ayan Guha
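To see the splittable-vs-gzipped difference Daniel describes, a quick spark-shell check is enough. A sketch, assuming hypothetical paths to the same ~1 GB data set stored once as plain text and once gzipped:

    // Paths are hypothetical; sc is the spark-shell SparkContext.
    val plain = sc.textFile("/data/big.txt")
    val gz    = sc.textFile("/data/big.txt.gz")

    println(plain.getNumPartitions) // typically many: one per input split
    println(gz.getNumPartitions)    // 1: gzip is not a splittable codec

Since the whole gzip stream must be decompressed sequentially, the gzipped file yields a single partition and thus no read parallelism, which is exactly the exception noted above.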