On Amazon, S3 is usually cheaper to operate than running your own HDFS cluster.
As you correctly describe, it does not support data locality; the data is
transferred over the network to the workers.
Depending on your use case, it can make sense to use HDFS as a temporary
“cache” for S3 data.
> On 13. Dec 2017, at 09:39, Philip Lee wrote:
>
> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the
> data be spread across the workers?
The data is read directly by the workers. Just make sure that the data is
splittable, either by using a splittable format or by passing a glob of files
sc.textFile('s3://.../*.txt')
to achieve full parallelism. Otherwise a single non-splittable file (such as
one gzip archive) is read by a single task.
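To see why splittability matters, here is a simplified model of input-split
planning. This is only an illustration, not Spark's actual implementation
(the real logic lives in the Hadoop InputFormat layer); the plan_splits
function and the 128 MB block size are assumptions for the sketch:

```python
# Simplified, hypothetical model of input-split planning: a splittable
# file is divided into block-sized splits that can be read in parallel,
# while a non-splittable file (e.g. a single .gz) becomes one split,
# and therefore one task.

def plan_splits(files, block_size=128 * 1024 * 1024):
    """files: list of (path, size_in_bytes); returns (path, offset, length) per split."""
    splits = []
    for path, size in files:
        if path.endswith(".gz"):          # gzip is not splittable
            splits.append((path, 0, size))
        else:                             # plain text: split on block boundaries
            offset = 0
            while offset < size:
                length = min(block_size, size - offset)
                splits.append((path, offset, length))
                offset += length
    return splits

# One 1 GB gzip file -> a single split, so no parallelism:
print(len(plan_splits([("s3://bucket/logs.gz", 1024 ** 3)])))   # 1

# The same data as eight 128 MB text files -> eight parallel tasks:
parts = [("s3://bucket/part-%d.txt" % i, 128 * 1024 * 1024) for i in range(8)]
print(len(plan_splits(parts)))                                  # 8
```

The same effect explains why passing many files (or one splittable file) to
sc.textFile spreads the work across all workers.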