On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote:
Note that if you use multi-part upload, each part becomes one block, which
allows for multiple concurrent readers. One would typically use a fixed part
size aligned with Spark's default HDFS block size (64 MB, I think) to ensure
the reads are aligned.
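
If it helps, a minimal sketch of pinning the part size, assuming the s3a
connector and an existing SparkContext `sc` (the 64 MB value just mirrors
the block size mentioned above):

    // Part size for multi-part uploads; each part then maps to one "block"
    // for readers. 64 MB matches the HDFS block size discussed above.
    val partSize = 64L * 1024 * 1024
    sc.hadoopConfiguration.set("fs.s3a.multipart.size", partSize.toString)
    // Files above this threshold are uploaded in parts of the size above.
    sc.hadoopConfiguration.set("fs.s3a.multipart.threshold", partSize.toString)
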
On Sat, Jul 11, 2015 at 11:14 AM, Steve wrote:
seek() is very, very expensive on S3, even for short forward seeks. If your
code does a lot of them, it will kill performance. (forward seeks are better
in s3a, which as of Hadoop 2.7 is now something safe to use, and in the S3
client that Amazon includes in EMR), but it's still sluggish.
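
To make the seek() point concrete, here is an illustration-only sketch of
the access pattern to avoid; the bucket and file names are made up, and it
assumes an existing SparkContext `sc`:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Scattered seek+read: the pattern that is slow on S3, since each seek
    // on the older s3/s3n clients can force a fresh HTTP request.
    val fs = FileSystem.get(new URI("s3a://my-bucket/"), sc.hadoopConfiguration)
    val in = fs.open(new Path("s3a://my-bucket/data.bin"))
    val buf = new Array[Byte](1024)
    for (offset <- Seq(0L, 1L << 20, 1L << 24)) {
      in.seek(offset)
      in.read(buf)
    }
    in.close()
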
The other killers
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://<master>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
I just tested our application this way yesterday.
Latency is much higher for S3 (if that matters).
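
For anyone repeating the test, a rough spark-shell sketch along these lines
(the bucket name and <master> host are placeholders; count() just forces the
read):

    // Crude wall-clock timing; good enough for a first S3-vs-HDFS comparison.
    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.1f s")
      result
    }

    time("S3 read")   { sc.textFile("s3a://my-bucket/sample.txt").count() }
    time("HDFS read") { sc.textFile("hdfs://<master>:9000/sample.txt").count() }
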
And with HDFS you'd get data locality, which will boost your app's performance.
I did some light experimenting on this. See my presentation here for some
benchmark numbers etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
(from slide 34)
cheers
Sujee
S3 will obviously add network latency, whereas with HDFS, if your Spark
executors are running on the same data nodes, you have the advantage of data
locality.
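
A quick way to see that locality from spark-shell (a sketch; the HDFS path
is a placeholder):

    // preferredLocations lists the hosts Spark would like to schedule each
    // partition on: for HDFS input these are the datanodes holding the
    // blocks, while for S3 input the lists come back empty.
    val rdd = sc.textFile("hdfs://<master>:9000/sample.txt")
    rdd.partitions.take(3).foreach { p =>
      println(rdd.preferredLocations(p))
    }
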
Thanks
Best Regards
On Thu, Jul 9, 2015 at 12:05 PM, Brandon White wrote:
Are there any significant performance differences between reading text
files from S3 and HDFS?