Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS block size (64 MB, I

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM, Steve
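To make the alignment idea concrete, here is a minimal sketch of how a file could be cut into fixed-size parts before a multi-part upload. The function name `part_ranges` and the constant `BLOCK_SIZE` are hypothetical, and the 64 MB figure is simply the value mentioned in the message above, not a verified default for any particular cluster.

```python
# Assumption: 64 MB is taken from the message above; check your own
# cluster's configured block size before relying on it.
BLOCK_SIZE = 64 * 1024 * 1024


def part_ranges(file_size, part_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover file_size bytes.

    Every part has the same size except possibly the last one, so part
    boundaries fall on multiples of part_size.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (offset, min(part_size, file_size - offset))
        for offset in range(0, file_size, part_size)
    ]


if __name__ == "__main__":
    # A hypothetical 150 MB file split into 64 MB parts.
    for offset, length in part_ranges(150 * 1024 * 1024):
        print(offset, length)
```

Each (offset, length) pair would then be uploaded as one part, so a reader assigned to one block-sized split reads from a single part.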

Re: S3 vs HDFS

2015-07-11 Thread Steve Loughran
seek() is very, very expensive on S3, even for short forward seeks. If your code does a lot of them, it will kill performance. Forward seeks are better in s3a (which with Hadoop 2.3 is now something safe to use) and in the S3 client that Amazon includes in EMR, but it's still sluggish. The other killers

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this way yesterda
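A comparison along the lines suggested above could be timed with a small helper like the following. This is only a sketch: `time_read` is a hypothetical name, the message itself uses spark-shell rather than Python, and the bucket, host, and paths in the comments are placeholders, not values from the thread.

```python
import time


def time_read(read_fn):
    """Call read_fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = read_fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage inside a pyspark shell, where `sc` is the SparkContext
# the shell provides (placeholder bucket, host, and paths):
#
#   s3_count, s3_secs = time_read(
#       lambda: sc.textFile("s3n://<your-bucket>/<path>").count())
#   hdfs_count, hdfs_secs = time_read(
#       lambda: sc.textFile("hdfs://<master-host>:9000/<path>").count())
#   print(s3_secs, hdfs_secs)
```

Running each read more than once helps separate a cold first read from steady-state behaviour.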

Re: S3 vs HDFS

2015-07-09 Thread Sujee Maniyam
Latency is much higher for S3 (if that matters), and with HDFS you'd get data locality, which will boost your app's performance. I did some light experimenting on this; see my presentation for some benchmark numbers: http://www.slideshare.net/sujee/hadoop-to-sparkv2 (from slide 34). cheers Suj

Re: S3 vs HDFS

2015-07-09 Thread Akhil Das
S3 will obviously add network lag, whereas with HDFS, if your Spark executors are running on the same nodes as the data, you have the advantage of data locality. Thanks Best Regards On Thu, Jul 9, 2015 at 12:05 PM, Brandon White wrote: > Are there any significant performance differences between reading

S3 vs HDFS

2015-07-08 Thread Brandon White
Are there any significant performance differences between reading text files from S3 and from HDFS?