Hi all, I'm consistently finding that reading from HDFS is not appreciably faster than reading from S3 using pyspark. How can I tell whether data locality is being respected?
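For what it's worth, one thing I can check is whether Spark even knows where the HDFS blocks live. Below is a minimal sketch of that check; it goes through PySpark's internal _jrdd handle, which is not a public API and may differ between versions:

# Minimal sketch: print the hosts Spark considers "preferred" for each
# input partition of an HDFS-backed RDD (internal API, may change).
pages = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
scala_rdd = pages._jrdd.rdd()           # underlying Scala RDD
for part in scala_rdd.partitions():     # Array[Partition] via py4j
    print scala_rdd.preferredLocations(part)   # e.g. List(ip-10-0-0-1, ...)

If those hostnames match the worker nodes, then locality is at least possible; whether it was actually achieved shows up in the "Locality Level" column (PROCESS_LOCAL / NODE_LOCAL / ANY) on the stage page of the Spark UI on the driver, port 4040.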
In this example, reading from HDFS is only about 10% faster than reading the same file from S3. The files were copied from S3 to HDFS with S3DistCp. (The file size is slightly smaller on HDFS, but let's ignore that for now.) This was run on an EMR cluster, but I have seen the same effect using the spark-ec2 script.

pageshdfs = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
pagess3 = sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')

t = datetime.now(); pageshdfs.count(); datetime.now() - t
5056418
datetime.timedelta(0, 22, 123156)

t = datetime.now(); pagess3.count(); datetime.now() - t
5324499
datetime.timedelta(0, 24, 544198)

(Script: s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb, ami-version 3.1.0)

--
Martin Goodson | VP Data Science
(0)20 3397 1240