Hi all,
I'm consistently finding that reading from HDFS is not appreciably faster
than reading from S3 using PySpark. How can I tell whether data locality is
being respected?
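
The closest I've got to checking this myself is reaching into the JVM
through py4j for the HDFS RDD defined below (pageshdfs), roughly as
follows. getPreferredLocs is a developer API on the Scala SparkContext; the
_jsc/_jrdd unwrapping is my guess at the plumbing, so treat this as a
sketch rather than something verified:

scala_sc = sc._jsc.sc()            # underlying Scala SparkContext
scala_rdd = pageshdfs._jrdd.rdd()  # unwrap the PySpark RDD to the Scala RDD
for i in range(len(scala_rdd.partitions())):
    # hosts holding a local replica of partition i (empty = no locality info)
    print i, scala_sc.getPreferredLocs(scala_rdd, i)

If those hosts match the worker hostnames, the blocks at least have local
replicas; the Locality Level column on the stage detail page of the web UI
(port 4040) should then show whether tasks actually ran NODE_LOCAL or fell
back to ANY.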

In this example, reading from HDFS is only about 10% faster than reading
the same file from S3. The files were copied from S3 using S3DistCp. (The
file size is slightly smaller on HDFS, but let's ignore that for now.) This
was run on an EMR cluster, but I have seen the same effect using the
spark-ec2 script.


from datetime import datetime

# Same hour of page data: one copy on HDFS (via S3DistCp), one read from S3
pageshdfs = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
pagess3 = sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')

t = datetime.now(); pageshdfs.count(); datetime.now() - t
5056418
datetime.timedelta(0, 22, 123156)


t = datetime.now(); pagess3.count(); datetime.now() - t
5324499
datetime.timedelta(0, 24, 544198)
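
One follow-up I'm considering, to separate raw read time from per-record
counting overhead: cache the HDFS copy and count it twice, so the second
count measures compute only (just a sketch, no numbers collected yet):

pageshdfs.cache()
pageshdfs.count()  # first pass: reads from HDFS and populates the cache
t = datetime.now(); pageshdfs.count(); datetime.now() - t  # cached pass, no HDFS I/O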

(Spark was installed with
s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb on AMI
version 3.1.0.)


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240
