Hi all, I'm consistently finding that reading from HDFS is not appreciably faster than reading from S3 using pyspark. How can I tell whether data locality is being respected?
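For what it's worth, one thing I can check is whether Spark even knows where the HDFS blocks live. Below is a minimal sketch of that check; it goes through PySpark's internal _jrdd handle, which is not a public API and may differ between versions:

# Minimal sketch: print the hosts Spark considers "preferred" for each
# input partition of an HDFS-backed RDD (internal API, may change).
pages = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
scala_rdd = pages._jrdd.rdd()           # underlying Scala RDD
for part in scala_rdd.partitions():     # Array[Partition] via py4j
    print scala_rdd.preferredLocations(part)   # e.g. List(ip-10-0-0-1, ...)

If those hostnames match the worker nodes, then locality is at least possible; whether it was actually achieved shows up in the "Locality Level" column (PROCESS_LOCAL / NODE_LOCAL / ANY) on the stage page of the Spark UI on the driver, port 4040.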
In this example, reading from HDFS is only about 10% faster than reading the same file from S3. The files were copied from S3 to HDFS with S3DistCp. (The file size is slightly smaller on HDFS, but let's ignore that for now.) This was run on an EMR cluster, but I have seen the same effect using the spark-ec2 script.

pageshdfs = sc.textFile('hdfs:///pages/year=2014/month=05/day=01/hour=0000/*')
pagess3 = sc.textFile('s3n://BUCKETNAME/pages/year=2014/month=05/day=01/hour=0000/*')

t = datetime.now(); pageshdfs.count(); datetime.now() - t
5056418
datetime.timedelta(0, 22, 123156)

t = datetime.now(); pagess3.count(); datetime.now() - t
5324499
datetime.timedelta(0, 24, 544198)

(Script: s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb, ami-version 3.1.0)

--
Martin Goodson | VP Data Science
(0)20 3397 1240