Hi Gary,
I gave this a shot on a test cluster of CDH4.7 and actually saw a
regression in performance when running the numbers. Have you done any
benchmarking? Below are my numbers:
Experimental method:
1. Write 14GB of data to HDFS via [1]
2. Read data multiple times via [2]
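For anyone who wants to repeat step 2 without fishing timings out of the logs, here is a minimal sketch of a timing helper. The names `Bench` and `timeRuns` are made up for illustration; they are not part of Spark.

```scala
object Bench {
  /** Runs `body` `n` times and returns each wall-clock duration in seconds. */
  def timeRuns[A](n: Int)(body: => A): Seq[Double] =
    (1 to n).map { _ =>
      val start = System.nanoTime()
      body // result is discarded; only the elapsed time is kept
      (System.nanoTime() - start) / 1e9
    }
}
```

In spark-shell this could be used as `Bench.timeRuns(4) { sc.textFile("hdfs:///tmp/output").count }`, assuming the data from step 1 is already in place.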
*Experiment 1: run on virtual machines*
With short-circuit read *disabled*:
14/09/24 15:10:49 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 344.931469949 s
14/09/24 15:11:30 INFO spark.SparkContext: Job finished: count at
<console>:13, took 18.601568871 s
14/09/24 15:11:54 INFO spark.SparkContext: Job finished: count at
<console>:13, took 16.531909024 s
14/09/24 15:12:18 INFO spark.SparkContext: Job finished: count at
<console>:13, took 17.639692651 s
14/09/24 15:12:38 INFO spark.SparkContext: Job finished: count at
<console>:13, took 16.773438345 s
With short-circuit read *enabled*:
14/09/24 14:28:38 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 299.511103592 s
14/09/24 14:29:17 INFO spark.SparkContext: Job finished: count at
<console>:13, took 22.459146194 s
14/09/24 14:29:44 INFO spark.SparkContext: Job finished: count at
<console>:13, took 19.806642815 s
14/09/24 14:30:11 INFO spark.SparkContext: Job finished: count at
<console>:13, took 20.284644308 s
14/09/24 14:30:40 INFO spark.SparkContext: Job finished: count at
<console>:13, took 21.720455219 s
My summary here is that enabling short-circuit read caused the write to go
faster (what?) and caused a slight decrease in read performance, from
~17sec to ~21sec on average.
The VMs were backed by FusionIO drives but I thought maybe there was
something funky with the VMs so switched to bare hardware in a second
experiment.
*Experiment 2: run on bare hardware*
With short-circuit read *disabled*:
14/09/24 15:59:11 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 1605.965203162 s
14/09/24 15:59:39 INFO spark.SparkContext: Job finished: count at
<console>:13, took 11.984355461 s
14/09/24 16:00:00 INFO spark.SparkContext: Job finished: count at
<console>:13, took 11.134712764 s
14/09/24 16:00:11 INFO spark.SparkContext: Job finished: count at
<console>:13, took 8.694292372 s
14/09/24 16:00:24 INFO spark.SparkContext: Job finished: count at
<console>:13, took 9.83986823 s
With short-circuit read *enabled*:
14/09/24 16:23:14 INFO spark.SparkContext: Job finished: saveAsTextFile at
<console>:13, took 1113.897715871 s
14/09/24 16:25:19 INFO spark.SparkContext: Job finished: count at
<console>:13, took 14.249690605 s
14/09/24 16:25:47 INFO spark.SparkContext: Job finished: count at
<console>:13, took 12.67330165 s
14/09/24 16:26:04 INFO spark.SparkContext: Job finished: count at
<console>:13, took 10.673825924 s
14/09/24 16:26:19 INFO spark.SparkContext: Job finished: count at
<console>:13, took 9.722516379 s
This is separate hardware, so the absolute numbers are not comparable with
experiment 1 (the difference is not just the removed VM overhead).
Again, the writes are much faster (1605s -> 1113s) but the reads are
comparable if not slightly slower (~10.4s -> ~11.8s).
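The averages can be checked with a few lines of Scala. This is only a sketch: the durations are the `count` timings copied from the logs in both experiments, and `ReadBenchmark`, `mean` and `slowdown` are names invented for this example.

```scala
object ReadBenchmark {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // count() durations in seconds, copied from the log lines in this mail
  val vmDisabled = Seq(18.601568871, 16.531909024, 17.639692651, 16.773438345)
  val vmEnabled  = Seq(22.459146194, 19.806642815, 20.284644308, 21.720455219)
  val hwDisabled = Seq(11.984355461, 11.134712764, 8.694292372, 9.83986823)
  val hwEnabled  = Seq(14.249690605, 12.67330165, 10.673825924, 9.722516379)

  /** Relative change of the mean read time; 0.10 means 10% slower. */
  def slowdown(before: Seq[Double], after: Seq[Double]): Double =
    mean(after) / mean(before) - 1.0
}
```

With these inputs the means come out at about 17.4s vs 21.1s on the VMs and 10.4s vs 11.8s on bare hardware, i.e. reads are roughly 21% and 14% slower with short-circuit read enabled. With only four runs per configuration that is a rough indication, not a rigorous result.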
To make sure that short circuit reads were actually working I looked at the
datanode logs and saw the following line. I think this confirms that a)
the read was local (127.0.0.1 -> 127.0.0.1) from Spark and b) short-circuit
read was successfully used ("success: true").
hadoop-datanode-mybox.local.log:2014-09-24 16:26:52,800 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid:
-312380305519226759, srvID: DS-96112752-10.201.12.105-50010-1411586696381,
success: true
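To check more than one log line by hand, something like the following could be used. It is a rough sketch that assumes the clienttrace format is exactly the comma-separated "key: value" layout shown above; `ClientTrace` and its methods are hypothetical helpers, not a Hadoop API.

```scala
object ClientTrace {
  // Matches "key: value" pairs such as "op: REQUEST_SHORT_CIRCUIT_FDS".
  // The lookahead stops "clienttrace: src:" from being read as a pair.
  private val Field = """(\w+): ([^,:\s]+)(?=,|\s|$)""".r

  /** Extracts the key/value fields from one clienttrace log line. */
  def fields(line: String): Map[String, String] =
    Field.findAllMatchIn(line).map(m => m.group(1) -> m.group(2)).toMap

  /** True if the line records a successful short-circuit file descriptor request. */
  def isSuccessfulShortCircuit(line: String): Boolean = {
    val f = fields(line)
    f.get("op") == Some("REQUEST_SHORT_CIRCUIT_FDS") &&
      f.get("success") == Some("true")
  }
}
```

Counting the lines of the datanode log for which `isSuccessfulShortCircuit` returns true during a read run would show whether most block reads went through the short-circuit path or only a few.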
Has anyone actually deployed this feature and benchmarked gains? I was
hoping to throw this switch on my clusters and get a 30% perf boost but in
practice that has not materialized.
Cheers!
Andrew
[1] sc.parallelize(1 to (14*1024*1024)).map(k =>
      Seq(k, org.apache.commons.lang.RandomStringUtils.random(1024,
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWxyZ0123456789")).mkString("|")
    ).saveAsTextFile("hdfs:///tmp/output")
[2] sc.textFile("hdfs:///tmp/output").count
On Wed, Sep 17, 2014 at 11:19 AM, Matei Zaharia wrote:
> I'm pretty sure it does help, though I don't have any numbers for it. In
> any case, Spark will automatically benefit from this if you link it to a
> version of HDFS that contains this.
>
> Matei
>
> On September 17, 2014 at 5:15:47 AM, Gary Malouf wrote:
>
> Cloudera had a blog post about this in August 2013:
> http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
>
> Has anyone been using this in production? I'm curious whether it made a
> significant difference from a Spark perspective.
>
>