On 26 Oct 2015, at 11:21, Sean Owen <so...@cloudera.com> wrote:

Yeah, are these stats actually reflecting data read locally, e.g. through the
loopback interface? I'm no expert on the internals here either, but this may
effectively be measuring local reads. Or are you sure it's not?


HDFS stats are really just the general filesystem stats: they count the bytes
that pass through the input and output streams, not whether those bytes went
to or came from local or remote machines. That's fixable, and more metrics are
always good, though as Hadoop (currently) uses Hadoop Metrics2 rather than the
Codahale APIs, it's not seamless to glue them into the Spark context's metric
registry.
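
For reference, here's a minimal sketch (mine, not from the thread) of dumping
those per-scheme counters from a running JVM with Hadoop's FileSystem API.
Note that Statistics only exposes aggregate byte and op counts; nothing in it
splits local from remote IO:

    import org.apache.hadoop.fs.FileSystem
    import scala.collection.JavaConverters._

    // Dump the aggregate counters Hadoop keeps per filesystem scheme.
    // bytesRead/bytesWritten are totals across all streams; nothing
    // here distinguishes local reads from remote ones.
    FileSystem.getAllStatistics.asScala.foreach { stats =>
      println(s"${stats.getScheme}: read=${stats.getBytesRead} B, " +
        s"written=${stats.getBytesWritten} B, readOps=${stats.getReadOps}")
    }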

On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote:
>
> Replication factor is 3 and we have 18 data nodes. We checked the HDFS web
> UI; data is evenly distributed among the 18 machines.
>


Every block in HDFS (typically 64, 128, or 256 MB) is replicated across three
machines, meaning 3 machines have it local and the other 15 have to read it
remotely. So a task placed on a random node has only a 3/18 (1 in 6) chance of
having a given block local.

For data locality to work properly, you need the executors to be reading the
blocks of data local to them, not data from other parts of the file. Spark
does try to schedule for locality, but if there's only a limited set of
executors, more of the workload ends up remote rather than local.
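
To illustrate (my sketch, not part of the original mail): how long the
scheduler waits for a local slot before falling back is tunable via the real
spark.locality.wait setting; the value below is arbitrary:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: give the scheduler longer to find a node-local executor
    // before it falls back to rack-local or arbitrary placement.
    // 10s is illustrative only; the default is 3s.
    val conf = new SparkConf()
      .setAppName("LocalityDemo")
      .set("spark.locality.wait", "10s")
    val sc = new SparkContext(conf)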

I don't know of an obvious way to get local-vs-remote metrics here; I don't
see the HDFS client library tracking that, though it would be the natural
place to collect stats on local, remote, and domain-socket (short-circuit) IO.
Does anyone know of anything in the Spark metrics which tracks placement
locality? If not, both layers could have some more metrics added.
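
In the meantime, one workaround sketch (my own; the class name is made up) is
to count finished tasks by the locality level the scheduler reports, via a
SparkListener:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable

    // Hypothetical helper: tally completed tasks by locality level
    // (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY) as a rough proxy
    // for how much of the workload ran local vs remote.
    class LocalityCountListener extends SparkListener {
      private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
        counts(taskEnd.taskInfo.taskLocality.toString) += 1
      }

      def snapshot: Map[String, Long] = synchronized { counts.toMap }
    }

    // Usage: sc.addSparkListener(new LocalityCountListener)

This only shows where tasks were scheduled, not the bytes actually read
locally vs remotely, but it's a reasonable proxy while the HDFS counters
don't split by locality.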

