Hi, I have already tried the same code with Spark 1.3.1, and there is no such problem. The configuration files were all copied directly from Spark 1.5.1. I believe it is a bug in Spark 1.5.1.
Thanks a lot for your response.

On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <so...@cloudera.com> wrote:

> Yeah, are these stats actually reflecting data read locally, like through
> the loopback interface? I'm also no expert on the internals here, but this
> may be measuring effectively local reads. Or are you sure it's not?
>
> On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote:
>> >
>> > Replication factor is 3 and we have 18 data nodes. We checked the HDFS
>> > web UI; data is evenly distributed among the 18 machines.
>>
>> Every block in HDFS (usually 64, 128, or 256 MB) is stored on three
>> machines, meaning 3 machines have it local and 15 have it remote.
>>
>> For data locality to work properly, you need the executors to be reading
>> the blocks of data local to them, not data from other parts of the
>> files. Spark does try to honor locality, but if there's only a limited
>> set of executors, then more of the workload is remote rather than local.
>>
>> I don't know of an obvious way to get metrics on local vs. remote reads
>> here; I don't see the HDFS client library tracking that, though it would
>> be the natural place to collect stats on local/remote/domain-socket-direct
>> IO. Does anyone know of somewhere in the Spark metrics that tracks
>> placement locality? If not, both layers could have some more metrics
>> added.
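As a rough proxy from the Spark side: each finished task reports its locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY), and a SparkListener can tally those. This is not a byte-level local-vs-remote read measurement, just an indication of how often tasks were scheduled next to their data. A minimal sketch, assuming the Spark 1.5.x listener API; the class name and snapshot helper are mine:

    import scala.collection.mutable
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical listener: counts finished tasks by locality level.
    // A large ANY/RACK_LOCAL share suggests many remote reads;
    // NODE_LOCAL/PROCESS_LOCAL means the executor was co-located
    // with a replica of the block it read.
    class LocalityTally extends SparkListener {
      private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
        counts(taskEnd.taskInfo.taskLocality.toString) += 1
      }

      def snapshot: Map[String, Long] = synchronized { counts.toMap }
    }

    // Usage, given an existing SparkContext sc:
    //   val tally = new LocalityTally
    //   sc.addSparkListener(tally)
    //   ... run the job ...
    //   println(tally.snapshot)

The same locality level is also shown per task in the "Locality Level" column of the Spark web UI's stage page, which may be enough to spot the pattern without any code.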