Hi, I have already tried the same code with Spark 1.3.1, and there is no such problem. The configuration files were all copied directly from Spark 1.5.1. I believe it is a bug in Spark 1.5.1.
Thanks a lot for your response.

On Mon, Oct 26, 2015 at 7:21 PM Sean Owen <so...@cloudera.com> wrote:

> Yeah, are these stats actually reflecting data read locally, like through
> the loopback interface? I'm also no expert on the internals here, but this
> may be measuring effectively local reads. Or are you sure it's not?
>
> On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> > On 26 Oct 2015, at 09:28, Jinfeng Li <liji...@gmail.com> wrote:
>> >
>> > Replication factor is 3 and we have 18 data nodes. We checked the HDFS
>> > web UI; data is evenly distributed among the 18 machines.
>>
>> Every block in HDFS (usually 64, 128, or 256 MB) is stored on three
>> machines, meaning 3 machines have it local and 15 have it remote.
>>
>> For data locality to work properly, you need the executors to be reading
>> the blocks of data local to them, not data from other parts of the
>> files. Spark does try to honor locality, but if there's only a limited
>> set of executors, then more of the workload is remote rather than local.
>>
>> I don't know of an obvious way to get metrics on local vs. remote reads
>> here; I don't see the HDFS client library tracking that, though it would
>> be the natural place to collect stats on local/remote/domain-socket-direct
>> IO. Does anyone know of somewhere in the Spark metrics that tracks
>> placement locality? If not, both layers could have some more metrics
>> added.
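As a rough proxy from the Spark side: each finished task reports its locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY), and a SparkListener can tally those. This is not a byte-level local-vs-remote read measurement, just an indication of how often tasks were scheduled next to their data. A minimal sketch, assuming the Spark 1.5.x listener API; the class name and snapshot helper are mine:

    import scala.collection.mutable
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical listener: counts finished tasks by locality level.
    // A large ANY/RACK_LOCAL share suggests many remote reads;
    // NODE_LOCAL/PROCESS_LOCAL means the executor was co-located
    // with a replica of the block it read.
    class LocalityTally extends SparkListener {
      private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
        counts(taskEnd.taskInfo.taskLocality.toString) += 1
      }

      def snapshot: Map[String, Long] = synchronized { counts.toMap }
    }

    // Usage, given an existing SparkContext sc:
    //   val tally = new LocalityTally
    //   sc.addSparkListener(tally)
    //   ... run the job ...
    //   println(tally.snapshot)

The same locality level is also shown per task in the "Locality Level" column of the Spark web UI's stage page, which may be enough to spot the pattern without any code.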