The replication factor is 3 and we have 18 data nodes. Checking the HDFS web UI, the data is evenly distributed across all 18 machines.
On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:

> Have a look at your HDFS replication, and where the blocks are for these
> files. For example, if you had only 2 HDFS data nodes, then data would be
> remote to 16 of 18 workers and would always entail a copy.
>
> On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
>
>> I cat /proc/net/dev and take the difference of the received-bytes
>> counters before and after the job. I also see a long sustained peak
>> (nearly 600 Mb/s) in the nload interface. We have 18 machines and each
>> machine receives 4.7 GB.
>>
>> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>>
>>> -dev +user
>>>
>>> How are you measuring network traffic?
>>>
>>> It is not in general true that there will be zero network traffic,
>>> since not all executors are local to all data. That can be the
>>> situation in many cases, but not always.
>>>
>>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>
>>>> Hi, I find that loading files from HDFS can incur a huge amount of
>>>> network traffic. The input size is 90 GB and the network traffic is
>>>> about 80 GB. By my understanding, local blocks should be read, so no
>>>> network communication is needed.
>>>>
>>>> I use Spark 1.5.1, and the following is my code:
>>>>
>>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>>> textRDD.count
>>>>
>>>> Jeffrey
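The measurement described in the thread (diffing the received-bytes counters in /proc/net/dev before and after the job) can be sketched as below. This is a minimal sketch, assuming the standard Linux /proc/net/dev layout (two header lines, then one line per interface of the form "iface: rx_bytes rx_packets ..."); the `NetDelta` object and its method names are hypothetical, not from the thread.

```scala
// Hypothetical helper sketching the measurement from the thread:
// snapshot the contents of /proc/net/dev before and after a job,
// then diff the total received-byte counters.
object NetDelta {
  // Total received bytes across all non-loopback interfaces,
  // parsed from the text contents of /proc/net/dev.
  def receivedBytes(procNetDev: String): Long =
    procNetDev
      .split("\n")
      .drop(2)                          // skip the two header lines
      .map(_.trim)
      .filter(l => l.nonEmpty && l.contains(":"))
      .filterNot(_.startsWith("lo:"))   // ignore loopback traffic
      .map { line =>
        // after "iface:", the first counter is received bytes
        val counters = line.split(":", 2)(1).trim.split("\\s+")
        counters(0).toLong
      }
      .sum

  // Bytes received between two snapshots (before/after the job).
  def delta(before: String, after: String): Long =
    receivedBytes(after) - receivedBytes(before)
}
```

With per-machine deltas of about 4.7 GB on each of the 18 machines, the total (roughly 85 GB) is consistent with the ~80 GB of traffic reported for the 90 GB input.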