The replication factor is 3 and we have 18 data nodes. Checking the HDFS web UI, the data is evenly distributed across all 18 machines.
On Mon, Oct 26, 2015 at 5:18 PM Sean Owen <so...@cloudera.com> wrote:

> Have a look at your HDFS replication, and where the blocks are for these
> files. For example, if you had only 2 HDFS data nodes, then data would be
> remote to 16 of 18 workers and would always entail a copy.
>
> On Mon, Oct 26, 2015 at 9:12 AM, Jinfeng Li <liji...@gmail.com> wrote:
>
>> I cat /proc/net/dev and take the difference of the received-bytes
>> counters before and after the job. I also see a long sustained peak
>> (nearly 600 Mb/s) in the nload interface. We have 18 machines and each
>> machine receives 4.7 GB.
>>
>> On Mon, Oct 26, 2015 at 5:00 PM Sean Owen <so...@cloudera.com> wrote:
>>
>>> -dev +user
>>>
>>> How are you measuring network traffic?
>>>
>>> It is not in general true that there will be zero network traffic,
>>> since not all executors are local to all data. That can be the
>>> situation in many cases, but not always.
>>>
>>> On Mon, Oct 26, 2015 at 8:57 AM, Jinfeng Li <liji...@gmail.com> wrote:
>>>
>>>> Hi, I find that loading files from HDFS can incur a huge amount of
>>>> network traffic. The input size is 90 GB and the network traffic is
>>>> about 80 GB. By my understanding, local blocks should be read, so no
>>>> network communication is needed.
>>>>
>>>> I use Spark 1.5.1, and the following is my code:
>>>>
>>>> val textRDD = sc.textFile("hdfs://master:9000/inputDir")
>>>> textRDD.count
>>>>
>>>> Jeffrey
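The measurement described in the thread (diffing the received-bytes counters in /proc/net/dev before and after the job) can be sketched as below. This is a minimal sketch, assuming the standard Linux /proc/net/dev layout (two header lines, then one line per interface of the form "iface: rx_bytes rx_packets ..."); the `NetDelta` object and its method names are hypothetical, not from the thread.

```scala
// Hypothetical helper sketching the measurement from the thread:
// snapshot the contents of /proc/net/dev before and after a job,
// then diff the total received-byte counters.
object NetDelta {
  // Total received bytes across all non-loopback interfaces,
  // parsed from the text contents of /proc/net/dev.
  def receivedBytes(procNetDev: String): Long =
    procNetDev
      .split("\n")
      .drop(2)                          // skip the two header lines
      .map(_.trim)
      .filter(l => l.nonEmpty && l.contains(":"))
      .filterNot(_.startsWith("lo:"))   // ignore loopback traffic
      .map { line =>
        // after "iface:", the first counter is received bytes
        val counters = line.split(":", 2)(1).trim.split("\\s+")
        counters(0).toLong
      }
      .sum

  // Bytes received between two snapshots (before/after the job).
  def delta(before: String, after: String): Long =
    receivedBytes(after) - receivedBytes(before)
}
```

With per-machine deltas of about 4.7 GB on each of the 18 machines, the total (roughly 85 GB) is consistent with the ~80 GB of traffic reported for the 90 GB input.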