The log file hadoop-mithila-datanode-node19.log.2009-04-14 contains the following:
2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = node19/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.18.3
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
************************************************************/
2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).

Hmmm, I still can't figure it out..

Mithila
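Two things stand out in the log above: the DataNode reports its own host as node19/127.0.0.1, and the NameNode at node18:54310 never becomes reachable. A quick sanity check from the slave, assuming telnet and getent are installed on these machines, might look like:

[mith...@node19:~]$ getent hosts node18 node19   # both should resolve to 192.168.0.x addresses, not 127.0.0.1
[mith...@node19:~]$ telnet node18 54310          # should connect if the NameNode is up and the port is not blocked

If telnet hangs, or getent maps either hostname to 127.0.0.1, the problem is likely in /etc/hosts or a firewall rather than in Hadoop itself.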
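For what it's worth, a gateway by itself shouldn't change the port: the slaves connect to whatever fs.default.name in conf/hadoop-site.xml points at, so the requirement is only that every node can resolve and reach that exact host:port. A minimal sketch of the relevant entry (the value here assumes this cluster's master and port):

<property>
  <name>fs.default.name</name>
  <value>hdfs://node18:54310</value>
</property>

The same value has to appear on the master and on every slave.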
>
> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra <[email protected]> wrote:
>
>> Aaron: Which log file do I look into? There are a lot of them. Here's
>> what the error looks like:
>>
>> [mith...@node19:~]$ cd hadoop
>> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
>> Bad connection to FS. command aborted.
>>
>> Node19 is a slave and Node18 is the master.
>>
>> Mithila
>>
>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball <[email protected]> wrote:
>>
>>> Are there any error messages in the log files on those nodes?
>>> - Aaron
>>>
>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra <[email protected]> wrote:
>>>
>>>> I've drawn a blank here! Can't figure out what's wrong with the ports.
>>>> I can ssh between the nodes but can't access the DFS from the slaves;
>>>> it says "Bad connection to DFS". The master seems to be fine.
>>>> Mithila
>>>>
>>>> On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra <[email protected]> wrote:
>>>>
>>>>> Yes I can..
>>>>>
>>>>> On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky <[email protected]> wrote:
>>>>>
>>>>>> Can you ssh between the nodes?
>>>>>>
>>>>>> -jim
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Aaron.
>>>>>>> Jim: The three clusters I set up had Ubuntu running on them, and the
>>>>>>> DFS was accessed at port 54310. The new cluster which I've set up
>>>>>>> has Red Hat Linux release 7.2 (Enigma) running on it. Now when I try
>>>>>>> to access the DFS from one of the slaves, I get the following
>>>>>>> response: dfs cannot be accessed. When I access the DFS through the
>>>>>>> master there's no problem, so I feel there's a problem with the
>>>>>>> port. Any ideas? I did check the list of slaves; it looks fine to
>>>>>>> me.
>>>>>>>
>>>>>>> Mithila
>>>>>>>
>>>>>>> On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <[email protected]> wrote:
>>>>>>>
>>>>>>>> Mithila,
>>>>>>>>
>>>>>>>> You said all the slaves were being utilized in the 3-node cluster.
>>>>>>>> Which application did you run to test that, and what was your input
>>>>>>>> size? If you tried the word count application on a 516 MB input
>>>>>>>> file on both cluster setups, then some of your nodes in the 15-node
>>>>>>>> cluster may not be running at all. Generally, one map task is
>>>>>>>> assigned to each input split, and if you are running your cluster
>>>>>>>> with the defaults, the splits are 64 MB each; 516 MB comes to just
>>>>>>>> 9 splits, so at most 9 of your 15 nodes can receive any map work at
>>>>>>>> all. I got confused when you said the Namenode seemed to do all the
>>>>>>>> work. Can you check conf/slaves and make sure you put the names of
>>>>>>>> all the task trackers there? I also suggest comparing both clusters
>>>>>>>> with a larger input size, say at least 5 GB, to really see a
>>>>>>>> difference.
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>
>>>>>>>> On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> In hadoop-*-examples.jar, use "randomwriter" to generate the data
>>>>>>>>> and "sort" to sort it.
>>>>>>>>> - Aaron
>>>>>>>>>
>>>>>>>>> On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Your data is too small, I guess, for 15 nodes, so the overhead of
>>>>>>>>>> the extra nodes might be making your MR jobs take more time in
>>>>>>>>>> total. I guess you will have to try with a larger set of data.
>>>>>>>>>>
>>>>>>>>>> Pankil
>>>>>>>>>>
>>>>>>>>>> On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Aaron
>>>>>>>>>>>
>>>>>>>>>>> That could be the issue; my data is just 516 MB. Wouldn't this
>>>>>>>>>>> still see a bit of a speedup? Could you guide me to the example?
>>>>>>>>>>> I'll run my cluster on it and see what I get. Also, for my
>>>>>>>>>>> program I had a Java timer running to record the time taken to
>>>>>>>>>>> complete execution. Does Hadoop have an inbuilt timer?
>>>>>>>>>>>
>>>>>>>>>>> Mithila
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Virtually none of the examples that ship with Hadoop are
>>>>>>>>>>>> designed to showcase its speed. Hadoop's speedup comes from its
>>>>>>>>>>>> ability to process very large volumes of data (starting around,
>>>>>>>>>>>> say, tens of GB per job, and going up in orders of magnitude
>>>>>>>>>>>> from there). So if you are timing the pi calculator (or
>>>>>>>>>>>> something like that), its results won't necessarily be very
>>>>>>>>>>>> consistent. If a job doesn't have enough fragments of data to
>>>>>>>>>>>> allocate one per node, some of the nodes will also just go
>>>>>>>>>>>> unused.
>>>>>>>>>>>>
>>>>>>>>>>>> The best example for you to run is to use randomwriter to fill
>>>>>>>>>>>> up your cluster with several GB of random data and then run the
>>>>>>>>>>>> sort program. If that doesn't scale up performance from 3 nodes
>>>>>>>>>>>> to 15, then you've definitely got something strange going on.
>>>>>>>>>>>>
>>>>>>>>>>>> - Aaron
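The runs Aaron describes would look roughly like this (the jar name assumes the 0.18.3 build shown in the log above; the HDFS paths are made up):

[mith...@node19:~/hadoop]$ bin/hadoop jar hadoop-0.18.3-examples.jar randomwriter rand-data
[mith...@node19:~/hadoop]$ bin/hadoop jar hadoop-0.18.3-examples.jar sort rand-data rand-sorted

randomwriter fills rand-data with several GB of random records spread across the cluster, and sort then reads and sorts them, which should exercise every node; both examples also print the elapsed job time when they finish, which may cover the timer question as well.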
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey all
>>>>>>>>>>>>> I recently set up a three-node Hadoop cluster and ran an
>>>>>>>>>>>>> example on it. It was pretty fast, and all three nodes were
>>>>>>>>>>>>> being used (I checked the log files to make sure that the
>>>>>>>>>>>>> slaves were utilized).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I've set up another cluster consisting of 15 nodes. I ran
>>>>>>>>>>>>> the same example, but instead of speeding up, the map-reduce
>>>>>>>>>>>>> task seems to take forever! The slaves are not being used for
>>>>>>>>>>>>> some reason. This second cluster has lower per-node processing
>>>>>>>>>>>>> power, but should that make any difference? How can I ensure
>>>>>>>>>>>>> that the data is being mapped to all the nodes? Presently, the
>>>>>>>>>>>>> only node that seems to be doing all the work is the master
>>>>>>>>>>>>> node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do 15 nodes in a cluster increase the network cost? What can I
>>>>>>>>>>>>> do to set up the cluster to function more efficiently?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Mithila Nagendra
>>>>>>>>>>>>> Arizona State University
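On the conf/slaves point Jim raises above: the file is just a list of worker hostnames, one per line, read by the start-up scripts on the master. A sketch, assuming the workers are numbered upwards from node19 (only node18 and node19 are actually named in this thread; the rest are made up):

node19
node20
node21

...and so on, one line per worker: 14 entries for a 15-node cluster if node18 acts only as the master.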
