Hi Aaron, I will look into that, thanks! I spoke to the admin who oversees the cluster. He said that the gateway comes into the picture only when one of the nodes communicates with a node outside of the cluster. But in my case the communication is carried out between nodes that all belong to the same cluster.
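In case the gateway does turn out to be involved, the tunnel setup from Aaron's blog post boils down to one ssh forward per datanode. A rough sketch — the gateway hostname below is invented; only node18 and port 54310 come from this thread:

```shell
# Hypothetical gateway host; node18:54310 is the namenode RPC address
# mentioned elsewhere in this thread.
GATEWAY=gateway.example.edu
NAMENODE=node18
PORT=54310

# -N: no remote command, -f: background after authentication.
# Each datanode outside the gateway would run this, then point
# fs.default.name at localhost:54310.
CMD="ssh -N -f -L ${PORT}:${NAMENODE}:${PORT} ${GATEWAY}"
echo "$CMD"
```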
Mithila

On Wed, Apr 15, 2009 at 8:59 PM, Aaron Kimball <[email protected]> wrote:
> Hi,
>
> I wrote a blog post a while back about connecting nodes via a gateway. See
> http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
>
> This assumes that the client is outside the gateway and all
> datanodes/namenode are inside, but the same principles apply. You'll just
> need to set up ssh tunnels from every datanode to the namenode.
>
> - Aaron
>
> On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari <[email protected]> wrote:
>
>> Looks like your NameNode is down.
>> Verify that the Hadoop processes are running (jps should show you all the
>> running Java processes).
>> If your Hadoop processes are running, try restarting them.
>> I guess this problem is due to your fsimage not being correct.
>> You might have to format your namenode.
>> Hope this helps.
>>
>> Thanks,
>> --
>> Ravi
>>
>> On 4/15/09 10:15 AM, "Mithila Nagendra" <[email protected]> wrote:
>>
>> The log file runs into thousands of lines with the same message being
>> displayed every time.
>>
>> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra <[email protected]> wrote:
>>
>> > The log file hadoop-mithila-datanode-node19.log.2009-04-14 has the
>> > following in it:
>> >
>> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
>> > /************************************************************
>> > STARTUP_MSG: Starting DataNode
>> > STARTUP_MSG:   host = node19/127.0.0.1
>> > STARTUP_MSG:   args = []
>> > STARTUP_MSG:   version = 0.18.3
>> > STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18
>> > -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
>> > ************************************************************/
>> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
>> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
>> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
>> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> > 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> > 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> > 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> > 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> > 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
>> > 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
>> > 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at node18/192.168.0.18:54310 not available yet, Zzzzz...
>> > 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> > 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> > 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> >
>> > Hmm, I still can't figure it out...
>> >
>> > Mithila
>> >
>> > On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra <[email protected]> wrote:
>> >
>> >> Also, would the way the port is accessed change if all these nodes are
>> >> connected through a gateway? I mean in the hadoop-site.xml file? The
>> >> Ubuntu systems we worked with earlier didn't have a gateway.
>> >>
>> >> Mithila
>> >>
>> >> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra <[email protected]> wrote:
>> >>
>> >>> Aaron: Which log file do I look into - there are a lot of them. Here's
>> >>> what the error looks like:
>> >>>
>> >>> [mith...@node19:~]$ cd hadoop
>> >>> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
>> >>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
>> >>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 1 time(s).
>> >>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 2 time(s).
>> >>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 3 time(s).
>> >>> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 4 time(s).
>> >>> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 5 time(s).
>> >>> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 6 time(s).
>> >>> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 7 time(s).
>> >>> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 8 time(s).
>> >>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 9 time(s).
>> >>> Bad connection to FS. command aborted.
>> >>>
>> >>> Node19 is a slave and Node18 is the master.
>> >>>
>> >>> Mithila
>> >>>
>> >>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball <[email protected]> wrote:
>> >>>
>> >>>> Are there any error messages in the log files on those nodes?
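Two quick checks that narrow down a "Retrying connect" loop like the one above, along the lines of Ravi's jps suggestion: confirm a NameNode JVM is actually running on the master, and confirm the RPC port is reachable from a slave. A sketch, assuming jps and nc are installed; only the hostname and port come from the thread:

```shell
# On node18 (the master): a healthy HDFS master shows a NameNode process.
jps 2>/dev/null | grep NameNode || echo "no NameNode process found"

# From node19 (a slave): can we open a plain TCP connection to the RPC port?
NAMENODE=node18
PORT=54310
if nc -z -w 5 "$NAMENODE" "$PORT" 2>/dev/null; then
  REACHABLE=yes
else
  REACHABLE=no   # matches the endless "Retrying connect" in the datanode log
fi
echo "reachable: $REACHABLE"
```

If the process is up but the port is unreachable from slaves, the problem is name resolution, binding, or a firewall rather than HDFS itself.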
>> >>>> - Aaron
>> >>>>
>> >>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra <[email protected]> wrote:
>> >>>>
>> >>>> > I've drawn a blank here! I can't figure out what's wrong with the
>> >>>> > ports. I can ssh between the nodes but can't access the DFS from the
>> >>>> > slaves - it says "Bad connection to DFS". The master seems to be fine.
>> >>>> >
>> >>>> > Mithila
>> >>>> >
>> >>>> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra <[email protected]> wrote:
>> >>>> >
>> >>>> > > Yes I can..
>> >>>> > >
>> >>>> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky <[email protected]> wrote:
>> >>>> > >
>> >>>> > >> Can you ssh between the nodes?
>> >>>> > >>
>> >>>> > >> -jim
>> >>>> > >>
>> >>>> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra <[email protected]> wrote:
>> >>>> > >>
>> >>>> > >> > Thanks Aaron.
>> >>>> > >> >
>> >>>> > >> > Jim: The three-node cluster I set up earlier had Ubuntu running on
>> >>>> > >> > it, and the DFS was accessed at port 54310. The new cluster I've
>> >>>> > >> > set up has Red Hat Linux release 7.2 (Enigma) running on it. Now
>> >>>> > >> > when I try to access the DFS from one of the slaves I get the
>> >>>> > >> > following response: dfs cannot be accessed. When I access the DFS
>> >>>> > >> > through the master there's no problem. So I feel there's a problem
>> >>>> > >> > with the port. Any ideas? I did check the list of slaves; it looks
>> >>>> > >> > fine to me.
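One thing worth ruling out here, given the "host = node19/127.0.0.1" line in the datanode log earlier in the thread: if /etc/hosts maps the node names to 127.0.0.1 (a common default on older Red Hat installs), the namenode can end up bound to loopback, so the master works locally but no slave can connect. Every node's hadoop-site.xml should name the master by an address the slaves resolve identically; roughly, for 0.18:

```xml
<!-- hadoop-site.xml, same on every node; a sketch, not the thread's actual file -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- node18 must resolve to 192.168.0.18 (not 127.0.0.1) on all nodes -->
    <value>hdfs://node18:54310</value>
  </property>
</configuration>
```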
>> >>>> > >> >
>> >>>> > >> > Mithila
>> >>>> > >> >
>> >>>> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <[email protected]> wrote:
>> >>>> > >> >
>> >>>> > >> > > Mithila,
>> >>>> > >> > >
>> >>>> > >> > > You said all the slaves were being utilized in the 3-node cluster.
>> >>>> > >> > > Which application did you run to test that, and what was your
>> >>>> > >> > > input size? If you tried the word count application on a 516 MB
>> >>>> > >> > > input file on both cluster setups, then some of your nodes in the
>> >>>> > >> > > 15-node cluster may not be running at all. Generally, one map
>> >>>> > >> > > task is assigned to each input split, and if you are running your
>> >>>> > >> > > cluster with the defaults, the splits are 64 MB each. I got
>> >>>> > >> > > confused when you said the NameNode seemed to do all the work.
>> >>>> > >> > > Can you check conf/slaves and make sure you put the names of all
>> >>>> > >> > > the task trackers there? I also suggest comparing both clusters
>> >>>> > >> > > with a larger input size, say at least 5 GB, to really see a
>> >>>> > >> > > difference.
>> >>>> > >> > >
>> >>>> > >> > > Jim
>> >>>> > >> > >
>> >>>> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <[email protected]> wrote:
>> >>>> > >> > >
>> >>>> > >> > > > In hadoop-*-examples.jar, use "randomwriter" to generate the
>> >>>> > >> > > > data and "sort" to sort it.
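Jim's arithmetic, made concrete: with the default 64 MB splits, a 516 MB file yields only about 9 map tasks, so at most 9 of the 15 nodes can get map work regardless of configuration. And Aaron's suggested benchmark is just two job submissions; a sketch, with the jar name matching the 0.18.3 version seen in the log and made-up output directory names:

```shell
# ceil(516 / 64) = 9 map tasks for a 516 MB input at the default split size
SPLITS=$(( (516 + 64 - 1) / 64 ))
echo "expected map tasks: $SPLITS"

# Fill HDFS with random data (randomwriter writes multiple gigabytes per
# node by default), then sort it; run from the Hadoop install directory.
bin/hadoop jar hadoop-0.18.3-examples.jar randomwriter rand-data
bin/hadoop jar hadoop-0.18.3-examples.jar sort rand-data rand-sorted
```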
>> >>>> > >> > > > - Aaron
>> >>>> > >> > > >
>> >>>> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <[email protected]> wrote:
>> >>>> > >> > > >
>> >>>> > >> > > > > Your data is too small, I guess, for 15 nodes, so the overhead
>> >>>> > >> > > > > of those nodes might be making your total MR jobs more
>> >>>> > >> > > > > time-consuming. I guess you will have to try with a larger set
>> >>>> > >> > > > > of data.
>> >>>> > >> > > > >
>> >>>> > >> > > > > Pankil
>> >>>> > >> > > > >
>> >>>> > >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <[email protected]> wrote:
>> >>>> > >> > > > >
>> >>>> > >> > > > > > Aaron
>> >>>> > >> > > > > >
>> >>>> > >> > > > > > That could be the issue - my data is just 516 MB. Wouldn't
>> >>>> > >> > > > > > this still see a bit of speedup? Could you guide me to the
>> >>>> > >> > > > > > example? I'll run my cluster on it and see what I get. Also,
>> >>>> > >> > > > > > for my program I had a Java timer running to record the time
>> >>>> > >> > > > > > taken to complete execution. Does Hadoop have an inbuilt
>> >>>> > >> > > > > > timer?
>> >>>> > >> > > > > >
>> >>>> > >> > > > > > Mithila
>> >>>> > >> > > > > >
>> >>>> > >> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <[email protected]> wrote:
>> >>>> > >> > > > > >
>> >>>> > >> > > > > > > Virtually none of the examples that ship with Hadoop are
>> >>>> > >> > > > > > > designed to showcase its speed.
>> >>>> > >> > > > > > > Hadoop's speedup comes from its ability to process very
>> >>>> > >> > > > > > > large volumes of data (starting around, say, tens of GB
>> >>>> > >> > > > > > > per job, and going up in orders of magnitude from there).
>> >>>> > >> > > > > > > So if you are timing the pi calculator (or something like
>> >>>> > >> > > > > > > that), its results won't necessarily be very consistent.
>> >>>> > >> > > > > > > If a job doesn't have enough fragments of data to allocate
>> >>>> > >> > > > > > > one per node, some of the nodes will also just go unused.
>> >>>> > >> > > > > > >
>> >>>> > >> > > > > > > The best example for you to run is to use randomwriter to
>> >>>> > >> > > > > > > fill up your cluster with several GB of random data and
>> >>>> > >> > > > > > > then run the sort program. If that doesn't scale up
>> >>>> > >> > > > > > > performance from 3 nodes to 15, then you've definitely got
>> >>>> > >> > > > > > > something strange going on.
>> >>>> > >> > > > > > >
>> >>>> > >> > > > > > > - Aaron
>> >>>> > >> > > > > > >
>> >>>> > >> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <[email protected]> wrote:
>> >>>> > >> > > > > > >
>> >>>> > >> > > > > > > > Hey all
>> >>>> > >> > > > > > > >
>> >>>> > >> > > > > > > > I recently set up a three-node Hadoop cluster and ran an
>> >>>> > >> > > > > > > > example on it.
>> >>>> > >> > > > > > > > It was pretty fast, and all three nodes were being used
>> >>>> > >> > > > > > > > (I checked the log files to make sure that the slaves
>> >>>> > >> > > > > > > > were utilized).
>> >>>> > >> > > > > > > >
>> >>>> > >> > > > > > > > Now I've set up another cluster consisting of 15 nodes.
>> >>>> > >> > > > > > > > I ran the same example, but instead of speeding up, the
>> >>>> > >> > > > > > > > map-reduce task seems to take forever! The slaves are
>> >>>> > >> > > > > > > > not being used for some reason. This second cluster has
>> >>>> > >> > > > > > > > lower per-node processing power, but should that make
>> >>>> > >> > > > > > > > any difference? How can I ensure that the data is being
>> >>>> > >> > > > > > > > mapped to all the nodes? Presently, the only node that
>> >>>> > >> > > > > > > > seems to be doing all the work is the master node.
>> >>>> > >> > > > > > > >
>> >>>> > >> > > > > > > > Do 15 nodes in a cluster increase the network cost?
>> >>>> > >> > > > > > > > What can I do to set up the cluster to function more
>> >>>> > >> > > > > > > > efficiently?
>> >>>> > >> > > > > > > >
>> >>>> > >> > > > > > > > Thanks!
>> >>>> > >> > > > > > > > Mithila Nagendra
>> >>>> > >> > > > > > > > Arizona State University
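On Mithila's timer question from earlier in the thread: there is no single built-in stopwatch, but the JobTracker web UI (port 50030 by default in this era of Hadoop) shows each job's start and finish times, and the job client prints progress as it runs. Wrapping the submission is the simplest external measure; a sketch with an assumed jar name and made-up input/output paths:

```shell
# Wall-clock the job from the client side; the submission blocks until
# the job completes, so this covers the full run.
START=$(date +%s)
bin/hadoop jar hadoop-0.18.3-examples.jar wordcount input output
END=$(date +%s)
echo "elapsed: $(( END - START )) s"
```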
