Reuti,

I currently have an open ticket with the machine's vendor. In my opinion this has pointed to a hardware issue since the first "auto-reboot". The /var/log/messages file seems to confirm my theory. I wanted to keep the Grid's perfect track record clear.
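For reference, a minimal sketch of the kind of log review this is based on (the date/time pattern is a placeholder for the actual reboot time; a syslog-style /var/log/messages and a working wtmp are assumed):

# list the recent reboots recorded in wtmp
$ last -x reboot | head

# show what was logged in the minutes before a given reboot
$ grep 'Nov 27 20:3' /var/log/messages | tail -50

# scan for hardware-related entries: machine checks, ECC/EDAC, thermal, watchdog
$ egrep -i 'mce|edac|ecc|thermal|watchdog|panic' /var/log/messages | tail -50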
I have limited access to this affected server. There are two things I want to verify physically: that all power plugs are connected, and that there are no outright visible connection problems (damaged switch port, damaged power cables). At this point I don't want to open the chassis, if only because of the open vendor ticket. I feel it is the mobo or the BIOS; this is all based on /var/log/messages and the best cluster tool in the world, SGE!

I am glad you guys solved your power issue; that was a great solution. With the growing use of GPUs instead of CPUs, I too have been faced with power issues. My final resolution was to split the three "power hog" servers across 5 different circuits!

Thanks Reuti!

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, November 29, 2016 2:10 PM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib

On 29.11.2016 at 20:08, Coleman, Marcus [JRDUS Non-J&J] wrote:

> Reuti Thanks for the information!!!
>
> Any idea on what is causing the reboot?

There are several possibilities:

- oom-killer (less likely when there are no jobs on the node)
- uncorrectable ECC error
- heat problem due to the die detaching from the heat spreader inside the CPU
- unreliable power supply
- peaks/outages on the mains, with the machine set to boot after power failure
- other problems on the mainboard, like broken capacitors, which can be spotted by a swelling on their top and potentially some brown spots thereon

Is there anything mentioned in /var/log/messages just before the reboot?

Once, in a cluster, we faced (most likely due to construction work in the neighborhood) that from time to time:

- some nodes were frozen
- some nodes rebooted
- some nodes were shut down
- some nodes survived

We should have used the node IDs to play these numbers in a lottery. Essentially we bought an on-line UPS with a short retention time of 5 minutes, but its main purpose was to have the AC/DC and DC/AC conversion in place all the time to filter the mains. The problems went away.

-- Reuti

> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Tuesday, November 29, 2016 6:02 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users@gridengine.org
> Subject: [EXTERNAL] Re: Re: [gridengine users] commlib
>
>
>> On 29.11.2016 at 00:17, Coleman, Marcus [JRDUS Non-J&J] <mcole...@its.jnj.com> wrote:
>>
>> Reuti
>>
>> So it rebooted again without any jobs running... and I don't understand "sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list", but as you see I got added back???
>
> Yes, there is an auto delete time for users which were added automatically due to a job submission.
>
> $ qconf -suser mcolem19
>
> will show when the next deletion will take place (unless you set it to 0).
>
> $ qconf -suserl
>
> shows all currently known users.
>
> -- Reuti
>
>> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
>> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
>> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 13:25:54|worker|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com added "mcolem19" to user list
>>
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Monday, November 28, 2016 11:55 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users@gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>
>>
>> On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>
>>> Thanks Reuti!
>>>
>>> I was hoping it was something there... Any ideas on where to go from here?
>>
>> What do:
>>
>> $ ./gethostbyname -all padme
>> $ ./gethostbyaddr -all 192.168.1.159
>>
>> show on the node and headnode?
>>
>> -- Reuti
>>
>>
>>> -----Original Message-----
>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>> Sent: Sunday, November 27, 2016 4:37 AM
>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>> Cc: users@gridengine.org
>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>>
>>>
>>> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>>
>>>> Hi Reuti
>>>>
>>>> I am not sure what I am looking for... but here is the contents of /tmp on the rebooting node. Any outright issues you can see?
>>>>
>>>> [root@padme tmp]# ls -l
>>>> total 20
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:09 jmonitor.mcolem19.37995
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:35 jmonitor.mcolem19.38497
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38615
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38624
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28331
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28377
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:40 jmonitor.schrogpu.31781
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:41 jmonitor.schrogpu.31829
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5042
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5043
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:08 jmonitor.schrogpu.8041
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8220
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:26 jmonitor.schrogpu.8346
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8557
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.8740
>>>> drwx------ 2 root root 4096 Nov 4 16:09 keyring-6CWKlB
>>>> drwxrwxrwx 2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28352
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28400
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28480
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28487
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31802
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31850
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:40 mmjob.schrogpu.31876
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:41 mmjob.schrogpu.31891
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:08 mmjob.schrogpu.8087
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8266
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:26 mmjob.schrogpu.8392
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8603
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.8787
>>>> drwx------ 2 gdm gdm 4096 Nov 25 07:42 orbit-gdm
>>>> drwx------. 2 gdm gdm 4096 Nov 25 07:42 pulse-5mlDwNemaGym
>>>> drwx------ 2 root root 4096 Nov 4 16:09 pulse-GAI9xhuCTgeg
>>>
>>> Thx, I was looking for a file created by the execd in case it faces problems during startup. Such files will be saved in /tmp as a last resort for the logfiles. Unfortunately there are none, hence the startup per se was successful.
>>>
>>>
>>>> [root@padme tmp]#
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>>> Sent: Saturday, November 26, 2016 6:31 AM
>>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>>> Cc: users@gridengine.org
>>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>>>
>>>> Hi,
>>>>
>>>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>>>
>>>>> I am having an issue with a node rebooting. I am running Desmond FEP jobs...
>>>>>
>>>>> Thanks for any help in advance!
>>>>>
>>>>> /etc/resolv.conf is the same on all nodes.
>>>>> /etc/hosts is the same on all nodes.
>>>>> All nodes are connected to the same switch in a server rack.
>>>>>
>>>>> ################### from NODE
>>>>> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>>>> rndusljpp2.na.jnj.com
>>>>> [root@padme lx-amd64]# ./gethostbyname -name s1
>>>>> rndusljpp2.na.jnj.com
>>>>>
>>>>> ################### from QMASTER
>>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>>>> padme
>>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>>>> padme
>>>
>>> What do:
>>>
>>> $ ./gethostbyname -all padme
>>> $ ./gethostbyaddr -all 192.168.1.159
>>>
>>> show?
>>>
>>> -- Reuti
>>>
>>
>>
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
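As a closing note on the "endpoint is not unique" commlib errors quoted above: they appear to be a symptom of the reboots rather than their cause, i.e. the execd on padme re-registering with the qmaster while the old connection is still held. A minimal sketch of SGE-side follow-up checks, assuming a sourced settings.sh, the default cell and spool layout, and the usual execd port 6445 (all assumptions, not taken from this cluster):

# exactly one sge_execd should be running on the node
$ ps -ef | grep '[s]ge_execd'

# verify the qmaster host can reach the execd on padme
$ qping padme 6445 execd 1

# the execd's own log on the node, to correlate with the qmaster messages file
$ tail -50 $SGE_ROOT/default/spool/padme/messages

# the automatic removal of "mcolem19" from the user list is governed by these
# global settings; setting auto_user_delete_time to 0 keeps auto-added users
$ qconf -sconf | egrep 'enforce_user|auto_user_delete_time'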