Reuti,

I currently have an open ticket with the machine's vendor. In my opinion this has pointed to a hardware issue since the first "auto-reboot". The /var/log/messages file seems to confirm my theory. I wanted to keep the Grid's perfect track record clear.
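For reference, a minimal sketch of the kind of log review this is based on (the date/time pattern is a placeholder for the actual reboot time; a syslog-style /var/log/messages and a working wtmp are assumed):

# list the recent reboots recorded in wtmp
$ last -x reboot | head

# show what was logged in the minutes before a given reboot
$ grep 'Nov 27 20:3' /var/log/messages | tail -50

# scan for hardware-related entries: machine checks, ECC/EDAC, thermal, watchdog
$ egrep -i 'mce|edac|ecc|thermal|watchdog|panic' /var/log/messages | tail -50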
I have limited access to this affected server. There are two things I want to verify physically: that all power plugs are connected, and that there are no outright visible connection problems (damaged switch port, damaged power cables). At this point I don't want to open the chassis, if only because of the open vendor ticket. I feel it is the mobo or the BIOS; this is all based on /var/log/messages and the best cluster tool in the world, SGE!

I am glad you guys solved your power issue; that was a great solution. With the growing use of GPUs instead of CPUs, I too have been faced with power issues. My final resolution was to split the three "power hog" servers across 5 different circuits!

Thanks Reuti!

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, November 29, 2016 2:10 PM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib

On 29.11.2016 at 20:08, Coleman, Marcus [JRDUS Non-J&J] wrote:

> Reuti Thanks for the information!!!
>
> Any idea on what is causing the reboot?

There are several possibilities:

- oom-killer (less likely when there are no jobs on the node)
- uncorrectable ECC error
- heat problem due to the die detaching from the heat spreader inside the CPU
- unreliable power supply
- peaks/outages on the mains, with the machine set to boot after power failure
- other problems on the mainboard, like broken capacitors, which can be spotted by a swelling on their top and potentially some brown spots thereon

Is there anything mentioned in /var/log/messages just before the reboot?

Once, in a cluster, we faced (most likely due to construction work in the neighborhood) that from time to time:

- some nodes were frozen
- some nodes rebooted
- some nodes were shut down
- some nodes survived

We should have used the node IDs to play these numbers in a lottery. Essentially we bought an on-line UPS with a short retention time of 5 minutes, but its main purpose was to have the AC/DC and DC/AC conversion in place all the time to filter the mains. The problems went away.

-- Reuti

> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Tuesday, November 29, 2016 6:02 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users@gridengine.org
> Subject: [EXTERNAL] Re: Re: [gridengine users] commlib
>
>
>> On 29.11.2016 at 00:17, Coleman, Marcus [JRDUS Non-J&J] <mcole...@its.jnj.com> wrote:
>>
>> Reuti
>>
>> So it rebooted again without any jobs running... and I don't understand "sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list", but as you see I got added back???
>
> Yes, there is an auto delete time for users which were added automatically due to a job submission.
>
> $ qconf -suser mcolem19
>
> will show when the next deletion will take place (unless you set it to 0).
>
> $ qconf -suserl
>
> shows all currently known users.
>
> -- Reuti
>
>> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
>> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
>> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
>> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
>> 11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
>> 11/28/2016 13:25:54|worker|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com added "mcolem19" to user list
>>
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Monday, November 28, 2016 11:55 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users@gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>
>>
>> On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>
>>> Thanks Reuti!
>>>
>>> I was hoping it was something there... Any ideas on where to go from here?
>>
>> What do:
>>
>> $ ./gethostbyname -all padme
>> $ ./gethostbyaddr -all 192.168.1.159
>>
>> show on the node and headnode?
>>
>> -- Reuti
>>
>>
>>> -----Original Message-----
>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>> Sent: Sunday, November 27, 2016 4:37 AM
>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>> Cc: users@gridengine.org
>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>>
>>>
>>> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>>
>>>> Hi Reuti
>>>>
>>>> I am not sure what I am looking for... but here is the contents of /tmp on the rebooting node. Any outright issues you can see?
>>>>
>>>> [root@padme tmp]# ls -l
>>>> total 20
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:09 jmonitor.mcolem19.37995
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:35 jmonitor.mcolem19.38497
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38615
>>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38624
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28331
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28377
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:40 jmonitor.schrogpu.31781
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:41 jmonitor.schrogpu.31829
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5042
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5043
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:08 jmonitor.schrogpu.8041
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8220
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:26 jmonitor.schrogpu.8346
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8557
>>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.8740
>>>> drwx------ 2 root root 4096 Nov 4 16:09 keyring-6CWKlB
>>>> drwxrwxrwx 2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28352
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28400
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28480
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28487
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31802
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31850
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:40 mmjob.schrogpu.31876
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:41 mmjob.schrogpu.31891
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:08 mmjob.schrogpu.8087
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8266
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:26 mmjob.schrogpu.8392
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8603
>>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.8787
>>>> drwx------ 2 gdm gdm 4096 Nov 25 07:42 orbit-gdm
>>>> drwx------. 2 gdm gdm 4096 Nov 25 07:42 pulse-5mlDwNemaGym
>>>> drwx------ 2 root root 4096 Nov 4 16:09 pulse-GAI9xhuCTgeg
>>>
>>> Thx, I was looking for a file created by the execd in case it faces problems during startup. Such files will be saved in /tmp as a last resort for the logfiles. Unfortunately there are none, hence the startup per se was successful.
>>>
>>>
>>>> [root@padme tmp]#
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>>> Sent: Saturday, November 26, 2016 6:31 AM
>>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>>> Cc: users@gridengine.org
>>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>>>
>>>> Hi,
>>>>
>>>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>>>
>>>>> I am having an issue with a node rebooting. I am running Desmond FEP jobs...
>>>>>
>>>>> Thanks for any help in advance!
>>>>>
>>>>> /etc/resolv.conf is the same on all nodes.
>>>>> /etc/hosts is the same on all nodes.
>>>>> All nodes are connected to the same switch in a server rack.
>>>>>
>>>>> ################### from NODE
>>>>> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>>>> rndusljpp2.na.jnj.com
>>>>> [root@padme lx-amd64]# ./gethostbyname -name s1
>>>>> rndusljpp2.na.jnj.com
>>>>>
>>>>> ################### from QMASTER
>>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>>>> padme
>>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>>>> padme
>>>
>>> What do:
>>>
>>> $ ./gethostbyname -all padme
>>> $ ./gethostbyaddr -all 192.168.1.159
>>>
>>> show?
>>>
>>> -- Reuti
>>>
>>
>>
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
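As a closing note on the "endpoint is not unique" commlib errors quoted above: they appear to be a symptom of the reboots rather than their cause, i.e. the execd on padme re-registering with the qmaster while the old connection is still held. A minimal sketch of SGE-side follow-up checks, assuming a sourced settings.sh, the default cell and spool layout, and the usual execd port 6445 (all assumptions, not taken from this cluster):

# exactly one sge_execd should be running on the node
$ ps -ef | grep '[s]ge_execd'

# verify the qmaster host can reach the execd on padme
$ qping padme 6445 execd 1

# the execd's own log on the node, to correlate with the qmaster messages file
$ tail -50 $SGE_ROOT/default/spool/padme/messages

# the automatic removal of "mcolem19" from the user list is governed by these
# global settings; setting auto_user_delete_time to 0 keeps auto-added users
$ qconf -sconf | egrep 'enforce_user|auto_user_delete_time'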