OK, here is what I have after connecting to one node, compute010.
qconf -sconf gives me this:
#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog none
epilog none
shell_start_mode posix_compliant
login_shells bash,sh,ksh,csh,tcsh
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
enforce_user auto
load_report_time 00:00:40
max_unheard 00:05:00
reschedule_unknown 00:00:00
loglevel log_warning
administrator_mail root
set_token_cmd none
pag_cmd none
token_extend_time none
shepherd_cmd none
qmaster_params none
execd_params none
reporting_params accounting=true reporting=false \
flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs 100
gid_range 65400-65500
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs 0
max_jobs 0
auto_user_oticket 0
auto_user_fshare 0
auto_user_default_project none
auto_user_delete_time 86400
delegated_file_staging false
reprioritize 0
rlogin_daemon /usr/sbin/sshd -i
rlogin_command /usr/bin/ssh
qlogin_daemon /usr/sbin/sshd -i
qlogin_command /usr/share/gridengine/qlogin-wrapper
rsh_daemon /usr/sbin/sshd -i
rsh_command /usr/bin/ssh
jsv_url none
jsv_allowed_mod ac,h,i,e,o,j,M,N,p,w
The messages file in the spool directory:
ubuntu@compute010:~$ more /var/spool/gridengine/execd/compute010/messages
05/02/2016 18:10:11| main|compute010|E|can't find connection
05/02/2016 18:10:11| main|compute010|E|can't get configuration from qmaster -- backgrounding
05/04/2016 16:58:28| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/18/2016 17:10:36| main|compute010|W|can't register at qmaster "frontend001": abort qmaster registration due to communication errors
05/18/2016 17:37:55| main|compute010|I|controlled shutdown 6.2u5
05/18/2016 17:46:28| main|compute010|E|can't find connection
05/18/2016 17:46:28| main|compute010|E|can't get configuration from qmaster -- backgrounding
05/18/2016 17:46:31| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/20/2016 14:27:40| main|compute010|I|controlled shutdown 6.2u5
05/22/2016 17:00:28| main|compute010|E|can't find connection
05/22/2016 17:00:28| main|compute010|E|can't get configuration from qmaster -- backgrounding
05/22/2016 17:01:38| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/28/2016 03:59:31| main|compute010|I|controlled shutdown 6.2u5
05/28/2016 03:59:49| main|compute010|W|local configuration compute010 not defined - using global configuration
05/28/2016 03:59:49| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
05/30/2016 17:41:50| main|compute010|W|can't register at qmaster "compute010": abort qmaster registration due to communication errors
05/30/2016 17:41:50| main|compute010|E|commlib error: got select error (Connection refused)
05/30/2016 17:42:14| main|compute010|I|controlled shutdown 6.2u5
05/30/2016 17:58:58| main|compute010|W|local configuration compute010 not defined - using global configuration
05/30/2016 17:58:58| main|compute010|I|starting up GE 6.2u5 (lx26-amd64)
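
To pull just the warnings and errors out of a messages file like the one above, something along these lines might help. It is only a sketch in Python: the field layout (timestamp|thread|host|level|text) is inferred from the lines shown here, and parse_messages is a made-up helper name, not part of Grid Engine. Lines that do not start with a timestamp are treated as the wrapped tail of the previous entry.

```python
import re
from collections import Counter

# Start of an execd "messages" entry, as inferred from the log above, e.g.
# 05/30/2016 17:41:50| main|compute010|E|commlib error: ...
ENTRY = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|\s*([^|]+)\|([^|]+)\|([A-Z])\|(.*)$"
)


def parse_messages(text):
    """Return a list of (timestamp, host, level, message) tuples."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            ts, _thread, host, level, msg = m.groups()
            entries.append([ts, host.strip(), level, msg.strip()])
        elif entries and line.strip():
            # No timestamp: assume this is a wrapped continuation line.
            entries[-1][3] += " " + line.strip()
    return [tuple(e) for e in entries]


# A few lines copied from the log above, including the mail wrapping.
SAMPLE = """\
05/30/2016 17:41:50| main|compute010|W|can't register at qmaster
"compute010": abort qmaster registration due to communication errors
05/30/2016 17:41:50| main|compute010|E|commlib error: got select error
(Connection refused)
05/30/2016 17:42:14| main|compute010|I|controlled shutdown 6.2u5
"""

entries = parse_messages(SAMPLE)
counts = Counter(level for _ts, _host, level, _msg in entries)

# Print only warnings (W) and errors (E), skipping informational lines (I).
for ts, host, level, msg in entries:
    if level in ("E", "W"):
        print(ts, host, level, msg)
```

To run it against the real file, read /var/spool/gridengine/execd/compute010/messages and pass its contents to parse_messages instead of SAMPLE.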
Before, I had the qmaster running on all nodes (master and executors) with
no problem.
When I kill sge_qmaster on the node, sge_execd stops working because it is
not able to connect to the master.
A ping from the node to the frontend node shows that it is reachable, though
:/
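
A ping only shows that the frontend answers ICMP, while the "Connection refused" entry in the log is about a TCP connection. A rough way to test that from the node is sketched below in Python. The host name frontend001 is taken from the log; port 6444 is an assumption (a commonly used sge_qmaster port), so substitute whatever SGE_QMASTER_PORT or the sge_qmaster entry in /etc/services says on the cluster. resolve, can_connect and report are illustrative helper names.

```python
import socket


def resolve(host):
    """Return the IPv4 address the local resolver gives for host, or None."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def report(host, port):
    """Summarise name resolution and TCP reachability in one line."""
    addr = resolve(host)
    if addr is None:
        return f"{host}: name does not resolve"
    if can_connect(host, port):
        return f"{host} ({addr}): port {port} accepts connections"
    return f"{host} ({addr}): port {port} refused or timed out"


# Example call, to be run on a compute node (6444 is an assumed port):
# print(report("frontend001", 6444))
```

Running the same check against compute010 itself may also be worthwhile, since the 05/30 log entry names "compute010" as the qmaster the execd tried to register with.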
On Mon, May 30, 2016 at 11:14 AM, Bill Bryce <[email protected]> wrote:
> Okay,
>
> Can you run any qconf commands, such as ‘qconf -sconf’? Try having a look
> at the messages files for the execution daemons. They should be in
>
> $SGE_ROOT/default/spool/, which contains directories for the master and
> exec hosts (if you have this installed in a shared filesystem
> environment). You can check both the qmaster messages file and the execd
> messages files in those directories.
>
> A question. Do you have the qmaster running on one host or on many? I
> noticed that you have the ps output for compute010 and it is running a
> qmaster.
>
> Another thing you can check is whether all nodes can contact the qmaster
> machine, i.e. that the networking is configured properly. You can also make
> sure that the host naming is correct: either configure DNS properly or
> configure an /etc/hosts file for all nodes so the IP to host name mapping is
> consistent across the cluster. Grid Engine is very picky about host names.
>
>
>
> On May 30, 2016, at 1:36 PM, Radhouane Aniba <[email protected]> wrote:
>
> Hi Bill
>
> Yes I am sure
>
> This is what I get when I log in to one of the nodes and run
>
> ubuntu@compute010:~$ ps -ef | grep sge_
> sgeadmin 1254 1 0 May28 ? 00:00:39 /usr/lib/gridengine/sge_qmaster
> sgeadmin 1446 1 0 May28 ? 00:00:22 /usr/lib/gridengine/sge_execd
> ubuntu 2552 2527 0 17:36 pts/0 00:00:00 grep --color=auto sge_
>
>
> On Mon, May 30, 2016 at 10:33 AM, Bill Bryce <[email protected]> wrote:
>
>> Hi Rad,
>>
>> Are you sure that the execution daemons are running on your compute
>> nodes? Can you login to one of the nodes say ‘compute001’ and do a ps
>> looking for the execd? When an execd is functioning normally it provides
>> the load and memory, etc… none of your nodes are showing that.
>>
>> Regards,
>>
>> Bill.
>>
>> On May 30, 2016, at 1:20 PM, Radhouane Aniba <[email protected]> wrote:
>>
>> Hello all,
>>
>> I am trying to submit a simple "hello world" to test a gridengine (I used
>> it before with no problems)
>>
>> The problem is that my job is waiting in the queue forever
>>
>> The qhost command shows a weird state of the compute nodes
>>
>> HOSTNAME    ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------
>> global      -           -     -     -       -       -       -
>> compute001  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute002  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute003  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute004  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute005  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute006  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute007  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute008  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute009  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute010  lx26-amd64  4     -     31.4G   -       0.0     -
>> compute011  lx26-amd64  4     -     31.4G   -       0.0
>>
>> Normally, even when the compute nodes are idle, I see some values in the
>> LOAD and MEMUSE columns.
>>
>> I am not an SGE person, but I am familiar with all the commands. Any help
>> would be much appreciated.
>>
>> The qstat -f command shows all my nodes in the au state. I've been reading
>> a lot about it, and I understand it's an alarm state (overloaded?).
>>
>> The only heavy activity I had on the head node was a script downloading
>> 19T of data. Could the head node be the problem and not the compute nodes?
>> sge_execd is running on all the compute/exec nodes :/
>>
>> --
>> *Rad*
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>>
>> William Bryce | VP Products
>> Univa Corporation, Toronto
>> E: [email protected] | D: 647-9742841 | Toll-Free (800) 370-5320
>> W: Univa.com <http://univa.com/> | FB: facebook.com/univa.corporation |
>> T: twitter.com/Grid_Engine
>>
>>
>
>
> --
> *Radhouane Aniba*
> *Bioinformatics Scientist*
> *BC Cancer Agency, Vancouver, Canada*
>
--
*Radhouane Aniba*
*Bioinformatics Scientist*
*BC Cancer Agency, Vancouver, Canada*