I can add 'network file system gone AWOL on a node' to the list of
common causes, I think...
Tina
On 18/05/15 15:03, Skylar Thompson wrote:
That's been our experience too, with the second highest cause a segfault in
the user's code.
You can figure out for sure by looking at the exec daemon's messages file.
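As a rough illustration of that check, the Python sketch below filters a messages file for error-level entries that mention a given job. It assumes the usual pipe-separated layout (timestamp|component|host|level|message) with E and C marking error and critical entries, and that the file lives somewhere like $SGE_ROOT/$SGE_CELL/spool/<host>/messages; both can differ between installations, and the function name is made up for this example.

```python
from typing import Iterable, List


def error_lines(lines: Iterable[str], job_id: str) -> List[str]:
    """Return the message text of error-level lines that mention job_id.

    Assumption: each line has five pipe-separated fields,
    timestamp|component|host|level|message.
    """
    hits = []
    for line in lines:
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # line does not follow the assumed layout; skip it
        level, message = parts[3].strip(), parts[4]
        # Keep only error (E) and critical (C) entries for this job.
        if level in ("E", "C") and job_id in message:
            hits.append(message)
    return hits
```

Passing an open file handle for the exec host's messages file together with the job number from the queue error (8169748 in this thread) should narrow the log down to the entries worth reading first.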
On Mon, May 18, 2015 at 02:52:15PM +0200, Nicolás Serrano Martínez-Santos wrote:
It can be caused by multiple issues. The most common cause in my department is
that the HDD of the execution host is full, so Grid Engine puts the host in
error state to prevent further failures.
NiCo
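A full disk is easy to rule out before digging further. The sketch below is only an illustration: the function names and the 95% threshold are arbitrary choices, and which paths matter (for example the job's tmpdir or the execd spool directory) depends on the site.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems may report zero size
    return usage.used / usage.total


def nearly_full(path: str, threshold: float = 0.95) -> bool:
    """True when the filesystem holding `path` is at or above `threshold`."""
    return usage_fraction(path) >= threshold
```

Running `nearly_full()` on each execution host against the directories jobs write to would show whether a full disk is a plausible cause there.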
Excerpts from sudha.penmetsa's message of 2015-05-18 14:45:48 +0200:
Hi Gavin,
I clear the error state using qmod -c "*", but I would like to know the root
cause and how to fix the issue permanently.
Regards,
Sudha
-----Original Message-----
From: Gavin W. Burris [mailto:b...@wharton.upenn.edu]
Sent: Monday, May 18, 2015 6:08 PM
To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
Cc: users@gridengine.org
Subject: Re: [gridengine users] Grid queue goes into an error state due to one
job
Hello, Sudha.
Give this a try: qmod -c "*"
Cheers.
On 10:51AM Mon 05/18/15 +0000, sudha.penme...@wipro.com wrote:
Hi,
We have a few hosts added to a queue. Because of one single job submitted to
the queue, the whole queue goes into an error state.
As a result, no new jobs can be submitted to the queue unless we clear the
error state.
Can anyone please let me know what could be the reason for this and how to fix
it permanently?
Example:
test.q@host1 BIP 7/40 10.86 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host1
---------------------------------------------------------------------------
test.q@host2 BIP 7/40 10.74 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host2
----------------------------------------------------------------------------
test.q@host3 BIP 10/40 10.73 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host3
----------------------------------------------------------------------------
test.q@host4 BIP 8/40 11.28 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host4
----------------------------------------------------------------------------
test.q@host5 BIP 7/40 11.52 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host5
----------------------------------------------------------------------------
test.q@host6 BIP 8/40 10.41 lx24-amd64 E
queue test.q marked QERROR as result of job 8169748's failure
at host host6
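When many queue instances are affected, it can help to summarise which job each one blames. The sketch below parses text shaped like the listing above (which resembles the output of qstat -f -explain E) and returns the queue instance and job number for each error. The function name and regular expressions are illustrative assumptions and are only matched against the lines shown in this message.

```python
import re
from typing import List, Tuple

# A queue-instance line starts with something like "test.q@host1".
QUEUE_LINE = re.compile(r"^(\S+@\S+)\s")
# The explanation line names the job that put the queue in error.
REASON = re.compile(r"marked QERROR as result of job (\d+)'s failure")


def queues_in_error(text: str) -> List[Tuple[str, str]]:
    """Return (queue_instance, job_id) pairs found in qstat-style output."""
    results = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        queue_match = QUEUE_LINE.match(line)
        if queue_match:
            current = queue_match.group(1)
            continue
        reason_match = REASON.search(line)
        if reason_match and current is not None:
            results.append((current, reason_match.group(1)))
    return results
```

If every pair carries the same job number, as it does here, the failing job itself (or something it depends on across all hosts) is the first place to look.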
Regards,
Sudha
The information contained in this electronic message and any
attachments to this message are intended for the exclusive use of the
addressee(s) and may contain proprietary, confidential or privileged
information. If you are not the intended recipient, you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately and destroy all copies of this message and any
attachments. WARNING: Computer viruses can be transmitted via email.
The recipient should check this email and any attachments for the
presence of viruses. The company accepts no liability for any damage
caused by any virus transmitted by this email. www.wipro.com
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
--
Gavin W. Burris
Senior Project Leader for Research Computing
The Wharton School
University of Pennsylvania
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
--
This e-mail and any attachments may contain confidential, copyright and or
privileged material, and are for the use of the intended addressee only. If you
are not the intended addressee or an authorised recipient of the addressee
please notify us of receipt by returning the e-mail and do not use, copy,
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and
Wales with its registered office at Diamond House, Harwell Science and
Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom