Hello all,

Hopefully someone has a plausible explanation for the time stamp
inconsistencies we're seeing in our ARCO db (postgres).

We have found that some j_ids in arco/sge_job report a
'j_submission_time' of 1969-12-31 19:00:00, while other j_ids for the
same job report expected, current values.

Please see a sanitized output below with the UNIX epoch values:

  j_id   | j_job_number | j_task_number | j_pe_taskid | j_job_name | j_group | j_owner | j_account | j_priority |  j_submission_time  | j_project | j_department
---------+--------------+---------------+-------------+------------+---------+---------+-----------+------------+---------------------+-----------+--------------
 7356583 |      2906834 |            -1 | 1.host-1    | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356104 |      2906834 |            -1 | 1.host-25   | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356103 |      2906834 |            -1 | 1.host-29   | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356101 |      2906834 |            -1 | 1.host-21   | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356096 |      2906834 |            -1 | 1.host-27   | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356062 |      2906834 |            -1 | 1.host-8    | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
 7356052 |      2906834 |            -1 | 1.host-3    | Re95064    | user    | user    | sge       |          0 | 1969-12-31 19:00:00 | NONE      | cefm
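
As a sanity check on where that value comes from, here is a minimal
Python sketch showing that a UNIX epoch value of 0 renders as
1969-12-31 19:00:00 when displayed at UTC-5. The UTC-5 offset is an
assumption inferred from the 19:00:00 value (the server's actual time
zone setting is not shown here), and render_epoch is a hypothetical
helper name, not anything from ARCO.

```python
from datetime import datetime, timedelta, timezone

# Assumption: time stamps are displayed at UTC-5, inferred from the
# 19:00:00 value; the real time zone setting is not stated.
UTC_MINUS_5 = timezone(timedelta(hours=-5))

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def render_epoch(seconds, tz=UTC_MINUS_5):
    """Format a UNIX epoch value in the same layout as the listing."""
    moment = UNIX_EPOCH + timedelta(seconds=seconds)
    return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


# A submission time of 0 seconds since the epoch
print(render_epoch(0))
```

If that assumption holds, the odd rows look like a submission time of
0 was stored, rather than a clock that was merely a few seconds or
minutes off.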

Please see a sanitized output from the same job with expected time stamps:

  j_id   | j_job_number | j_task_number | j_pe_taskid | j_job_name | j_group | j_owner | j_account | j_priority |  j_submission_time  | j_project | j_department
---------+--------------+---------------+-------------+------------+---------+---------+-----------+------------+---------------------+-----------+--------------
 7395559 |      2906834 |            -1 | 1.host-3    | Re95064    | user    | user    | sge       |          5 | 2014-09-29 11:45:09 | NONE      | cefm
 7395560 |      2906834 |            -1 | 1.host-8    | Re95064    | user    | user    | sge       |          5 | 2014-09-29 11:45:09 | NONE      | cefm
 7395561 |      2906834 |            -1 | 1.host-27   | Re95064    | user    | user    | sge       |          5 | 2014-09-29 11:45:09 | NONE      | cefm
 7395562 |      2906834 |            -1 | 1.host-1    | Re95064    | user    | user    | sge       |          5 | 2014-09-29 11:45:09 | NONE      | cefm
 7395563 |      2906834 |            -1 | 1.host-21   | Re95064    | user    | user    | sge       |          5 | 2014-09-29 11:45:09 | NONE      | cefm
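
To make the comparison concrete, the following is a rough Python
sketch that groups rows by j_job_number and flags jobs that have both
epoch-zero and normal submission times. The rows are a subset copied
from the two listings above; the function names and the tuple layout
are hypothetical, chosen only for illustration, and this is not how
ARCO itself processes the data.

```python
from collections import defaultdict

# Epoch 0 as it is displayed in the listings above
EPOCH_ZERO = "1969-12-31 19:00:00"

# (j_id, j_job_number, j_pe_taskid, j_submission_time)
# A subset of the rows shown in the two listings.
ROWS = [
    (7356583, 2906834, "1.host-1", "1969-12-31 19:00:00"),
    (7356052, 2906834, "1.host-3", "1969-12-31 19:00:00"),
    (7395559, 2906834, "1.host-3", "2014-09-29 11:45:09"),
    (7395562, 2906834, "1.host-1", "2014-09-29 11:45:09"),
]


def split_by_submission_time(rows):
    """Return {job_number: {"epoch": [j_id, ...], "valid": [j_id, ...]}}."""
    result = defaultdict(lambda: {"epoch": [], "valid": []})
    for j_id, job_number, _pe_taskid, submitted in rows:
        kind = "epoch" if submitted == EPOCH_ZERO else "valid"
        result[job_number][kind].append(j_id)
    return dict(result)


def mixed_jobs(rows):
    """Job numbers that have both epoch-zero and valid submission times."""
    grouped = split_by_submission_time(rows)
    return sorted(
        job for job, ids in grouped.items() if ids["epoch"] and ids["valid"]
    )


print(mixed_jobs(ROWS))
```

With the sample rows, job 2906834 is flagged because the same
j_pe_taskid values (for example 1.host-1 and 1.host-3) appear once
with the epoch value and once with the real submission time.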

Initially, I was thinking it had to do with either the time being out
of sync or with sge_execd being restarted on the hosts in question.
However, I did some testing and found that those were just red
herrings.  I checked qmaster and it has been stable for quite some
time in terms of clock sync and uptime of the qmaster process.

A site
(https://www.gc3.uzh.ch/blog/GridEngine_accounting_queries_with_PostgreSQL/)
suggested that jobs which have failed may manifest the NULL time stamp
value, but in my tests I deliberately failed jobs by exceeding the
wallclock limit and by calling commands which didn't exist.

Has anyone else seen this type of time stamp inconsistency in their
ARCO installations before?  If so, does anyone have a plausible idea
as to why it happens?

Thank you,
John DeSantis
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
