[slurm-dev] Re: Fixing corrupted slurm accounting?

Douglas Jacobsen Sat, 28 Oct 2017 09:34:22 -0700

A more complete response would be something like:

MariaDB [slurm_acct_db]> select * from <cluster>_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1509206400 |   1509174000 |     1506841200 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)


MariaDB [slurm_acct_db]> update <cluster>_last_ran_table set
    hourly_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    daily_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    monthly_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00');
Query OK, 1 row affected (0.05 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [slurm_acct_db]> select * from <cluster>_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1483257600 |   1483257600 |     1483257600 |
+---------------+--------------+----------------+
1 row in set (0.01 sec)

MariaDB [slurm_acct_db]> quit

Adjust the timestamps and "<cluster>" as appropriate.

Obviously mucking with the database is dangerous, so be careful.
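
If it helps, here is a rough sketch of how one might pick the reset timestamp and generate the UPDATE statement rather than typing it by hand. It is only an illustration: the function names are made up for this example, stepping back to an hour boundary is my own convenience rather than a documented requirement, and the cluster name and job start times are placeholders you would supply from your own database. It writes the epoch value directly instead of calling UNIX_TIMESTAMP(), which interprets its argument in the session time zone.

```python
import re


def rollup_reset_epoch(job_start_epochs):
    """Return an hour-aligned epoch strictly before the earliest job start.

    job_start_epochs: start times (Unix epoch seconds) of the jobs whose
    end times were fixed. Hour alignment is an assumption of this sketch.
    """
    earliest = min(job_start_epochs)
    # Floor to the hour, then step back one more hour so the result is
    # strictly earlier than the earliest start even if it sits on the hour.
    return (earliest // 3600) * 3600 - 3600


def build_update_sql(cluster, epoch):
    """Build the UPDATE statement for <cluster>_last_ran_table."""
    # Only allow plain identifier characters in the cluster name, since it
    # is spliced into a table name and cannot be a bound parameter.
    if not re.fullmatch(r"[A-Za-z0-9_]+", cluster):
        raise ValueError("unexpected characters in cluster name")
    table = f"{cluster}_last_ran_table"
    return (
        f"update {table} set hourly_rollup={epoch}, "
        f"daily_rollup={epoch}, monthly_rollup={epoch};"
    )


if __name__ == "__main__":
    # Placeholder start times; substitute the real values for your jobs.
    starts = [1483257700, 1483262600]
    print(build_update_sql("mycluster", rollup_reset_epoch(starts)))
```

Review the printed statement before pasting it into the mysql client, and take a dump of the accounting database first.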

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
[email protected]

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Sat, Oct 28, 2017 at 9:17 AM, Douglas Jacobsen <[email protected]>
wrote:

> Once you've got the end times fixed, you'll need to manually update the
> timestamps in the <cluster>_last_ran table to some time point before the
> start of the earliest job fixed.  Then on the next hour mark, it'll start
> rerolling up the past data to reflect the new reality you've set in the
> database.
>
> Unfortunately I'm away from a keyboard right now so I'm not 100% certain
> of the table name.
>
> On Oct 28, 2017 09:09, "Doug Meyer" <[email protected]> wrote:
>
>> Look up orphan jobs and lost.pl (quick script to find orphans) in
>> https://groups.google.com/forum/#!forum/slurm-devel.
>>
>> Battling this myself right now.
>>
>> Thank you,
>> Doug
>>
>> On Fri, Oct 27, 2017 at 9:00 PM, Bill Broadley <[email protected]>
>> wrote:
>>
>>>
>>>
>>> I noticed crazy high numbers in my reports, things like sreport user top:
>>> Top 10 Users 2017-10-20T00:00:00 - 2017-10-26T23:59:59 (604800 secs)
>>> Use reported in Percentage of Total
>>> --------------------------------------------------------------------------------
>>>   Cluster     Login     Proper Name         Account        Used   Energy
>>> --------- --------- --------------- --------------- ----------- --------
>>>   MyClust   JoeUser        Joe User            jgrp    3710.15%    0.00%
>>>
>>> This was during a period when JoeUser hadn't submitted a single job.
>>>
>>> We have been through some slurm upgrades, figured one of the schema tweaks
>>> had confused things.  I looked in the slurm accounting table and found the
>>> job_table.  I found 80,000 jobs with no end_time, that weren't actually
>>> running.  So I set the end_time = begin time for those 80,000 jobs.  It
>>> didn't help the reports.
>>>
>>> I then tried deleting all 80,000 jobs from the job_table and that didn't
>>> help either.
>>>
>>> Is there a way to rebuild the accounting data from the information in the
>>> job_table?
>>>
>>> Or any other suggestion for getting some sane numbers out?
>>>
>>
>>
