And weirdly enough it has now stopped working again, after I did the experimentation for power save described in the other thread. That is really strange. At the highest verbosity level the logs just say

slurmdbd: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:9984 UID:1457 IP:192.168.2.254 CONN:13

I reconfigured and reverted things, with no change. Does anybody have any clue?
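Since the fix in the message below involved creating associations, one plausible first check (a guess, not a confirmed diagnosis) is whether the querying user's association survived the power-save experimentation; <username> is a placeholder:

    # Does the user still exist in the accounting DB, with an association
    # to the default_account created below? <username> is a placeholder.
    sacctmgr show user -s <username>
    sacctmgr show associations where user=<username> format=Cluster,Account,User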
On Tue, Oct 3, 2023 at 5:43 PM Davide DelVento <davide.quan...@gmail.com> wrote:

> For others potentially seeing this in a mailing list search: yes, I needed
> that, which of course required creating an account to charge, which I wasn't
> using. So I ran
>
> sacctmgr add account default_account
> sacctmgr add -i user $user Accounts=default_account
>
> with appropriate looping around for $user (see the sketch after this
> message), and everything is working fine now.
>
> Thanks everybody!
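A minimal sketch of such a loop, assuming an LDAP or local passwd setup in which regular users have UIDs of 1000 and above (the cutoff and the filtering are assumptions to adjust per site):

    # Create the charge account once (prompts for confirmation), then add
    # every regular user to it; -i auto-answers sacctmgr's prompts.
    # The UID >= 1000 cutoff is an assumption; adjust it for your site.
    sacctmgr add account default_account
    getent passwd | awk -F: '$3 >= 1000 {print $1}' | while read -r user; do
        sacctmgr add -i user "$user" Accounts=default_account
    done

Paul's job_submit.lua approach in the next message reaches the same end state lazily, adding each user to the database on first job submission instead of all upfront.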
> On Tue, Oct 3, 2023 at 7:44 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>
>> You will probably need to.
>>
>> The way we handle it is that we add users when they first submit a job via
>> the job_submit.lua script. This way the database autopopulates with active
>> users.
>>
>> -Paul Edmon-
>>
>> On 10/3/23 9:01 AM, Davide DelVento wrote:
>>
>> By increasing the slurmdbd verbosity level, I got additional information,
>> namely the following:
>>
>> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>> slurmdbd: debug: accounting_storage/as_mysql: as_mysql_jobacct_process_get_jobs: User xxxxxx has no associations, and is not admin, so not returning any jobs.
>>
>> where again xxxxxx is the POSIX ID of the user who's running the query, as
>> it appears in the slurmdbd logs.
>>
>> I suspect this is due to the fact that our user base is small enough (we
>> are a department HPC) that we don't need to use allocations and the like,
>> so I have not configured any associations (and have not even studied their
>> configuration, since when I was at another place, which did use
>> associations, someone else took care of Slurm administration).
>>
>> Anyway, I read the fantastic document by our own member at
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations
>> and in fact I have not even configured Slurm users:
>>
>> # sacctmgr show user
>>       User   Def Acct     Admin
>> ---------- ---------- ---------
>>       root       root Administ+
>> #
>>
>> So is that the issue? Should I just add all users? Any suggestions on the
>> minimal (but robust) way to do that?
>>
>> Thanks!
>>
>> On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento <davide.quan...@gmail.com> wrote:
>>
>>> Thanks Paul, this helps.
>>>
>>> I don't have any PrivateData line in either config file. According to the
>>> docs, "By default, all information is visible to all users", so this
>>> should not be an issue. I tried adding a "PrivateData=jobs" line to the
>>> conf files, just in case, but that didn't change the behavior.
>>>
>>> On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>
>>>> At least in our setup, users can see their own scripts by doing
>>>>
>>>> sacct -B -j JOBID
>>>>
>>>> I would make sure that the scripts are being stored, and check how you
>>>> have PrivateData set.
>>>>
>>>> -Paul Edmon-
>>>>
>>>> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>>>
>>>> I deployed the job_script archival and it is working; however, it can be
>>>> queried only by root.
>>>>
>>>> A regular user can run sacct -lj against any job (even those of other
>>>> users, and that's okay in our setup) with no problem. However, if they run
>>>>
>>>> sacct -j job_id --batch-script
>>>>
>>>> even against a job they own themselves, nothing is returned and I get a
>>>>
>>>> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>>>>
>>>> where xxxxxx is the POSIX ID of the user who's running the query, in the
>>>> slurmdbd logs.
>>>>
>>>> Both config files, slurmdbd.conf and slurm.conf, have no "permission"
>>>> setting. FWIW, we use LDAP.
>>>>
>>>> Is that the expected behavior, in that by default only root can see the
>>>> job scripts? I was assuming the users themselves should be able to debug
>>>> their own jobs... Any hint on what could be changed to achieve this?
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento <davide.quan...@gmail.com> wrote:
>>>>
>>>>> Fantastic, this is really helpful, thanks!
>>>>>
>>>>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>>>
>>>>>> Yes, it was later than that. If you are on 23.02 you are good. We've
>>>>>> been running with storing job_scripts on for years at this point, and
>>>>>> that part of the database only uses up 8.4G. Our entire database takes
>>>>>> up 29G on disk, so it's about 1/3 of the database. We also have
>>>>>> database compression, which helps with the on-disk size. Raw and
>>>>>> uncompressed, our database is about 90G. We keep 6 months of data in
>>>>>> our active database.
>>>>>>
>>>>>> -Paul Edmon-
>>>>>>
>>>>>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>>>>
>>>>>> Sorry for the duplicate e-mail in a short time: do you (or anyone) know
>>>>>> when the hashing was added? We were planning to enable this on 21.08,
>>>>>> but we then had to delay our upgrade to it. I'm assuming it was later
>>>>>> than that, as I believe that's when the feature was added.
>>>>>>
>>>>>> On Sep 28, 2023, at 13:55, Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>>>>
>>>>>> Thank you; we'll put in a feature request for improvements in that
>>>>>> area, and also thanks for the warning. I thought of that in passing,
>>>>>> but the real-world experience is really useful. I could easily see
>>>>>> wanting that stuff to be retained less often than the main records,
>>>>>> which is what I'd ask for.
>>>>>>
>>>>>> I assume that archiving, in general, would also remove this stuff,
>>>>>> since old jobs themselves will be removed?
>>>>>>
>>>>>> --
>>>>>> Ryan Novosielski - novos...@rutgers.edu
>>>>>> Sr. Technologist, Office of Advanced Research Computing, Rutgers
>>>>>>
>>>>>> On Sep 28, 2023, at 13:48, Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>>>>
>>>>>> Slurm should take care of it when you add it.
>>>>>>
>>>>>> As far as horror stories go: under previous versions our database
>>>>>> ballooned to be so massive that it actually prevented us from
>>>>>> upgrading, and we had to drop the columns containing the job_script and
>>>>>> job_env. This was back before Slurm started hashing the scripts so that
>>>>>> it would only store one copy of duplicate scripts. After that point we
>>>>>> found that the job_script data stayed at a fairly reasonable size, as
>>>>>> most users use functionally the same script each time. However, the
>>>>>> job_env data continued to grow like crazy, as there are variables in
>>>>>> our environment that change fairly consistently depending on where the
>>>>>> user is. Thus the job_envs ended up being too massive to keep around,
>>>>>> and we had to drop them. Frankly, we never really used them for
>>>>>> debugging. The job_scripts, though, are super useful and not that much
>>>>>> overhead.
>>>>>>
>>>>>> In summary, my recommendation is to store only job_scripts. job_envs
>>>>>> add too much storage for little gain, unless your job_envs are
>>>>>> basically the same for each user in each location.
>>>>>>
>>>>>> It should also be noted that there is no way to prune out job_scripts
>>>>>> or job_envs right now. So the only way to get rid of them if they get
>>>>>> large is to zero out the column in the table. You can ask SchedMD for
>>>>>> the MySQL command to do this, as we had to do it here for our job_envs.
>>>>>>
>>>>>> -Paul Edmon-
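For illustration of what "zeroing out the column" means in practice, a purely hypothetical sketch follows. The database, table, and column names (slurm_acct_db, mycluster_job_env_table, env_vars) are assumptions that vary by Slurm version and ClusterName; get the exact statement from SchedMD as Paul advises, and back up the database first.

    # Hypothetical sketch only: the names below are assumptions, not the
    # real schema. Back up first, then blank the stored environments.
    mysqldump slurm_acct_db > slurm_acct_db.backup.sql
    mysql slurm_acct_db -e "UPDATE mycluster_job_env_table SET env_vars = '';"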
>>>>>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>>>>
>>>>>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>>>>>> I only have
>>>>>>
>>>>>> AccountingStoreFlags=job_comment
>>>>>>
>>>>>> I now intend to add both
>>>>>>
>>>>>> AccountingStoreFlags=job_script
>>>>>> AccountingStoreFlags=job_env
>>>>>>
>>>>>> leaving the default 4MB value for max_script_size.
>>>>>>
>>>>>> Do I need to do anything on the DB myself, or will Slurm take care of
>>>>>> the additional tables if needed?
>>>>>>
>>>>>> Any comments/suggestions/gotchas/pitfalls/horror stories to share? I
>>>>>> know about the additional disk space and the potential load, and with
>>>>>> our resources and typical workload I should be okay with that.
>>>>>>
>>>>>> Thanks!
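One detail worth flagging for anyone copying the snippet above: in slurm.conf, AccountingStoreFlags takes a comma-separated list, and repeated keys are not merged, so the intended end state is a single line:

    AccountingStoreFlags=job_comment,job_script,job_env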