And weirdly enough it has now stopped working again, after I did the experimentation for power save described in the other thread. That is really strange. At the highest verbosity level the logs just say

slurmdbd: debug: REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:9984 UID:1457 IP:192.168.2.254 CONN:13

I reconfigured and reverted things, with no change. Does anybody have any clue?
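Since the fix in the message below involved creating associations, one plausible first check (a guess, not a confirmed diagnosis) is whether the querying user's association survived the power-save experimentation; <username> is a placeholder:

    # Does the user still exist in the accounting DB, with an association
    # to the default_account created below? <username> is a placeholder.
    sacctmgr show user -s <username>
    sacctmgr show associations where user=<username> format=Cluster,Account,User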
On Tue, Oct 3, 2023 at 5:43 PM Davide DelVento <davide.quan...@gmail.com> wrote:

> For others potentially seeing this in a mailing list search: yes, I needed
> that, which of course required creating an account to charge, which I wasn't
> using. So I ran
>
> sacctmgr add account default_account
> sacctmgr add -i user $user Accounts=default_account
>
> with appropriate looping around for $user (see the sketch after this
> message), and everything is working fine now.
>
> Thanks everybody!
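A minimal sketch of such a loop, assuming an LDAP or local passwd setup in which regular users have UIDs of 1000 and above (the cutoff and the filtering are assumptions to adjust per site):

    # Create the charge account once (prompts for confirmation), then add
    # every regular user to it; -i auto-answers sacctmgr's prompts.
    # The UID >= 1000 cutoff is an assumption; adjust it for your site.
    sacctmgr add account default_account
    getent passwd | awk -F: '$3 >= 1000 {print $1}' | while read -r user; do
        sacctmgr add -i user "$user" Accounts=default_account
    done

Paul's job_submit.lua approach in the next message reaches the same end state lazily, adding each user to the database on first job submission instead of all upfront.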
> On Tue, Oct 3, 2023 at 7:44 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>
>> You will probably need to.
>>
>> The way we handle it is that we add users when they first submit a job via
>> the job_submit.lua script. This way the database autopopulates with active
>> users.
>>
>> -Paul Edmon-
>>
>> On 10/3/23 9:01 AM, Davide DelVento wrote:
>>
>> By increasing the slurmdbd verbosity level, I got additional information,
>> namely the following:
>>
>> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>> slurmdbd: debug: accounting_storage/as_mysql: as_mysql_jobacct_process_get_jobs: User xxxxxx has no associations, and is not admin, so not returning any jobs.
>>
>> where again xxxxxx is the POSIX ID of the user who's running the query, as
>> it appears in the slurmdbd logs.
>>
>> I suspect this is due to the fact that our user base is small enough (we
>> are a department HPC) that we don't need to use allocations and the like,
>> so I have not configured any associations (and have not even studied their
>> configuration, since when I was at another place, which did use
>> associations, someone else took care of Slurm administration).
>>
>> Anyway, I read the fantastic document by our own member at
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations
>> and in fact I have not even configured Slurm users:
>>
>> # sacctmgr show user
>>       User   Def Acct     Admin
>> ---------- ---------- ---------
>>       root       root Administ+
>> #
>>
>> So is that the issue? Should I just add all users? Any suggestions on the
>> minimal (but robust) way to do that?
>>
>> Thanks!
>>
>> On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento <davide.quan...@gmail.com> wrote:
>>
>>> Thanks Paul, this helps.
>>>
>>> I don't have any PrivateData line in either config file. According to the
>>> docs, "By default, all information is visible to all users", so this
>>> should not be an issue. I tried adding a "PrivateData=jobs" line to the
>>> conf files, just in case, but that didn't change the behavior.
>>>
>>> On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>
>>>> At least in our setup, users can see their own scripts by doing
>>>>
>>>> sacct -B -j JOBID
>>>>
>>>> I would make sure that the scripts are being stored, and check how you
>>>> have PrivateData set.
>>>>
>>>> -Paul Edmon-
>>>>
>>>> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>>>
>>>> I deployed the job_script archival and it is working; however, it can be
>>>> queried only by root.
>>>>
>>>> A regular user can run sacct -lj against any job (even those of other
>>>> users, and that's okay in our setup) with no problem. However, if they run
>>>>
>>>> sacct -j job_id --batch-script
>>>>
>>>> even against a job they own themselves, nothing is returned and I get a
>>>>
>>>> slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
>>>>
>>>> where xxxxxx is the POSIX ID of the user who's running the query, in the
>>>> slurmdbd logs.
>>>>
>>>> Both config files, slurmdbd.conf and slurm.conf, have no "permission"
>>>> setting. FWIW, we use LDAP.
>>>>
>>>> Is that the expected behavior, in that by default only root can see the
>>>> job scripts? I was assuming the users themselves should be able to debug
>>>> their own jobs... Any hint on what could be changed to achieve this?
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento <davide.quan...@gmail.com> wrote:
>>>>
>>>>> Fantastic, this is really helpful, thanks!
>>>>>
>>>>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>>>
>>>>>> Yes, it was later than that. If you are on 23.02 you are good. We've
>>>>>> been running with storing job_scripts on for years at this point, and
>>>>>> that part of the database only uses up 8.4G. Our entire database takes
>>>>>> up 29G on disk, so it's about 1/3 of the database. We also have
>>>>>> database compression, which helps with the on-disk size. Raw and
>>>>>> uncompressed, our database is about 90G. We keep 6 months of data in
>>>>>> our active database.
>>>>>>
>>>>>> -Paul Edmon-
>>>>>>
>>>>>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>>>>
>>>>>> Sorry for the duplicate e-mail in a short time: do you (or anyone) know
>>>>>> when the hashing was added? We were planning to enable this on 21.08,
>>>>>> but we then had to delay our upgrade to it. I'm assuming it was later
>>>>>> than that, as I believe that's when the feature was added.
>>>>>>
>>>>>> On Sep 28, 2023, at 13:55, Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>>>>
>>>>>> Thank you; we'll put in a feature request for improvements in that
>>>>>> area, and also thanks for the warning. I thought of that in passing,
>>>>>> but the real-world experience is really useful. I could easily see
>>>>>> wanting that stuff to be retained less often than the main records,
>>>>>> which is what I'd ask for.
>>>>>>
>>>>>> I assume that archiving, in general, would also remove this stuff,
>>>>>> since old jobs themselves will be removed?
>>>>>>
>>>>>> --
>>>>>> Ryan Novosielski - novos...@rutgers.edu
>>>>>> Sr. Technologist, Office of Advanced Research Computing, Rutgers
>>>>>>
>>>>>> On Sep 28, 2023, at 13:48, Paul Edmon <ped...@cfa.harvard.edu> wrote:
>>>>>>
>>>>>> Slurm should take care of it when you add it.
>>>>>>
>>>>>> As far as horror stories go: under previous versions our database
>>>>>> ballooned to be so massive that it actually prevented us from
>>>>>> upgrading, and we had to drop the columns containing the job_script and
>>>>>> job_env. This was back before Slurm started hashing the scripts so that
>>>>>> it would only store one copy of duplicate scripts. After that point we
>>>>>> found that the job_script data stayed at a fairly reasonable size, as
>>>>>> most users use functionally the same script each time. However, the
>>>>>> job_env data continued to grow like crazy, as there are variables in
>>>>>> our environment that change fairly consistently depending on where the
>>>>>> user is. Thus the job_envs ended up being too massive to keep around,
>>>>>> and we had to drop them. Frankly, we never really used them for
>>>>>> debugging. The job_scripts, though, are super useful and not that much
>>>>>> overhead.
>>>>>>
>>>>>> In summary, my recommendation is to store only job_scripts. job_envs
>>>>>> add too much storage for little gain, unless your job_envs are
>>>>>> basically the same for each user in each location.
>>>>>>
>>>>>> It should also be noted that there is no way to prune out job_scripts
>>>>>> or job_envs right now. So the only way to get rid of them if they get
>>>>>> large is to zero out the column in the table. You can ask SchedMD for
>>>>>> the MySQL command to do this, as we had to do it here for our job_envs.
>>>>>>
>>>>>> -Paul Edmon-
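For illustration of what "zeroing out the column" means in practice, a purely hypothetical sketch follows. The database, table, and column names (slurm_acct_db, mycluster_job_env_table, env_vars) are assumptions that vary by Slurm version and ClusterName; get the exact statement from SchedMD as Paul advises, and back up the database first.

    # Hypothetical sketch only: the names below are assumptions, not the
    # real schema. Back up first, then blank the stored environments.
    mysqldump slurm_acct_db > slurm_acct_db.backup.sql
    mysql slurm_acct_db -e "UPDATE mycluster_job_env_table SET env_vars = '';"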
>>>>>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>>>>
>>>>>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>>>>>> I only have
>>>>>>
>>>>>> AccountingStoreFlags=job_comment
>>>>>>
>>>>>> I now intend to add both
>>>>>>
>>>>>> AccountingStoreFlags=job_script
>>>>>> AccountingStoreFlags=job_env
>>>>>>
>>>>>> leaving the default 4MB value for max_script_size.
>>>>>>
>>>>>> Do I need to do anything on the DB myself, or will Slurm take care of
>>>>>> the additional tables if needed?
>>>>>>
>>>>>> Any comments/suggestions/gotchas/pitfalls/horror stories to share? I
>>>>>> know about the additional disk space and the potential load, and with
>>>>>> our resources and typical workload I should be okay with that.
>>>>>>
>>>>>> Thanks!
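One detail worth flagging for anyone copying the snippet above: in slurm.conf, AccountingStoreFlags takes a comma-separated list, and repeated keys are not merged, so the intended end state is a single line:

    AccountingStoreFlags=job_comment,job_script,job_env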