Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Kevin Buckley

On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:


Feel free to try it out and let us know how it works for you!

https://github.com/nauhpc/job_archive


So Chris,

testing it out quickly, and dirtily, using an sbatch with a here document, viz.:

$ sbatch -p testq  <

Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Lech Nieroda
Hi Chris,

you'll find the patch for our version attached. Integrate it as you see fit; 
personally, I'd recommend a branch, since the two-log-file approach isn't really 
reconcilable with the idea of having separate job files accessible to the 
respective owner.
All filenames and directories are defined with "#define" directives, as it was 
more convenient to have them all in one place.

Kind regards,
Lech



job_archive.patch.gz
Description: GNU Zip compressed data


> On 15.06.2019 at 00:47, Christopher Benjamin Coffey wrote:
> 
> Hi Lech,
> 
> I'm glad that it is working out well with the modifications you've put in 
> place! Yes, there can be a huge volume of jobscripts out there. That's a 
> pretty good way of dealing with it! We've backed up 1.1M jobscripts since 
> its inception 1.5 months ago and aren't too worried yet about the inode/space 
> usage. We haven't settled on what we will do to keep the archive clean 
> yet. My thought was:
> 
> - keep two months (directories) of jobscripts for each user, leaving the 
> jobscripts intact for easy user access
> - tar up the month directories that are older than two months
> - keep four tarred months
> 
> That way there would be 6 months of jobscript archive to match our 6-month 
> job accounting retention in the Slurm DB.
> 
> I'd be interested in your version, however; please do send it along! And 
> please keep in touch with how everything goes!
> 
> Best,
> Chris
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
> On 6/14/19, 2:22 AM, "slurm-users on behalf of Lech Nieroda" 
>  lech.nier...@uni-koeln.de> wrote:
> 
>Hello Chris,
> 
>we've tried out your archiver and adapted it to our needs; it works quite 
> well.
>The changes:
>- we get lots of jobs per day, ca. 3k-5k, so storing them as individual 
> files would waste too many inodes and 4k blocks. Instead, everything is 
> written into two log files (job_script.log and job_env.log), with a per-line 
> prefix identifying the job. In this way one can easily grep 
> and cut the corresponding job script or environment. Long-term storage and 
> compression are handled by logrotate with standard compression settings
>- the parsing part can fail to produce a username, so we have introduced 
> a custom environment variable that stores the username and can be read 
> directly by the archiver 
>- most of the program’s output, including debug output, is handled by the 
> logger and stored in a jobarchive.log file with an appropriate timestamp
>- the logger uses a va_list to make multi-argument log one-liners possible
>- signal handling is reduced to the debug-level increase/decrease
>- file handling is mostly relegated to HelperFn; directory trees are now 
> created automatically
>- the binary header of the env file and the binary footer of the 
> script file are filtered out, so the resulting files are recognized as ASCII 
> files
> 
>If you are interested in our modified version, let me know.
> 
>Kind regards,
>Lech
> 
> 
>> On 09.05.2019 at 17:37, Christopher Benjamin Coffey wrote:
>> 
>> Hi All,
>> 
>> We created a Slurm job script archiver which you may find handy. We 
>> initially attempted to do this through Slurm with a slurmctld prolog, but it 
>> really bogged the scheduler down. This new solution is a custom C++ program 
>> that uses inotify to watch for job scripts and environment files to show up 
>> in /var/spool/slurm/hash.* on the head node. When they do, the program 
>> copies the jobscript and environment out to a local archive directory. The 
>> program is multithreaded and has a dedicated thread watching each hash 
>> directory. The program is super fast and lightweight and has no side effects 
>> on the scheduler. By default, the program applies ACLs to the archived job 
>> scripts so that only the owner of the jobscript can read the files. Feel 
>> free to try it out and let us know how it works for you!
>> 
>> https://github.com/nauhpc/job_archive
>> 
>> Best,
>> Chris
>> 
>> —
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167



Re: [slurm-users] Rename account or move user from one account to another

2019-06-17 Thread Henkel, Andreas
Hi Christoph,

I think the only way is to modify the database directly. I don't know whether 
Slurm likes it; personally, I would try it on a copy of the DB with a separate 
slurmdbd to see if the reported values are still correct. 

Best regards,

Andreas Henkel

> On 14.06.2019 at 16:16, Sam Gallop (NBI) wrote:
> 
> Hi Christoph,
> 
> I suspect that the answer to both of these is no. When I tried to modify an 
> account I got ...
> 
> $ sudo sacctmgr modify account where name=user1 set account=newaccount1
> Can't modify the name of an account
> 
> Also, sacctmgr can only reset a user's RawUsage, as it only supports a 
> value of 0.
> 
> While not exactly what you want, you could add the user to the new account, 
> change the DefaultAccount, and then remove the user from the old account (a 
> sketch of that sequence appears below). However, it doesn't retain the 
> user's historical usage, which I guess is ultimately what you want.
> 
> ---
> Sam Gallop
> 
> -Original Message-
> From: slurm-users  On Behalf Of 
> Christoph Brüning
> Sent: 12 June 2019 10:58
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] Rename account or move user from one account to another
> 
> Hi everyone,
> 
> is it somehow possible to move a user between accounts together with his/her 
> usage? I.e. transfer the historical resource consumption from one association 
> to another?
> 
> In a related question: is it possible to rename an account?
> 
> While I could, of course, tamper with the underlying MariaDB, it does not 
> exactly appear to be a convenient or elegant solution...
> 
> Best,
> Christoph
> 
> 
> --
> Dr. Christoph Brüning
> Universität Würzburg
> Rechenzentrum
> Am Hubland
> D-97074 Würzburg
> Tel.: +49 931 31-80499
> 
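A sketch of the workaround Sam describes above, with placeholder user/account 
names (the exact sacctmgr syntax may vary between Slurm versions, so verify 
against the sacctmgr man page before running):

$ sacctmgr add user name=user1 account=newaccount1
$ sacctmgr modify user where name=user1 set defaultaccount=newaccount1
$ sacctmgr remove user where name=user1 and account=oldaccount1

As noted above, this moves the association but not the historical usage.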


Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
Hi Lech,

I'm glad that it is working out well with the modifications you've put in 
place! Yes, there can be a huge volume of jobscripts out there. That's a pretty 
good way of keeping it organized! We've backed up 1.1M jobscripts since its 
inception 1.5 months ago and aren't too worried yet about the inode/space 
usage. We haven't settled on what we will do to keep the archive clean yet. 
My thought was:

- keep two months (directories) of jobscripts for each user, leaving the 
jobscripts intact for easy user access
- tar up the month directories that are older than two months
- keep four tarred months

That way there would be 6 months of jobscript archive to match our 6-month job 
accounting retention in the Slurm DB.
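A rough cron-style sketch of that rotation, assuming a hypothetical archive 
layout of $ARCHIVE/<user>/<YYYY-MM>/ (the layout, paths, and month naming are 
assumptions for illustration, not what the tool necessarily produces):

#!/bin/bash
# Tar up per-user month directories older than two months, keep four tarballs.
ARCHIVE=/var/spool/job_archive                 # assumed archive root
cutoff=$(date -d "2 months ago" +%Y-%m)        # months older than this get tarred
for dir in "$ARCHIVE"/*/*/; do
    month=$(basename "$dir")                   # e.g. 2019-04
    user_dir=$(dirname "$dir")
    if [[ "$month" < "$cutoff" ]]; then
        tar -C "$user_dir" -czf "$user_dir/$month.tar.gz" "$month" && rm -rf "$dir"
    fi
done
# Drop tarballs beyond the four most recent, keeping roughly 6 months in total.
for user_dir in "$ARCHIVE"/*/; do
    ls -1 "$user_dir"*.tar.gz 2>/dev/null | sort | head -n -4 | xargs -r rm -f
done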

I'd be interested in your version, however; please do send it along! And please 
keep in touch with how everything goes!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/14/19, 2:22 AM, "slurm-users on behalf of Lech Nieroda" 
 
wrote:

Hello Chris,

we've tried out your archiver and adapted it to our needs; it works quite 
well.
The changes:
- we get lots of jobs per day, ca. 3k-5k, so storing them as individual 
files would waste too many inodes and 4k blocks. Instead, everything is written 
into two log files (job_script.log and job_env.log), with a per-line prefix 
identifying the job. In this way one can easily grep and 
cut the corresponding job script or environment (see the sketch after this 
list). Long-term storage and compression are handled by logrotate with standard 
compression settings
- the parsing part can fail to produce a username, so we have introduced 
a custom environment variable that stores the username and can be read 
directly by the archiver 
- most of the program’s output, including debug output, is handled by the 
logger and stored in a jobarchive.log file with an appropriate timestamp
- the logger uses a va_list to make multi-argument log one-liners possible
- signal handling is reduced to the debug-level increase/decrease
- file handling is mostly relegated to HelperFn; directory trees are now 
created automatically
- the binary header of the env file and the binary footer of the 
script file are filtered out, so the resulting files are recognized as ASCII files
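
A minimal sketch of pulling one job's script back out of the combined log, 
assuming the per-line prefix is the numeric job id followed by a space (the 
actual prefix format is whatever the patch defines, so adjust the field count 
accordingly):

jobid=1234567   # hypothetical job id
grep "^${jobid} " job_script.log | cut -d' ' -f2- > "job_${jobid}.sh"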

If you are interested in our modified version, let me know.

Kind regards,
Lech


> On 09.05.2019 at 17:37, Christopher Benjamin Coffey wrote:
> 
> Hi All,
> 
> We created a Slurm job script archiver which you may find handy. We 
initially attempted to do this through Slurm with a slurmctld prolog, but it 
really bogged the scheduler down. This new solution is a custom C++ program 
that uses inotify to watch for job scripts and environment files to show up 
in /var/spool/slurm/hash.* on the head node. When they do, the program copies 
the jobscript and environment out to a local archive directory. The program is 
multithreaded and has a dedicated thread watching each hash directory. The 
program is super fast and lightweight and has no side effects on the scheduler. 
By default, the program applies ACLs to the archived job scripts so that only 
the owner of the jobscript can read the files (a rough sketch of the idea 
appears below). Feel free to try it out and let us know how it works for you!
> 
> 
https://github.com/nauhpc/job_archive
> 
> Best,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
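
The announcement quoted above describes the approach: an inotify watcher picks 
up new job script and environment files under the slurmctld state directory and 
copies them into a per-user archive with owner-only ACLs. A minimal bash sketch 
of that idea follows; it is an illustration only (inotify-tools, the 
/var/spool/slurm layout, and the USER= lookup are all assumptions), and the 
real tool is the C++ program in the repository:

#!/bin/bash
SPOOL=/var/spool/slurm            # assumed StateSaveLocation
ARCHIVE=/var/spool/job_archive    # assumed archive root

inotifywait -m -r -e close_write --format '%w%f' "$SPOOL"/hash.* |
while read -r path; do
    [[ $path == */script || $path == */environment ]] || continue
    jobdir=$(basename "$(dirname "$path")")            # e.g. job.01234
    # Naive owner lookup: read USER= from the captured environment file
    # (the real archiver is more careful; as Lech notes, parsing can fail).
    owner=$(tr '\0' '\n' 2>/dev/null < "$(dirname "$path")/environment" \
            | sed -n 's/^USER=//p' | head -n 1)
    [[ -n $owner ]] || continue
    mkdir -p "$ARCHIVE/$owner"
    dest="$ARCHIVE/$owner/$jobdir.$(basename "$path")"
    cp "$path" "$dest"
    chmod 600 "$dest"                                  # restrict to the archive owner first
    setfacl -m "u:$owner:r" "$dest"                    # then grant the job owner read access
done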






Re: [slurm-users] Slurm Jobscript Archiver

2019-06-17 Thread Christopher Benjamin Coffey
Thanks Kevin, we'll put a fix in for that.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/17/19, 12:04 AM, "Kevin Buckley"  wrote:

On 2019/05/09 23:37, Christopher Benjamin Coffey wrote:

> Feel free to try it out and let us know how it works for you!
> 
> 
https://github.com/nauhpc/job_archive

So Chris,

testing it out quickly, and dirtily, using an sbatch with a here document, 
viz.:

$ sbatch -p testq  <

[slurm-users] salloc not able to run sbatch script

2019-06-17 Thread Mahmood Naderan
Hi,
May I know why the user is not able to run an interactive QEMU job?
According to the configuration I made, everything should be fine.
Isn't that right?

[valipour@rocks7 ~]$ salloc run_qemu.sh
salloc: Granted job allocation 1209
salloc: error: Unable to exec command "run_qemu.sh"
salloc: Relinquishing job allocation 1209
[valipour@rocks7 ~]$ cat run_qemu.sh
#!/bin/bash
#SBATCH --nodelist=compute-0-1
#SBATCH --cores=8
#SBATCH --mem=40G
#SBATCH --partition=QEMU
#SBATCH --account=q20_8
USERN=`whoami`
qemu-system-x86_64 -m 4 -cpu Opteron_G5 -smp cores=8 -hda
win7_sp1_x64.img -boot c -usbdevice tablet -enable-kvm -device
e1000,netdev=host_files -netdev user,net=
10.0.2.0/24,id=host_files,restrict=off,smb=/home/$USERN,smbserver=10.0.2.4
[valipour@rocks7 ~]$ sacctmgr list association
format=user,account,partition,grptres | grep valipour
  valipour  local
  valipour  q20_8   qemu cpu=8,mem=40G
[valipour@rocks7 ~]$ rocks run host compute-0-1 "qemu-system-x86_64 -h |
head -n 1"
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
QEMU emulator version 3.1.0
[valipour@rocks7 ~]$
[valipour@rocks7 ~]$ ls -l run_qemu.sh
-rwxr-xr-x 1 valipour valipour 387 Jun 17 21:38 run_qemu.sh



Regards,
Mahmood


[slurm-users] openmpi / UCX / srun

2019-06-17 Thread Hidas, Dean
Hello,

I am trying to use UCX with Slurm/PMIx and run into the error below.  The 
following works using mpirun, but what I hoped would be the srun equivalent 
fails.  Is there some flag or configuration I might be missing for Slurm?

Works fine:
mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx 
./hello

does not work:
srun -n 100 ./hello
slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: 
ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] 
mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] 
mpi/pmix: ERROR: Unable to connect to 1
slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 
***

The configurations for PMIx, Open MPI, Slurm, and UCX are the following (on 
Debian 8):
pmix 3.1.2
./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2

openmpi 4.0.1
./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 
--with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 
--with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 
--with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external 
--disable-dlopen --without-verbs

slurm 19.05.0
./configure --enable-debug --enable-x11 
--with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm 
--prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1

ucx 1.5.1
./configure --enable-optimizations --disable-logging --disable-debug 
--disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1

Any advice is much appreciated.

Best,

-Dean
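
For readers who hit the same error, a hedged sketch of checks commonly tried in 
this situation. The srun options exist in current Slurm; the environment 
variable names and the device value below are assumptions to verify against 
your Slurm/UCX documentation.

# confirm which PMIx plugin srun selects by default
$ srun --mpi=list
$ srun --mpi=pmix -n 100 ./hello

# pin UCX transports/devices explicitly (mlx5_0:1 is a placeholder device)
$ UCX_TLS=rc,ud,sm,self UCX_NET_DEVICES=mlx5_0:1 srun --mpi=pmix -n 100 ./hello

# disable the PMIx plugin's UCX direct-connect path to isolate the failure
$ SLURM_PMIX_DIRECT_CONN_UCX=false srun --mpi=pmix -n 100 ./hello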



Re: [slurm-users] salloc not able to run sbatch script

2019-06-17 Thread mercan

Hi;

Try:

salloc ./run_qemu.sh


Regards;

Ahmet M.
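
A brief sketch of why this helps: salloc execs its argument directly, so the 
command must either be on $PATH or be given as an explicit path such as 
./run_qemu.sh. Note also (general Slurm behaviour, not specific to this script) 
that salloc does not read #SBATCH directives the way sbatch does, so the 
resource options may need to be repeated on the command line, for example:

$ salloc --partition=QEMU --account=q20_8 --nodelist=compute-0-1 \
         --mem=40G -c 8 ./run_qemu.sh
# -c 8 approximates the script's --cores=8 request; adjust as needed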


On 17.06.2019 at 20:28, Mahmood Naderan wrote:

Hi,
May I know why the user is not able to run an interactive QEMU job?
According to the configuration I made, everything should be 
fine. Isn't that right?


[valipour@rocks7 ~]$ salloc run_qemu.sh
salloc: Granted job allocation 1209
salloc: error: Unable to exec command "run_qemu.sh"
salloc: Relinquishing job allocation 1209
[valipour@rocks7 ~]$ cat run_qemu.sh
#!/bin/bash
#SBATCH --nodelist=compute-0-1
#SBATCH --cores=8
#SBATCH --mem=40G
#SBATCH --partition=QEMU
#SBATCH --account=q20_8
USERN=`whoami`
qemu-system-x86_64 -m 4 -cpu Opteron_G5 -smp cores=8 -hda 
win7_sp1_x64.img -boot c -usbdevice tablet -enable-kvm -device 
e1000,netdev=host_files -netdev 
user,net=10.0.2.0/24,id=host_files,restrict=off,smb=/home/$USERN,smbserver=10.0.2.4 
 

[valipour@rocks7 ~]$ sacctmgr list association 
format=user,account,partition,grptres | grep valipour

  valipour  local
  valipour  q20_8   qemu cpu=8,mem=40G
[valipour@rocks7 ~]$ rocks run host compute-0-1 "qemu-system-x86_64 -h 
| head -n 1"
Warning: untrusted X11 forwarding setup failed: xauth key data not 
generated

QEMU emulator version 3.1.0
[valipour@rocks7 ~]$
[valipour@rocks7 ~]$ ls -l run_qemu.sh
-rwxr-xr-x 1 valipour valipour 387 Jun 17 21:38 run_qemu.sh



Regards,
Mahmood