Re: [slurm-users] Getting current memory size of a job

2019-04-07 Thread Mahmood Naderan
Hi again and sorry for the delay



>When I was at Swinburne we asked for this as an enhancement here:
>https://bugs.schedmd.com/show_bug.cgi?id=4966


The output of sstat shows the following error

# squeue -j 821
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    821   EMERALD g09-test shakerza  R   21:07:18      1 compute-0-0
# sstat -j 821 --format="MaxVMSize,AveRSS,AveVMSize,MaxRSS"
 MaxVMSize     AveRSS  AveVMSize     MaxRSS
---------- ---------- ---------- ----------
sstat: error: no steps running for job 821



> The /proc/<pid>/cgroup file indicates to which cgroups a process is
> assigned, so:

While the job is running, I ssh to the node and see:

[root@compute-0-0 11220]# cat cgroup
11:hugetlb:/
10:devices:/system.slice/slurmd.service
9:cpuset:/
8:freezer:/
7:cpuacct,cpu:/system.slice/slurmd.service
6:net_prio,net_cls:/
5:blkio:/system.slice/slurmd.service
4:memory:/system.slice/slurmd.service
3:pids:/
2:perf_event:/
1:name=systemd:/system.slice/slurmd.service
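
For reference, a rough sketch of what reading the memory numbers out of that
cgroup could look like, assuming cgroup v1 with the memory controller mounted
under /sys/fs/cgroup/memory (the PID is just the one from the prompt above):

# pick the memory line out of /proc/<pid>/cgroup and read the controller files
PID=11220
CGPATH=$(awk -F: '$2 ~ /(^|,)memory(,|$)/ {print $3}' /proc/$PID/cgroup)
cat /sys/fs/cgroup/memory/$CGPATH/memory.usage_in_bytes      # current usage, bytes
cat /sys/fs/cgroup/memory/$CGPATH/memory.max_usage_in_bytes  # peak usage, bytes

Note that in the output above the memory line points at
/system.slice/slurmd.service rather than a per-job path such as
/slurm/uid_<uid>/job_<jobid>/..., so these files would cover everything under
slurmd's slice; a per-job memory cgroup would normally only appear if the
task/cgroup plugin is constraining memory (ConstrainRAMSpace=yes in cgroup.conf).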

But I didn't understand the rest of the message. Can you explain more?



> We use a simple web interface which
> is ok for our small cluster.
I moved the files to the following paths and restarted httpd

[root@rocks7 var]# ls -l www/html/simple-web/
total 44
-rwxrwxr-x 1 mahmood mahmood 20729 Apr  7 20:52 code.js
-rwxrwxr-x 1 mahmood mahmood  1406 Apr  7 20:52 favicon.ico
-rwxrwxr-x 1 mahmood mahmood   911 Apr  7 20:52 index.html
-rwxrwxr-x 1 mahmood mahmood  1557 Apr  7 20:52 settings.js
-rwxrwxr-x 1 mahmood mahmood  1322 Apr  7 20:52 style.css
-rwxrwxr-x 1 mahmood mahmood    72 Apr  7 20:52 userlist.txt
[root@rocks7 var]# ls -l bin/
total 44
-rwxrwxr-x 1 mahmood mahmood  2058 Apr  7 20:52 myfuncs.py
-rwxrwxr-x 1 mahmood mahmood  1462 Apr  7 20:52 settings.py
-rwxrwxr-x 1 mahmood mahmood  3084 Apr  7 20:52 slurm_cluster_stats.py
-rwxrwxr-x 1 mahmood mahmood  1635 Apr  7 20:52 slurm_pending_tasks.py
-rwxrwxr-x 1 mahmood mahmood   146 Apr  7 20:52 slurm_report_usage_from.sh
-rwxrwxr-x 1 mahmood mahmood   743 Apr  7 20:52 slurm_report_usagepercent_from.sh
-rwxrwxr-x 1 mahmood mahmood    95 Apr  7 20:52 slurm_report_years_from.sh
-rwxrwxr-x 1 mahmood mahmood 12617 Apr  7 20:52 slurm_task_tracker.py

However, when I browse to 10.1.1.1/simple-web, a black screen is shown.

Regards,
Mahmood


[slurm-users] MPI job termination

2019-04-07 Thread Mahmood Naderan
Hi,
A multinode MPI job terminated with the following messages in the log file

=--=
   JOB DONE.
=--=
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
STOP 2
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
STOP 2
STOP 2
--
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[9801,1],8]
  Exit code:2



Although it says the job is done, I would like to know whether there was any
abnormal termination.
Moreover, I cannot tell whether there is a problem with the input files, for
example whether the calculations diverged; this error message does not make
that clear.
Any idea?
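
One way to see how Slurm itself recorded the termination, independent of the
mpirun messages, is sacct; a minimal example (the job id is a placeholder):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode

A run whose first failing rank returned 2 would typically show up as FAILED
with ExitCode 2:0, whereas a clean run shows COMPLETED with 0:0.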

Regards,
Mahmood


Re: [slurm-users] MPI job termination

2019-04-07 Thread Reuti


> On 07.04.2019 at 19:15, Mahmood Naderan wrote:
> 
> Hi,
> A multinode MPI job terminated with the following messages in the log file
> 
> =--=
>JOB DONE.
> =--=
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> STOP 2
> STOP 2
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[9801,1],8]
>   Exit code:2
> 
> 
> 
> Although it says the job is done, I would like to know whether there was
> any abnormal termination.
> Moreover, I cannot tell whether there is a problem with the input files,
> for example whether the calculations diverged; this error message does not
> make that clear.
> Any idea?

This seems to be unrelated to SLURM.

I assume you are using Open MPI. With Open MPI, *all* processes must exit with
an exit code of zero, otherwise an error in the application is assumed – even
if MPI_Finalize() was called beforehand and not MPI_ABORT(). This is of course
a point of discussion: at least rank zero should be able to return a non-zero
exit code to the calling script (IMO). I suggest raising this question on the
Open MPI mailing list.

I don't know what the MPI standard says about it, but with Intel MPI it's 
different: an exit after MPI_Finalize() is treated as a normal program 
termination. The highest value returned by any of the processes will be 
returned to the job script and no application error is raised. Hence one can 
act on this return code properly in the job script.
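
As a rough illustration of acting on that return code in the job script
(plain bash; the application and file names are placeholders, and under
Open MPI the status checked below is the one mpirun itself hands back):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

mpirun ./my_mpi_app input.dat > output.log
rc=$?

if [ "$rc" -ne 0 ]; then
    # non-zero status: flag it so the user knows to inspect output.log
    echo "mpirun exited with status $rc, check output.log" >&2
fi
exit "$rc"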

-- Reuti


Re: [slurm-users] Getting current memory size of a job

2019-04-07 Thread Chris Samuel
On Sunday, 7 April 2019 10:13:49 AM PDT Mahmood Naderan wrote:

> The output of sstat shows the following error
> 
> # squeue -j 821
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>     821   EMERALD g09-test shakerza  R   21:07:18      1 compute-0-0
> # sstat -j 821 --format="MaxVMSize,AveRSS,AveVMSize,MaxRSS"
>  MaxVMSize     AveRSS  AveVMSize     MaxRSS
> ---------- ---------- ---------- ----------
> sstat: error: no steps running for job 821

This indicates that you are not using "srun" to launch the MPI application, 
and also that the MPI stack you are using does not know about Slurm and so 
doesn't know to start itself correctly when you run with mpirun.
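
As a minimal sketch (the application name and task counts are placeholders,
and it assumes the MPI library was built with Slurm/PMI support), launching
the application with srun inside the batch script creates a job step that
sstat can report on:

#!/bin/bash
#SBATCH --job-name=g09-test
#SBATCH --nodes=1
#SBATCH --ntasks=8

srun ./my_mpi_app input.dat > output.log

While such a step is running, something like
sstat -j 821.0 --format=MaxRSS,MaxVMSize,AveRSS,AveVMSize
should then return real numbers instead of "no steps running".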

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA