[slurm-dev] slurmd crashed on *some* nodes after "scontrol reconfigure"

2014-03-10 Thread Andy Riebs
I'm wondering if slurmd should retry when it sees this failure. None of the other nodes were apparently affected. Cheers Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Re: slurmd crashed on *some* nodes after "scontrol reconfigure"

2014-03-10 Thread Andy Riebs
Clarifying: that was 32 nodes out of a much larger number of nodes in the cluster. On 03/10/2014 03:20 PM, Andy Riebs wrote: I had edited slurm.conf to create a couple of new slurm partitions. In what appears to be a flukey coincidence, the slurmd daemons on 32 contiguous nodes

[slurm-dev] RE: slurmd crashed on *some* nodes after "scontrol reconfigure"

2014-03-11 Thread Andy Riebs
of the other nodes were apparently affected. Cheers Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Re: nodes shown as allocated but are drained

2014-03-12 Thread Andy Riebs
Hi Barbara, The output of "sinfo -l" and "sinfo -R" may be helpful to figure out what's going on. Andy On 03/12/2014 04:29 AM, Barbara Krasovec wrote: Hi! I noticed that some nodes are shown as allocated, but in fact no jobs are running on them. Therefore no jobs are assigned to the node

[slurm-dev] Re: pmi and hwloc

2014-07-15 Thread Andy Riebs
Is there a reason to have libpmi depend on hwloc for some architectures, even though it's not relevant for RHEL x86_64 clusters today? Andy On 07/13/2014 10:19 AM, Ralph Castain wrote: Just to clarify something: this only occurs when --with-pmi is provided. We *never* link direc

[slurm-dev] Re: pmi and hwloc

2014-07-16 Thread Andy Riebs
Thanks Danny! Andy On 07/15/2014 05:21 PM, Danny Auble wrote: In 14.11 it doesn't. On 07/15/2014 02:18 PM, Andy Riebs wrote: Is there a reason to have libpmi depend on hwloc for some architectures, even though it's not relevant for R

[slurm-dev] slurm-dev Slurm configuration questions, was Re:

2014-08-13 Thread Andy Riebs
Oops; the other essential guideline for getting help is to include a meaningful subject line! On 08/13/2014 10:12 AM, Andy Riebs wrote: Hi Erica, You'll find much of this discussion takes place frequently, most recently about a week ago. To get started,

[slurm-dev] Re:

2014-08-13 Thread Andy Riebs
Hi Erica, You'll find much of this discussion takes place frequently, most recently about a week ago. To get started, [*]It looks like Slurm can't find a mail program. Use $ scontrol show config | grep MailProg to see what program Slurm is looking for. [*]You probably

[slurm-dev] slurm-dev Problems getting Slurm running, was Re: Re: Re:

2014-08-15 Thread Andy Riebs
me=INFINITE State=UP Is there any clue of what may be wrong? Regards, 2014-08-13 11:23 GMT-03:00 Andy Riebs : Hi Erica, You'll find much of this discussion takes place frequently, most recently about a week ago.

[slurm-dev] Re: Intel MPI Performance inconsistency (and workaround)

2014-08-25 Thread Andy Riebs
Assuming this is a Gnu/Linux system, be sure that you have /etc/sysconfig/slurm on all nodes with the line ulimit -l unlimited That can account for differences in processing between system startup and subsequently restarting the daemons by hand. Andy On 08/21/2014 02:42 PM, Jesse Stroik w
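A quick way to confirm that the line is actually present on a given node might look like the sketch below. The helper name `has_memlock_line` is hypothetical; the file path is the one named in the message.

```shell
#!/bin/sh
# has_memlock_line FILE
# Hypothetical helper: succeeds if FILE contains a line reading
# "ulimit -l unlimited" (surrounding whitespace allowed, comments ignored).
has_memlock_line() {
    grep -Eq '^[[:space:]]*ulimit[[:space:]]+-l[[:space:]]+unlimited[[:space:]]*$' "$1" 2>/dev/null
}

# Path taken from the message above; run this on each node.
if has_memlock_line /etc/sysconfig/slurm; then
    echo "memlock line present"
else
    echo "memlock line missing"
fi
```

The check only inspects the file; the daemons still have to be restarted for a new limit to take effect.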

[slurm-dev] Re: starting slurmd only after GPUs are fully initialized

2014-08-29 Thread Andy Riebs
One way to work around this is to set the node definition(s) in slurm.conf with "State=DOWN". That way, manual intervention will be required when a node is rebooted, allowing the rest of the system to finish coming up. Andy On 08/29/2014 12:13 PM, Lev Givon wrote: I recently set up slurm 2

[slurm-dev] Re: starting OpenMPI job directly with srun

2014-09-23 Thread Andy Riebs
Lev, if you drop "mpiexec" from your command line, you should see the desired behaviour, i.e., $ srun -n X program (Also, be sure to recognize the difference between "-n" and "-N"!) Andy On 9/23/2014 2:49 PM, Lev Givon wrote: I have OpenMPI 1.8.2 compiled with PMI support enabled and slurm

[slurm-dev] Re: starting OpenMPI job directly with srun

2014-09-23 Thread Andy Riebs
Ahhh... try adding "--mpi=pmi" or "--mpi=pmi2" to your srun command. Andy p.s. If this fixes it, you might want to set the mpi default in slurm.conf appropriately. On 9/23/2014 3:07 PM, Lev Givon wrote: Received from Andy Riebs on Tue, Sep 23, 2014 at 02:57:49PM EDT:

[slurm-dev] Building Slurm 14.03.x with FreeIPMI 1.4.5?

2014-10-02 Thread Andy Riebs
ran out of time. It would help the world's FreeIPMI users if someone would fix this appropriately (which looks like it may need to have a check for which version of FreeIPMI is being used). Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are n

[slurm-dev] Re: reccomended software stack for development?

2014-10-27 Thread Andy Riebs
Hi Manuel, The first rule is "Keep it simple!" I suggest that you start by viewing this as 2 problems: 1. Learning how to work with Slurm 2. Learning how to work with clusters For learning how to work with Slurm, cloning a copy of the repo is a good start.  In the "Developers" note

[slurm-dev] Problem with PMI2 in Slurm 14.03.10

2014-11-21 Thread Andy Riebs
/troubleshooting this issue further. I have attached the test program to this message. We are using RHEL 6.5 x86_64 and Slurm 14.03.10 on this system. Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP #include #include

[slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10

2014-11-21 Thread Andy Riebs
pmi2_allgather). I think you'll see more detailed error report. 2. Optionally you can try to play with slurm fanout tree width (TreeWidth=10/50/100/whatever... configuration option). 2014-11-21 19:20

[slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10

2014-11-22 Thread Andy Riebs
P.S. FYI I work on PMIx plugin now and we will use SLURM communication infrastructure too, thus will be affected with the same problem. So I am quite interested in this effort. 2014-11-22 3:27 GMT+06:00 Andy Riebs : Thanks Artem! We'll keep y

[slurm-dev] Re: Problem with PMI2 in Slurm 14.03.10

2014-11-24 Thread Andy Riebs
several messages if it can't send it using just one. P.S. FYI I work on PMIx plugin now and we will use SLURM communication infrastructure too, thus will be affected with the same problem. So I am quite interested in this effort. 2014-11-22 3:27 GMT+06:00 Andy Riebs :

[slurm-dev] Re: Quick question on use of topology.conf

2014-12-13 Thread Andy Riebs
I agree. The topology.conf file is used to help Slurm place jobs on nodes that are as close to each other as possible wrt the primary interconnect. For example, we typically use InfiniBand for MPI and SHMEM, and Ethernet for Slurm communication, so what's important to us is the location of t

[slurm-dev] slurmstepd: mpi/pmi2: invalid kvs seq from srun

2014-12-21 Thread Andy Riebs
step. The environment: * RHEL 6.5 * Slurm 14.11.1 * SHMEM provided by OpenMPI 1.8.4rc1 * Berkeley UPC 2.18.0, built on OpenMPI The only thing unusual in slurm.conf is MpiDefault=pmi2 (which is probably obvious from the messages). Any ideas? Andy -- Andy Riebs Hewlett-Packard Company High Perfor

[slurm-dev] Re: slurmstepd: mpi/pmi2: invalid kvs seq from srun

2015-01-05 Thread Andy Riebs
s the case: http://bugs.schedmd.com/attachment.cgi?id=1490 Could you port/try this patch? 2014-12-21 21:45 GMT+06:00 Andy Riebs : We are sporadically seeing messages such as these when running on more than 1000

[slurm-dev] Re: slurmstepd: mpi/pmi2: invalid kvs seq from srun

2015-01-05 Thread Andy Riebs
Thanks Moe! (And Artem for the fix!) Andy On 01/05/2015 12:15 PM, je...@schedmd.com wrote: Yes, and we'll probably release it in the next week or two. Quoting Andy Riebs : Will this fix be in 14.11.3? (The system is in the customer's hands, so my ability to test it is limite

[slurm-dev] squeue SEGV error in 14.11.2

2015-01-06 Thread Andy Riebs
deNotAvail(Unavailable:noden[0692,0777,1788,1836]) Does this ring a bell? Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Trying to restart slurm from scratch

2015-01-06 Thread Andy Riebs
8.295] debug3: JobId=2 required nodes not avail Even a simple "srun -N1 hostname" hits this. We ARE using slurmdbd with mysql, but assuming that this would only impact accounting results. Any guidance on what we should be doing to reset the world? Andy -- Andy Riebs Hewlett-Pa

[slurm-dev] Problem with --nnodes, --ntasks, and --ntasks-per-node?

2015-02-24 Thread Andy Riebs
14.11.3, we see $ srun -N2 --ntasks-per-node=2 hostname hadesn02 hadesn01 $ Was this change intentional? Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Re: Problem with --nnodes, --ntasks, and --ntasks-per-node?

2015-02-24 Thread Andy Riebs
.11.4 $ srun -N2 --ntasks-per-node=2 hostname compute-1 compute-0 compute-1 compute-0 On Tue, Feb 24, 2015 at 8:51 AM, Andy Riebs wrote: When we moved from Slurm 14.11.2 to 14.11.3, a bunch of our Slurm scripts

[slurm-dev] Update re: Problem with --nnodes, --ntasks, and --ntasks-per-node?

2015-02-24 Thread Andy Riebs
t confused. Again, apologies for the misinformation in the earlier note. Andy On 02/24/15 09:12, Andy Riebs wrote: Serves me right for always running a version behind -- thanks for the info! Andy On 02/24/15 09:10, CB wrote: Re: [slurm-dev] Problem with --nnodes, --n

[slurm-dev] Re: ownership of output files

2015-03-02 Thread Andy Riebs
slurmd has to be run as root on the compute nodes in order to be able to create jobs as other users. On 03/02/2015 10:51 AM, Slurm User wrote: Re: ownership of output files By the way, we do not want to run anything as "root", obviously On Mon, Mar 2, 2015 at 5:53 AM,

[slurm-dev] Re: query only submission times

2015-03-17 Thread Andy Riebs
Hi Scott, The quick & dirty way would be to take your current command, sort on the submit date, then remove the items that you don't want. For example, $ sacct -X -S ... | sort -k3 >sorted_by_submit.log $ # now edit the log file A better long term/reusable solution would be to use a
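The quick and dirty approach can be made a little more robust by asking sacct for pipe-delimited output and sorting on an explicit field. This is only a sketch: the helper name `sort_by_submit`, the chosen field list, and the `START` placeholder are assumptions, not part of the original message.

```shell
#!/bin/sh
# sort_by_submit: read pipe-delimited records on stdin and order them by
# the third field. With ISO-style timestamps (YYYY-MM-DDTHH:MM:SS) a plain
# lexical sort is also a chronological sort.
sort_by_submit() {
    sort -t '|' -k3,3
}

# Usage sketch (needs a working Slurm accounting setup; START is a
# placeholder for whatever start time you already pass to -S):
#   sacct -X -n -P --format=JobID,JobName,Submit -S "$START" \
#       | sort_by_submit > sorted_by_submit.log
```

Putting the submit time in a known column avoids depending on which column `sort -k3` happens to hit in the default layout.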

[slurm-dev] Re: Slurmdbd configuration

2015-03-18 Thread Andy Riebs
Hi, Some suggestions and questions... Do you have Slurm running without accounting? If not, you might find it easier to take it a step at a time, getting a simple cluster running first, and then adding accounting. Can you "ping slurm"? If not, are you sure that you are setting up Slu

[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Andy Riebs
Michael, Try running "slurmctld -D" which should result in output telling you what's going wrong. Andy On 03/19/2015 01:15 PM, Michael Kit Gilbert wrote: Newb question about plugins Sorry for the basic question, but I am new to slurm and am having some basic problems wi

[slurm-dev] Re: Newb question about plugins

2015-03-19 Thread Andy Riebs
d out, so does it have to be specified to work? Also, since I'm root and root is the owner of the /var/spool/slurm directory, I'm not sure why it's telling me the permissions are incorrect... O

[slurm-dev] Re: Newb question about plugins

2015-03-20 Thread Andy Riebs
enter a time limit and that is what this plugin appears to do.  On Thu, Mar 19, 2015 at 1:46 PM, Andy Riebs wrote: OK, we (or at least I) have reached the point where you

[slurm-dev] Re: very strange behavior with slurmd post upgrade

2015-04-08 Thread Andy Riebs
Wow! Using Slurm to update the software on the cluster? And I'll guess that you frequently ski Tuckerman's Ravine? :-) First, there is the possibility that Slurm is entirely innocent here, and that some other package's update procedure is wiping out things like context files (especially if

[slurm-dev] Re: default memory limit (14.11.5)?

2015-04-09 Thread Andy Riebs
Assuming this is a Gnu/Linux system, try running $ srun bash -c "ulimit -a" to see if your compute nodes have unexpected limits. You may need to add something to /etc/sysconfig/slurm to allow them to match the user environment on your login node.  (If Slurm is s

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Andy Riebs
It has been our experience that it is necessary to rebuild OpenMPI for each major slurm release, such as transitioning from Slurm 14.03.x to 14.11.x. Andy On 04/16/2015 07:49 AM, Ralph Castain wrote: Re: [slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2015-04-16 Thread Andy Riebs
Recompiling openmpi is sufficient, unless something else has changed in openmpi that might require your programs to be rebuilt. On 04/16/2015 09:17 AM, Bjørn-Helge Mevik wrote: Andy Riebs writes: It has been our experience that it is necessary to rebuild OpenMPI for each major slurm

[slurm-dev] Re: prevent slurm from parsing the full script

2015-04-21 Thread Andy Riebs
Hendryk, what sbatch command line options are you using? How are you determining that job 1 got 2 tasks? I just tried the following script, and it correctly ran just 1 task: $ cat test.sh #!/bin/bash #SBATCH --ntasks=1 srun hostname #sbatch --ntasks=4 ## end of script $ sbatch

[slurm-dev] Re: prevent slurm from parsing the full script

2015-04-21 Thread Andy Riebs
Never mind; when I changed "#sbatch" to the correct "#SBATCH", I got 4 tasks. According to the man page, this is a bug. For now, I like Magnus's suggestion :-) On 04/21/2015 08:21 AM, Andy Riebs wrote: Hendryk, what sbatch command line options are

[slurm-dev] Re: Usage of the "deleted" column in clusterName_job_table table

2015-05-12 Thread Andy Riebs
Hi, In exchange for Slurm automatically handling database setup (and reconfiguration when you upgrade to a newer version of Slurm), you have to allow it to do whatever it wants to do with its tables. Rather than adding (or reusing) an existing column in a table, I would suggest creating

[slurm-dev] Re: Usage of the "deleted" column in clusterName_job_table table

2015-05-13 Thread Andy Riebs
. However it will a bit difficult to support it in our case. Still, we need to know answer to our question about "deleted" column in job_table - do you maybe know how it is used? 2015-05-12 16:03 GMT+03:00 Andy Riebs : Hi, In e

[slurm-dev] scancel job_id.step_id fails in Slurm 14.11.3

2015-05-27 Thread Andy Riebs
NAME PARTITION USER TIME NODELIST 18727.0 sleep all riebs 0:58 beehive[09-10] Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Re: scancel job_id.step_id fails in Slurm 14.11.3

2015-05-27 Thread Andy Riebs
Perfect! On 05/27/2015 01:13 PM, Moe Jette wrote: * Changes in Slurm 14.11.7 == -- Fix scancel bug which could return an error on attempt to signal a job step. Quoting Andy Riebs : Using Slurm 14.11.3 on an RHEL 6.5 x86_64 system, scancel has lost the

[slurm-dev] State of the accounting database after a controller failure

2015-06-02 Thread Andy Riebs
=2 CoresPerSocket=12 ThreadsPerCore=2 State=DOWN # # Partitions # #PartitionName=all Nodes=node[01-16] Default=yes Shared=Exclusive MaxTime=620 State=UP PartitionName=all Nodes=node[01-12] Default=yes Shared=Exclusive MaxTime=620 State=UP -- Andy Riebs Hewlett-Packard Company High Performance

[slurm-dev] Re: old version of scontrol, sacct keeps resurrecting

2015-06-04 Thread Andy Riebs
Having never seen your system, here are a couple of shots in the dark: 1. Do you have a cron job that might be starting Slurm, perhaps for a simple failure-restart solution? 2. If "ps --forest" works on your system, it may suggest that some other process is responsible for running the old

[slurm-dev] Re: Installed slurm but cant make it run

2015-06-05 Thread Andy Riebs
For some reason, your Slurm was built to look for slurm.conf, its configuration file, in the odd path /opt/extlib/slurm/14.11.7/openmpi/1.8.1/gcc/4.9.0/etc/ -- in any case, it's not finding it there. Have you created your slurm.conf file yet? On 06/05/2015 07:07 AM, Aleksejs Fomins wrote:

[slurm-dev] Re: Installed slurm but cant make it run

2015-06-05 Thread Andy Riebs
ed to create the slurm.conf file. Could you explain how to create it or where to read? Aleksejs On 05/06/15 13:56, Andy Riebs wrote: For some reason, your Slurm was built to look for slurm.conf, its configuration file, in the odd path /opt/extlib/slurm/14.11.7/openmpi/1.8.1/gcc/4.9.0/etc/

[slurm-dev] Restated: slurmctld makes odd decisions about jobs that completed while it was down, was: State of the accounting database after a controller failure

2015-06-08 Thread Andy Riebs
ctld comes back online. Any thoughts? Andy On 06/02/2015 12:16 PM, Andy Riebs wrote: In short, sacct reports "NODE_FAIL" for jobs that were running when the Slurm control node fails. Apologies if this has been fixed recently; I'm still running with slurm 14.11.3 on RHEL 6.5.

[slurm-dev] Re: Problem running OpenMPI over slurm

2015-06-17 Thread Andy Riebs
Paul, Try # echo "ulimit -l unlimited" >/etc/sysconfig/slurm on each of your compute nodes, and then restart slurm on the compute nodes. FWIW, I explicitly set --with-pmi= when I build OpenMPI. Andy On 06/17/2015 10:38 AM, Wiegand, Paul wrote: Greetings We have just started experimenting

[slurm-dev] Re: Problem with MPI applications at the end of the job

2015-08-11 Thread Andy Riebs
To help figure out what is going on, please send the following (to the list, not to me!): [*]Your Slurm configuration file (with private data like IP addresses and node names removed) [*]Your ./configure command lines for [*]Slurm [*]Mpich [*]OpenMPI [*]The

[slurm-dev] Re: Problem with MPI applications at the end of the job

2015-08-11 Thread Andy Riebs
where i replaced private data. Slurm was installed from the RPM's The mpich and openmpi ./configure was set with default options with only --prefix=/sotware/storage/path If any additional information needed please ask me. Best Wishes, Igor On 11/08/15 15:42, Andy Riebs wrote: To help figure

[slurm-dev] Re: PMI2 in Slurm 14.11.8 ?

2015-09-02 Thread Andy Riebs
Chris, Have you specified mpi=pmi2, either in your command line or in slurm.conf? Andy On 09/02/2015 12:52 PM, Moe Jette wrote: "srun --mpi=list" is listing the "mpi" type plugins that it finds the installed based upon logic similar to the "ls" command below. Perhaps you have an old mpi/pm

[slurm-dev] Re: slurm/munge issues

2015-09-14 Thread Andy Riebs
Do you have the same munge key installed on the head node and compute nodes? On 09/14/2015 05:59 PM, Jan Dettmer wrote: Hi, I am having the issue that all nodes in the cluster are listed as down. The network connections work fine and I can ssh to the nodes In slurmctld.log, I get a large num
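One way to answer that question without copying the key around is to compare digests of the key file from each node. The sketch below assumes `sha256sum` is available and that the key lives at `/etc/munge/munge.key`; both the path and the helper names are assumptions, so adjust them for your installation.

```shell
#!/bin/sh
# key_digest FILE: print the SHA-256 digest of FILE (hypothetical helper).
key_digest() {
    sha256sum < "$1" | cut -d ' ' -f 1
}

# same_digest A B: succeed if both digest strings are non-empty and equal.
same_digest() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage sketch (as root, on the head node and on each compute node;
# the key path is an assumption):
#   key_digest /etc/munge/munge.key
# then compare the printed values, e.g. same_digest "$head" "$node".
```

Comparing digests rather than the key itself keeps the secret off the terminal and out of shell history.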

[slurm-dev] What follows PMI-2?

2015-09-24 Thread Andy Riebs
hese cooperating, competing, or "only just realized that they could be cooperating" activities? Andy -- Andy Riebs New email address! andy.ri...@hpe.com Hewlett-Packard Company High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HP

[slurm-dev] Re: What follows PMI-2?

2015-09-24 Thread Andy Riebs
capitalization of the last letter. Anyway, the PMIx effort is already being integrated in several popular RMs, including Slurm, so I imagine it’s a moot point except for possibly confusing people searching publications. Ralph On Sep 24, 2015, at 1:42 PM, Andy Riebs wrote: From the

[slurm-dev] slurm-dev summary, was Re: What follows PMI-2?

2015-09-25 Thread Andy Riebs
rifications and corrections gratefully accepted! Andy On 09/24/2015 06:49 PM, Andy Riebs wrote: Ralph, Artem, and Sourav, thanks for the explanation! Andy On 09/24/2015 04:52 PM, Ralph Castain wrote: Hi Andy. I honestly have no idea why those guys did that :-). We’ve kno

[slurm-dev] Re: Share free cpus

2016-01-29 Thread Andy Riebs
Slurmd needs to run as root so that it can start jobs for any of the cluster users. On 01/29/2016 08:10 AM, Benjamin Redling wrote: Am 18.01.2016 um 18:42 schrieb Benjamin Redling: Am 18.01.2016 um 01:39 schrieb Jordan Willis: CompleteWait=60 SlurmdUser=root ^^^

[slurm-dev] Re: new node not connecting to slurm

2016-03-02 Thread Andy Riebs
The first thing I would check would be if the system clocks are in sync, or at least reasonably close. Andy On 03/02/2016 05:09 PM, Berryhill, Jerome wrote: I am running slurm on a small cluster, with the control on a machine running RHEL7.1. Slurm has been working f

[slurm-dev] Just curious: What problem does gethostname_short() solve?

2016-03-04 Thread Andy Riebs
anation of why they should need to do so. Any thoughts? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE

[slurm-dev] Re: Installing slurm with Mellanox OFED

2016-03-14 Thread Andy Riebs
I agree with Chris; I've used Mellanox OFED for a number of years, and Slurm even longer :-) My standard build recipe for Slurm is $ ./configure --prefix=$INSTALL_DIR --with-munge --enable-pam $ make $ make install $ make install-contrib Hope this helps! Andy On 03/11/2016 12:11 AM, Christophe

[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-12 Thread Andy Riebs
Actually, if sinfo is responding, then slurmctld is running! I would check the slurmctld log on your head node and the slurmd log on the compute node(s) to look for hints regarding communication problems. For example, can you ping back and forth between the 2 nodes? Do you have a firewall ru

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Andy Riebs
Is your /mirror directory shared across your cluster? On 04/14/2016 06:56 AM, Husen R wrote: Re: [slurm-dev] Re: Slurm Checkpoint/Restart example Hello Danny, I have tried to restart using "scontrol checkpoint restart " but it doesn't work. In addition, "

[slurm-dev] Re: Slurm Plugin initialization failed

2016-04-21 Thread Andy Riebs
Start by looking for diagnostic messages in slurmd.log. To debug one's own Slurm code, learn to love and use $ slurmd -D -vv Andy On 04/21/2016 11:50 AM, Tanner Satchwell wrote: Slurm Plugin initialization failed I am trying to write a new slurm task plugin to c

[slurm-dev] RE: Ghost jobs

2016-04-25 Thread Andy Riebs
Sorry for the late comments here, but just in case it's relevant: What version of Slurm are you running? There was a bug in Slurm 2.1 where submitting a large number of jobs in short order resulted in what we called "phantom jobs" -- Slurm would periodically stop running jobs, reporting that

[slurm-dev] Re: using gdb to debug slurm-15.08?

2016-04-27 Thread Andy Riebs
[Apologies to the list if someone has already responded to Michael, but I don't recall seeing it.] Hi Michael, By far the easiest way to debug Slurm problems is by doing your own, local build, outside the context of RPM. You can find appropriate ./configure arguments (at least to start)

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Andy Riebs
Hi, The one problem that I see in your description is minor, and probably not significant: the MPI ports parameter was needed for very old versions of Open MPI, IIRC. To help debug your problems, please respond to this list with [*]What command did you use to invoke your program?

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Andy Riebs
COMPUTE NODES NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN 2016-04-30 16:40 GMT+01:00 Andy Riebs : Hi, The one problem that I see in your description

[slurm-dev] Re: MPI/OpenMPI send receive not working

2016-04-30 Thread Andy Riebs
so I think pmi support is installed. And the hello world program is working, would it if it wasn't installed ? 2016-04-30 18:04 GMT+01:00 Andy Riebs : For Slurm, after the "make install", did you do a "make install-contrib" (which

[slurm-dev] cpu_freq_cpu_avail message: Should this be an error or a warning?

2016-05-26 Thread Andy Riebs
to report this situation, but can we downgrade it to a warning? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE

[slurm-dev] Re: cpu_freq_cpu_avail message: Should this be an error or a warning?

2016-05-27 Thread Andy Riebs
Excellent -- nice patch! Andy On 05/27/2016 03:30 AM, Janne Blomqvist wrote: On 2016-05-26 20:04, Andy Riebs wrote: On systems with a static cpu frequency selected in the BIOS, such as "static high performance mode," we see the message "error: _cpu_freq_cpu_avail: Coul

[slurm-dev] Slurm Elastic Computing

2016-07-13 Thread Andy Riebs
Is the "elastic" mode development active? The page at <https://slurm.schedmd.com/elastic_computing.html> was last updated over a year ago, and we were wondering about the current status of the project (and the failing tests mentioned on that page). Andy -- Andy Riebs an

[slurm-dev] Re: Jobs allocated but don't run

2016-08-31 Thread Andy Riebs
Re: [slurm-dev] Jobs allocated but don't run Do you have a firewall running on the master or any of the compute nodes? On 08/31/2016 12:17 AM, James Andrew Venning wrote: Just to add, squeue returns JOBID PARTITION NAME USER ST

[slurm-dev] Re: Want to contribute, where should I start?

2016-09-14 Thread Andy Riebs
Want to contribute, where should I start? Hi Felipe, I point to these comments whenever someone asks that question :-) From 1: o tinkerghost I recommend starting by

[slurm-dev] Re: Want to contribute, where should I start?

2016-09-15 Thread Andy Riebs
[I don't know what happened to the formatting on the first version of this response, I hope this comes out better!] The following note is what I offer as guidance for anyone trying to get a start in FOSS development: From

[slurm-dev] Passing binding information

2016-10-27 Thread Andy Riebs
Hi All, We are trying to figure out the best way for Open MPI (and others?) to pick up the user's CPU binding request. Does this work? Let’s suppose we look for SLURM_CPU_BIND: * if it includes the word “none”, then we know the user specified that they don’t want us to bind *
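The first rule proposed above (treat a value containing the word "none" as a request not to bind) can be sketched as a small shell test. The function name `cpu_bind_is_none` is hypothetical, and only the "none" rule is covered because the remaining rules are cut off in the snippet.

```shell
#!/bin/sh
# cpu_bind_is_none VALUE
# Hypothetical helper (not part of Slurm or Open MPI): succeeds if VALUE,
# typically the contents of SLURM_CPU_BIND, contains the word "none".
cpu_bind_is_none() {
    case "${1:-}" in
        *none*) return 0 ;;
        *)      return 1 ;;
    esac
}

if cpu_bind_is_none "${SLURM_CPU_BIND:-}"; then
    echo "user asked for no binding"
else
    echo "no explicit 'none' request"
fi
```

An unset or empty variable falls through to the second branch, so the launcher can apply its own default there.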

[slurm-dev] Re: Passing binding information

2016-10-31 Thread Andy Riebs
Does anyone have any recent experience with this code who can answer the questions? On 10/27/2016 01:57 PM, Andy Riebs wrote: Hi All, We are trying to figure out the best way for Open MPI (and others?) to pick up the user's CPU binding request. Does this work?

[slurm-dev] Change in srun buffered output?

2017-01-05 Thread Andy Riebs
functionality? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-dev] Re: Change in srun buffered output?

2017-01-10 Thread Andy Riebs
ybody? Cheers, Manuel 2017-01-05 12:00 GMT-05:00 Andy Riebs : Hi Y'all, Historically, our users often use srun from their console windows so that they can watch the progress of their jobs. When we transitioned from 16.05.0 to 16.05.7, we discovered that jobs submitted from the console with

[slurm-dev] Re: Scheduling jobs according to the CPU load

2017-03-19 Thread Andy Riebs
Re: [slurm-dev] Re: Scheduling jobs according to the CPU load Ketiw, Slurm is really good at the incredibly complex job of managing multi-node (tens, hundreds, thousands, ...) workloads where thousands or hundreds of thousands of cooperating threads expect to be able to correspond w

[slurm-dev] Re: Fwd: job requeued in held state

2017-04-03 Thread Andy Riebs
kers Caelum Research Corp. Linux Server and Network Administrator NOAA GLERL -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-dev] Re: jobs killed after 24h though walltime is 7 days

2017-04-26 Thread Andy Riebs
epd: error: *** STEP 114498.5 ON n523301 CANCELLED AT 2017-04-26T10:29:08 DUE TO TIME LIMIT ***" The cluster is running 16.05.10. Any ideas on the reason of this or suggestions on how to debug are welcome. Regards, Uwe -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High

[slurm-dev] Re: Compute nodes drained or draining

2017-05-17 Thread Andy Riebs
UPPMAX, Uppsala University, Sweden -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-dev] Re: Can't seem to configure for core-based allocation

2017-06-23 Thread Andy Riebs
get slurm to start more than one 4xcore job on a 12-core machine... What am I possibly missing? Thanks! -Mehmet -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the

[slurm-dev] Federated Slurm problem

2017-08-24 Thread Andy Riebs
tname srun: job 67111076 queued and waiting for resources srun: job 67111076 has been allocated resources node01 node02 $ It seems to me that the sibling clusters should be offered the task even if it won't fit on the current cluster. Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Pac

[slurm-dev] Re: slurmstepd error

2017-09-15 Thread Andy Riebs
Actually, it looks like /tmp is missing on the compute nodes? On 09/15/2017 11:12 AM, Doug Meyer wrote: Re: [slurm-dev] slurmstepd error Your path is either erroneous in your submission or inaccessible by the client. Welcome to slurm! Doug On Sep 15, 2017 8:10 AM, "Gyro Funch"

[slurm-dev] Re: slurmstepd error

2017-09-15 Thread Andy Riebs
No such file or directory slurm Thanks. -gyro On 9/15/2017 9:19 AM, Andy Riebs wrote: > Actually, it looks like /tmp is missing on the compute nodes? > > On 09/15/2017 11:12 AM, Doug Meyer wrote: >> Re: [slurm-dev] slurmstepd error

[slurm-dev] Re: slurm with PMI2

2017-09-29 Thread Andy Riebs
FWIW, we include these options when we build mvapich2:     --with-pmi=pmi2 \     --with-pm=slurm \     --with-slurm=/opt/slurm  \     --enable-slurm=yes" It feels like there is some redundancy there, but it works! Andy On 09/29/2017 12:12 PM, Sebastian Eastham wrote: Dear Slu

[slurm-dev] Re: slurm with PMI2

2017-09-29 Thread Andy Riebs
m not clear on how to install these. If there is any way to install them without reinstalling slurm then that would be ideal, but if a reinstall is necessary then that too can be scheduled. Regards, Seb *From:*Andy Riebs [mailto:andy.ri...@hpe.com] *Sent:* Friday, September 29, 2017 12:30 PM

[slurm-dev] Re: slurm with PMI2

2017-10-02 Thread Andy Riebs
and upgrade to slurm v17 whenever we have a quiet weekend on the cluster. Regards, Seb *From:*Andy Riebs [mailto:andy.ri...@hpe.com] *Sent:* Friday, September 29, 2017 12:54 PM *To:* slurm-dev *Subject:* [slurm-dev] Re: slurm with PMI2 Hi Seb, Is there any chance that Slurm 14.11 put pmi2.h

[slurm-dev] Re: Setting up Environment Modules package

2017-10-04 Thread Andy Riebs
We've had good luck putting the modules on an nfs-mounted file system. Along with that, suggest creating /etc/profile.d/zmodule.sh that contains     module use /modules then symlink /etc/profile.d/zmodule.csh to it, and set this up on all login and compute nodes. Andy On 10/04/2017 12:10

[slurm-dev] Intentional change from 16.05.8 to 17.11?

2017-10-05 Thread Andy Riebs
ys Hi! Thread 1 says Hello! Thread 2 says Hi! Thread 1 says Hello! Thread 2 says Hi! $ If I add "-n1" to the command line, I get back to the previous behavior. Is this an intentional change? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Com

[slurm-dev] Re: cannot start slurm daemon - no slurm.sh in /etc/init.d

2012-02-09 Thread Andy Riebs
find /etc/init.d/slurm or /usr/sbin/slurmd,slurmctld daemons. Please help! -- Andy Riebs Hewlett-Packard Company High Performance Computing +1-786-263-9743 My opinions are not necessarily those of HP

[slurm-dev] Re: Getting Started with SLURM

2012-02-28 Thread Andy Riebs
if(count_job_host(hostname[rndm_node])< THRESHOLD) > > { > > allocate_job(hostname[rndm_node],argv) > > break; > > } > > j++; >

[slurm-dev] Re: Problem with MPICH2 communication between nodes

2012-07-02 Thread Andy Riebs
imary/backup) at controlnode/(NULL) are UP/DOWN -- Andy Riebs Hewlett-Packard Company High Performance Computing +1-786-263-9743 My opinions are not necessarily those of HP

[slurm-dev] subverting getlogin()?

2012-08-24 Thread Andy Riebs
are root? The program: --- #include #include int main() { printf("getlogin() returns \"%s\"\n", getlogin()); return 0; } -- Andy -- Andy Riebs Hewlett-Packard Company High Performance Computing +1-786-263-9743 My opinions are not necessarily those of HP

[slurm-dev] Re: subverting getlogin()?

2012-08-24 Thread Andy Riebs
me since I doubt slurm adds a utmp entry. > > However, checking utmp for a running job might be interesting, and > a workaround might involve setting a utmp entry for slurm jobs > via a plugin. > > mark > > Andy Riebs writes: >> The following trivial program returns

[slurm-dev] Re: subverting getlogin()?

2012-08-24 Thread Andy Riebs
for getlogin(3). > > So if you're lucky, perhaps adding the above line to the > slurm pam stack will "fix" this problem ;-) > > mark > > >> mark >> >> Andy Riebs writes: >>> The following trivial program returns "root" when r

[slurm-dev] Re: subverting getlogin()?

2012-08-27 Thread Andy Riebs
source for getlogin(3). > > So if you're lucky, perhaps adding the above line to the > slurm pam stack will "fix" this problem ;-) > > mark > > >> mark >> >> Andy Riebs writes: >>> The following trivial program returns "root"

[slurm-dev] Re: Output and Error Files

2012-08-28 Thread Andy Riebs
(xxx.out) and error (xxx.err) file In the directory it was run. Also, sview reporting that the job is running but it's actually not running. Yinka Adeosun Unix Administrator Vistronix, Inc Contractor to US EPA Chesapeake Bay Program Office 410-295-1323 -- Andy Riebs Hewlett-Packard
