So EasyBuild + Lmod seems the best solution. I'll try. :)
Thank you all!
betta
2018-01-17 17:53 GMT+01:00 Christopher Samuel :
> On 18/01/18 03:50, Patrick Goetz wrote:
>
>> Can anyone shed some light on the situation? I'm very surprised that
>> a module script isn't just an explicit command that
Hi,
let's say I need to execute a python script with slurm. The script requires
a particular library installed on the system, like numpy.
If the library is not installed on the system, it is necessary to install
it on the master AND the nodes, right? Does this have to be done on each machine
separately or
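A minimal batch-script sketch for this situation, assuming the Python/numpy stack is provided as an EasyBuild/Lmod module and that test.py sits on a filesystem shared by the master and the nodes (the module name below is illustrative; check "module avail" on your site):

    #!/bin/bash
    #SBATCH --job-name=numpy-test
    #SBATCH --ntasks=1
    #SBATCH --output=numpy-test.%j.out

    # Load the site-provided Python stack instead of installing numpy by hand
    # on every node (module name is illustrative).
    module load Python/3.6.4-foss-2018a

    # test.py must be readable from the compute nodes, e.g. on an NFS share.
    srun python test.py

Submitted with sbatch, this avoids touching each node separately; the alternative is installing numpy the same way on the master and on every node (or into a shared prefix).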
Ciao Gennaro!
> > *NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN*
> > to
> > *NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
> >
> > Now, slurm works and the nodes are running. There is only one minor
> problem
> >
> > *error: Node node04 has low real_memory size (7984
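That message usually means slurmd on node04 registered with less RAM than the RealMemory value in slurm.conf. A hedged check (the output value is illustrative):

    # On node04, print the configuration slurmd actually detects:
    slurmd -C
    # e.g.  NodeName=node04 CPUs=16 ... RealMemory=7984 ...

    # If node04 really has less memory than the others, give it its own line,
    # with RealMemory at or below the detected value, for example:
    #   NodeName=node[01-03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN
    #   NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN
    # then restart the daemons or run "scontrol reconfigure".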
2018-01-16 13:25 GMT+01:00 Elisabetta Falivene :
>
>> It seems like the pidfile in systemd and slurm.conf are different. Check
>> if they are the same and if not adjust the slurm.conf pid files. That
>> should prevent systemd from killing slurm.
>>
> Emh, sorry, how can I do this?
> It seems like the pidfile in systemd and slurm.conf are different. Check
> if they are the same and if not adjust the slurm.conf pid files. That
> should prevent systemd from killing slurm.
>
Emh, sorry, how can I do this?
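A sketch of that check, assuming the Debian jessie slurm-llnl paths and unit names (adjust to your system):

    # What slurm.conf thinks the PID files are:
    grep -i pidfile /etc/slurm-llnl/slurm.conf

    # What systemd thinks they are:
    grep -i pidfile /lib/systemd/system/slurmd.service \
                    /lib/systemd/system/slurmctld.service

    # If they differ, point SlurmdPidFile / SlurmctldPidFile in slurm.conf at
    # the paths used by the unit files, then:
    systemctl daemon-reload
    systemctl restart slurmctld    # on the master; restart slurmd on the nodes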
> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene,
>
> slurmd: debug2: _slurm_connect failed: Connection refused
>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817:
>> Connection refused
>>
>
> This sounds like the compute node cannot connect back to
> slurmctld on the management node; you should check that the
> IP address
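Some hedged checks to run from a compute node, using the address and port from the error above (the config path is the Debian default, and "mycluster" is an illustrative controller hostname):

    # Is the controller reachable, and is the slurmctld port open?
    ping -c 1 192.168.1.1
    nc -zv 192.168.1.1 6817

    # Does slurm.conf on the node point at that machine and port?
    grep -iE 'controlmachine|controladdr|slurmctldport' /etc/slurm-llnl/slurm.conf

    # Does the controller's name resolve to 192.168.1.1 on the node?
    getent hosts mycluster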
Fenoy :
> Hi,
>
> you cannot start slurmd on the headnode. Try running the same command
> on the compute nodes and check the output. If there is any issue it should
> display the reason.
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Faliv
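A sketch of doing exactly that from the master, assuming passwordless ssh to the nodes (the node name is illustrative):

    # Run slurmd in the foreground with verbose debugging on a compute node:
    ssh node01 'slurmd -D -vvv'

    # Or see whether the service is already running and why it may have died
    # (the unit/init-script name may differ on jessie):
    ssh node01 'systemctl status slurmd'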
atch queueing system is due to hostname resolution"
>>
>>
>> On 15 January 2018 at 16:30, Elisabetta Falivene wrote:
>>
>>> slurmd -Dvvv says
>>>
>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>
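That fatal error appears when slurmd cannot match the node's hostname to any NodeName entry in slurm.conf. A hedged check (paths and names as in the rest of this thread):

    # The short hostname must match one of the NodeName entries (node01..node08):
    hostname -s
    grep -i '^NodeName' /etc/slurm-llnl/slurm.conf

    # For testing, the name can also be forced on the command line:
    slurmd -D -vvv -N node01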
2018-01-15 16:43 GMT+01:00 Carlos Fenoy :
> Are you trying to start slurmd on the headnode or a compute node?
>
> Can you provide the slurm.conf file?
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
> e.faliv...@ilabroma.com> wrote:
>
log or running "slurmd -Dvvv"
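The usual places to look, assuming the Debian default locations; the real paths are whatever SlurmdLogFile and SlurmctldLogFile in slurm.conf say:

    tail -n 50 /var/log/slurm-llnl/slurmd.log      # on a compute node
    tail -n 50 /var/log/slurm-llnl/slurmctld.log   # on the master
    slurmd -D -vvv                                 # foreground, verbose, as suggested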
>
>
> On Jan 15, 2018 06:42, "Elisabetta Falivene"
> wrote:
>
>> > Anyway I suggest updating the operating system to stretch and fixing your
>> > configuration under a more recent version of slurm.
>>
>> I
> Anyway I suggest updating the operating system to stretch and fixing your
> configuration under a more recent version of slurm.
I think I'll get to that soon :)
betta
2018-01-15 14:08 GMT+01:00 Gennaro Oliva :
> Ciao Elisabetta,
>
> On Mon, Jan 15, 2018 at 01:13:27PM +0100,
I did an upgrade from wheezy to jessie (automatically with a normal
dist-upgrade) on a cluster with 8 nodes (up, running and reachable) and
from slurm 2.3.4 to 14.03.9. I overcame some problems booting the kernel (thank
you very much to Gennaro Oliva, btw); now the system is running correctly
with kernel
> Ciao Elisabetta,
>
Ciao Gennaro! :)
>
> On Tue, Jan 09, 2018 at 01:40:19PM +0100, Elisabetta Falivene wrote:
> > The new kernel was installed during an upgrade from Debian 7 Wheezy to
> > Debian 8 Jessie. The upgrade went ok on the 8 nodes of the cluster, but
> not
>
>
> Let me guess: you're running multi-socket systems, and the kernel
> version behind that "3.16.0-4" label is 3.16.51-2, not 3.16.43-2?
>
Nope. On the nodes the version is 3.16.43-2, and on the master dpkg reports
that the kernel that is not loading is 3.16.43-2+deb8u5
> There seems to be an issue with
al ramdisk and make sure it has the modules you need.
>
> So boot the system in kernel 3.2 and then run:
> mkinitrd 3.16.0-4-amd64
>
>
> How was the kernel version 3.16.0-4-amd64 installed?
>
>
> On 9 January 2018 at 13:16, Elisabetta Falivene
> wrote:
>
>>
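On Debian jessie the mkinitrd tool suggested above is generally not available; a sketch of the equivalent step with initramfs-tools, run from the working 3.2 kernel, would be:

    # Rebuild the initramfs for the 3.16 kernel and refresh the GRUB entries
    # (use -c instead of -u if no initrd exists yet for that kernel):
    update-initramfs -u -k 3.16.0-4-amd64
    update-grub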
yboard so I'm
truly able to do anything.
2018-01-08 12:26 GMT+01:00 Markus Köberl :
> On Monday, 8 January 2018 11:39:32 CET Elisabetta Falivene wrote:
> > Here I am again.
> > In the end, I did the upgrade from debian 7 wheezy to debian 8 jessie in
> > order to update Slur
Here I am again.
In the end, I did the upgrade from debian 7 wheezy to debian 8 jessie in
order to update Slurm and solve some issues with it. It seemed it all went
well. Even the slurm problem seemed solved. Then I rebooted the machine and
the problems began. I can't boot the master anymore; it keeps returning
raised before the execution of the job. What does it mean?
Thank you, thank you, thank you!
2017-11-09 1:07 GMT+01:00 Lachlan Musicman :
> On 9 November 2017 at 10:54, Elisabetta Falivene
> wrote:
>
>> I am the admin and I have no documentation :D I'll try the third option
I am the admin and I have no documentation :D I'll try the third option.
Thank you very much
On Thursday, 9 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 10:35, Elisabetta Falivene wrote:
>
>> Wow, thank you. Is there a way to check which director
Wow, thank you. Is there a way to check which directories the master and the
nodes share?
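A few hedged ways to answer that, assuming the shared directories are exported over NFS from the master ("mycluster" is an illustrative hostname):

    # On the master: what is being exported
    exportfs -v                # or: cat /etc/exports

    # On a node: what is actually mounted over NFS
    mount -t nfs,nfs4          # or: df -hT | grep -i nfs

    # Ask the server from a node what it advertises (needs the showmount tool):
    showmount -e mycluster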
On Wednesday, 8 November 2017, Lachlan Musicman wrote:
> On 9 November 2017 at 09:19, Elisabetta Falivene wrote:
>
>> I'm getting this message anytime I try to exec
I'm getting this message anytime I try to execute any job on my cluster.
(node01 is the name of my first of eight nodes and is up and running)
Trying a simple python script:
*root@mycluster:/tmp# srun python test.py *
*slurmd[node01]: error: task/cgroup: unable to build job physical cores*
*/usr/b
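Some hedged things to check for that task/cgroup error, using the usual Debian paths (adjust to your site):

    # Which task plugin is configured, and what the cgroup settings are:
    grep -i taskplugin /etc/slurm-llnl/slurm.conf
    cat /etc/slurm-llnl/cgroup.conf

    # Is the cgroup hierarchy slurmd expects actually mounted on node01?
    ssh node01 'mount | grep cgroup'

While debugging, a common workaround is to set TaskPlugin=task/affinity (or task/none) in slurm.conf and restart the daemons, which sidesteps the cgroup code path.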