Hi Ralph, Reuti,

I've just observed the same issue without specifying -np. Please find attached the ps -elfax output from the compute nodes and some SGE-related information.

Regards,
Eloi

-----Original message-----
From: Ralph Castain <r...@open-mpi.org>
Sent: Wed 04-11-2012 02:25 pm
Subject: Re: [OMPI users] sge tight integration leads to bad allocation
To: Open MPI Users <us...@open-mpi.org>

On Apr 11, 2012, at 6:20 AM, Reuti wrote:

> On 11.04.2012 at 04:26, Ralph Castain wrote:
>
>> Hi Reuti
>>
>> Can you replicate this problem on your machine? Can you try it with 1.5?
>
> No. It's also working fine in 1.5.5 in some tests. I even forced an uneven
> distribution by limiting the slots setting for some machines in the queue
> configuration.

Thanks - that confirms what I've been able to test. It sounds like it is something in Eloi's setup, but I can't fathom what it would be - the allocations all look acceptable. I'm stumped. :-(
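For context, the slot counts that the gridengine RAS reports in the log below come straight from SGE's PE_HOSTFILE. Here is a minimal Python sketch of that parsing, assuming the standard four-column PE_HOSTFILE format ("hostname slots queue processor_range"); the file contents and the all.q queue name are hypothetical stand-ins for the real file at the path shown in the log:

    # Sketch only: hypothetical PE_HOSTFILE contents for job 1882,
    # standing in for the real file at
    # /opt/sge/default/spool/charlie/active_jobs/1882.1/pe_hostfile
    pe_hostfile = """\
    charlie.fft 6 all.q@charlie.fft <NULL>
    carl.fft 3 all.q@carl.fft <NULL>
    """

    # One line per host: hostname, slot count, queue instance, processor range
    slots = {}
    for line in pe_hostfile.splitlines():
        host, nslots, _queue, _procrange = line.split()
        slots[host] = int(nslots)

    print(slots)  # {'charlie.fft': 6, 'carl.fft': 3}, matching the RAS log below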
[charlie:23181] Warning: could not find environment variable "ACTRAN_DEBUG"
[charlie:23181] ras:gridengine: JOB_ID: 1882
[charlie:23181] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1882.1/pe_hostfile
[charlie:23181] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:23181] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3

======================   ALLOCATED NODES   ======================

 Data for node: Name: charlie  Launch id: -1  Arch: ffc91200  State: 2
   Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
   Daemon: [[39528,0],0]  Daemon launched: True
   Num slots: 6  Slots in use: 0
   Num slots allocated: 6  Max slots: 0
   Username on node: NULL
   Num procs: 0  Next node_rank: 0

 Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
   Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
   Daemon: Not defined  Daemon launched: False
   Num slots: 3  Slots in use: 0
   Num slots allocated: 3  Max slots: 0
   Username on node: NULL
   Num procs: 0  Next node_rank: 0

=================================================================

 Map generated by mapping policy: 0200
   Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
   Num new daemons: 1  New daemon starting vpid 1
   Num nodes: 2

 Data for node: Name: charlie  Launch id: -1  Arch: ffc91200  State: 2
   Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
   Daemon: [[39528,0],0]  Daemon launched: True
   Num slots: 6  Slots in use: 5
   Num slots allocated: 6  Max slots: 0
   Username on node: NULL
   Num procs: 5  Next node_rank: 5
   Data for proc: [[39528,1],0]  Pid: 0  Local rank: 0  Node rank: 0  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],2]  Pid: 0  Local rank: 1  Node rank: 1  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],4]  Pid: 0  Local rank: 2  Node rank: 2  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],6]  Pid: 0  Local rank: 3  Node rank: 3  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],8]  Pid: 0  Local rank: 4  Node rank: 4  State: 0  App_context: 0  Slot list: NULL

 Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
   Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
   Daemon: [[39528,0],1]  Daemon launched: False
   Num slots: 3  Slots in use: 4
   Num slots allocated: 3  Max slots: 0
   Username on node: NULL
   Num procs: 4  Next node_rank: 4
   Data for proc: [[39528,1],1]  Pid: 0  Local rank: 0  Node rank: 0  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],3]  Pid: 0  Local rank: 1  Node rank: 1  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],5]  Pid: 0  Local rank: 2  Node rank: 2  State: 0  App_context: 0  Slot list: NULL
   Data for proc: [[39528,1],7]  Pid: 0  Local rank: 3  Node rank: 3  State: 0  App_context: 0  Slot list: NULL

[charlie:23187] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[charlie:23188] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[charlie:23185] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[charlie:23186] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[charlie:23189] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
[charlie:23189] mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)
warning:regcache incompatible with malloc
[charlie:23187] mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)
[carl:00563] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[carl:00562] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[carl:00561] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[carl:00564] mca: base: component_find: unable to open /opt/openmpi-1.4.4/lib/openmpi/mca_btl_gm: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
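Note the pattern in the map above: charlie holds ranks 0, 2, 4, 6, 8 and carl.fft holds ranks 1, 3, 5, 7, so carl.fft carries 4 procs against its 3 allocated slots (Slots in use: 4 vs. Num slots allocated: 3). That is exactly what a by-node round robin over all 9 launched ranks produces when per-node slot limits are ignored. A minimal Python sketch of that arithmetic (illustration only, not Open MPI's actual mapper code):

    # Sketch only: by-node round robin of the 9 launched ranks over the
    # two allocated nodes, ignoring per-node slot limits, reproduces the
    # interleaved map printed above.
    nodes = {"charlie": 6, "carl.fft": 3}  # slots as reported from the PE_HOSTFILE
    nranks = sum(nodes.values())           # 9 ranks were started (no -np given)

    names = list(nodes)
    placement = {name: [] for name in names}
    for rank in range(nranks):
        placement[names[rank % len(names)]].append(rank)

    for name, ranks in placement.items():
        flag = "  <-- oversubscribed" if len(ranks) > nodes[name] else ""
        print(f"{name}: ranks {ranks} -> {len(ranks)} procs on {nodes[name]} slots{flag}")
    # charlie: ranks [0, 2, 4, 6, 8] -> 5 procs on 6 slots
    # carl.fft: ranks [1, 3, 5, 7] -> 4 procs on 3 slots  <-- oversubscribed

A by-slot placement that honors the allocation would instead fill charlie's 6 slots first and put only the remaining 3 ranks on carl.fft.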
Attachments:
  job1882.sh       (application/shellscript)
  pselfax.carl     (binary data)
  pselfax.charlie  (binary data)
  qstat-gt         (binary data)
  qstat-j1882      (binary data)