On 01.02.2009, at 12:43, Jeff Squyres wrote:

Could the nodes be running out of shared memory and/or temp filesystem space?
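
One way to check that on a compute node is a quick statvfs() on the filesystem holding the session directory; this is just a hypothetical standalone check, with /tmp used as an example path:

#include <stdio.h>
#include <sys/statvfs.h>

/* Report free space in the filesystem holding a given directory;
 * the path below is only an example, substitute the real session dir. */
int main(void)
{
    const char *dir = "/tmp";
    struct statvfs vfs;

    if (statvfs(dir, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    unsigned long long free_bytes =
        (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    printf("%s: %llu MB available\n", dir, free_bytes / (1024ULL * 1024ULL));
    return 0;
}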

I still see this issue, and it happens only from time to time. But although SGE's qrsh is used automatically, the more severe problem is that on the slave nodes the orted daemons are pushed into daemon land and are no longer bound to the sge_shepherd:

 3173     1 /usr/sge/bin/lx24-x86/sge_execd
 3431     1 orted --daemonize -mca ess env -mca orte_ess_jobid 81199104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 811
 3432  3431  \_ /home/reuti/mpihello
 3433  3431  \_ /home/reuti/mpihello

-- Reuti



On Jan 29, 2009, at 3:05 PM, Rolf vandeVaart wrote:


I have not seen this before. I assume that for some reason the shared memory transport layer cannot create the file it uses for communicating within a node. Open MPI then selects some other transport (TCP, openib) to communicate within the node, so the program runs fine.
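
For reference, the failing call boils down to an open()/mmap() of a backing file inside the per-job session directory. Here is a minimal sketch of that pattern (not Open MPI's actual code; the path and size are made up) showing how errno=2 (ENOENT) comes out when part of the directory is missing:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of a shared-memory backing file, in the spirit of
 * mca_common_sm_mmap_init(); path and size are hypothetical. */
int main(void)
{
    const char *path = "/tmp/968.1.all.q/shared_mem_pool.example";
    size_t size = 64 * 1024 * 1024;

    /* If any component of the session directory is missing, open()
     * fails with errno == ENOENT (2), matching the reported error. */
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "mmap_init sketch: open %s failed with errno=%d (%s)\n",
                path, errno, strerror(errno));
        return 1;
    }

    if (ftruncate(fd, (off_t)size) != 0) {
        perror("ftruncate");     /* e.g. ENOSPC if the temp filesystem is full */
        close(fd);
        return 1;
    }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Local ranks on the same node would map this same file to share memory. */
    munmap(seg, size);
    close(fd);
    return 0;
}

When that open fails, the sm BTL is simply excluded for the job and intra-node traffic falls back to another transport, which matches the jobs still completing.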

The code has not changed that much from 1.2 to 1.3, but it is a little different. Let me see if I can reproduce the problem.

Rolf

Mostyn Lewis wrote:
Sort of ditto, but with SVN revision 20123 (and earlier):

e.g.

[r2250_46:30018] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_46_0/25682/1/shared_mem_pool.r2250_46 failed with errno=2
[r2250_63:05292] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_63_0/25682/1/shared_mem_pool.r2250_63 failed with errno=2
[r2250_57:17527] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_57_0/25682/1/shared_mem_pool.r2250_57 failed with errno=2
[r2250_68:13553] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_68_0/25682/1/shared_mem_pool.r2250_68 failed with errno=2
[r2250_50:06541] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_50_0/25682/1/shared_mem_pool.r2250_50 failed with errno=2
[r2250_49:29237] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_49_0/25682/1/shared_mem_pool.r2250_49 failed with errno=2
[r2250_66:19066] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_66_0/25682/1/shared_mem_pool.r2250_66 failed with errno=2
[r2250_58:24902] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_58_0/25682/1/shared_mem_pool.r2250_58 failed with errno=2
[r2250_69:27426] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_69_0/25682/1/shared_mem_pool.r2250_69 failed with errno=2
[r2250_60:30560] mca_common_sm_mmap_init: open /tmp/45139.1.all.q/openmpi-sessions-mostyn@r2250_60_0/25682/1/shared_mem_pool.r2250_60 failed with errno=2

errno=2 is ENOENT, i.e. the sm backing file was not found.

10 of them across 32 nodes (8 cores per node (2 sockets x quad-core))
"Apparently harmless"?

DM

On Tue, 27 Jan 2009, Prentice Bisbal wrote:

I just installed OpenMPI 1.3 with tight integration for SGE. Version 1.2.8 was working just fine for several months in the same arrangement.

Now that I've upgraded to 1.3, I get the following errors in my standard
error file:

mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node09.aurora_0/21400/1/shared_mem_pool.node09.aurora failed with errno=2
[node23.aurora:20601] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node23.aurora_0/21400/1/shared_mem_pool.node23.aurora failed with errno=2
[node46.aurora:12118] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node46.aurora_0/21400/1/shared_mem_pool.node46.aurora failed with errno=2
[node15.aurora:12421] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node15.aurora_0/21400/1/shared_mem_pool.node15.aurora failed with errno=2
[node20.aurora:12534] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node20.aurora_0/21400/1/shared_mem_pool.node20.aurora failed with errno=2
[node16.aurora:12573] mca_common_sm_mmap_init: open /tmp/968.1.all.q/openmpi-sessions-prentice@node16.aurora_0/21400/1/shared_mem_pool.node16.aurora failed with errno=2

I've tested 3-4 different times, and the number of hosts that produce this error varies, as does which hosts produce it. My program seems to run fine, but it's just a simple "Hello, World!" program. Any ideas? Is this a bug in 1.3?


-- Prentice
--
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ


--
Jeff Squyres
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
