Hi,

On 17.12.2013 at 22:32, Brandon Turner wrote:

> I've been struggling with this problem for a few days now and am out of 
> ideas. I am submitting a job using TORQUE on a beowulf cluster. One step 
> involves running mpiexec, and that is where this error occurs. I've found 
> some similar other queries in the past: 
> 
> http://www.open-mpi.org/community/lists/users/att-11378/attachment
> 
> http://www.open-mpi.org/community/lists/users/2013/09/22608.php
> 
> http://www.open-mpi.org/community/lists/users/2009/11/11129.php
> 
> I'm new to using Open MPI, so much of this is very new to me. However, it does 
> not seem that my /tmp folder is full as far as I can tell. I've tried 
> reassigning the temporary directory using the MCA attribute (i.e. mpiexec 
> --mca orte_tmpdir_base /home/pathA/pathB process argument1 argument2 
> argument3), but that was unsuccessful as well. Similarly, if thousands of 
> sub-directories are being created, I have no idea where those would be if 
> this is some ext3 violation issue. It's worth noting that when I submit this 
> job, it works on some occasions and not on others. I suspect it has 
> something to do with the nodes that I am assigned and some property of 
> certain nodes that is an issue. 
> 
> It never used to have this problem until a few days ago, and now I mostly 
> can't get it to work except on a few occasions, which makes me think that 
> perhaps it is a node-specific issue. Any thoughts or suggestions would be 
> much appreciated! 

a) /tmp is not your personal directory but a machine-wide one, so it might be
full on this particular node.

b) Or the admin changed the permissions on /tmp so that only Torque can
create temporary directories there. In that case, any additional directory
created by a batch job should go to $TMPDIR, which Torque creates and removes
for your particular job. It might be that Open MPI is not tightly integrated
with your Torque installation. Have you had a chance to check on a node whether
your MPI processes are children of pbs_mom rather than of an ssh connection?
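
Both points can be checked from inside the job itself. The following is only
a rough sketch, assuming Linux nodes with /proc and a Python 3 interpreter on
them; the function names probe_dir, parse_stat and ancestors are made up for
this illustration and are not part of Open MPI or Torque. It reports free
space and free inodes for a directory, tries to create a sub-directory in it,
and lists the parent chain of a process so you can see whether pbs_mom shows
up in it. Some file systems report 0 free inodes even when they are fine, so
treat that number with care.

```python
import errno
import os
import tempfile


def probe_dir(path):
    """Report free space, free inodes, and whether a sub-directory can be made in `path`."""
    st = os.statvfs(path)
    report = {
        "path": path,
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to non-root users
        "free_inodes": st.f_favail,               # inodes available to non-root users
        "mkdir_error": None,
    }
    try:
        # Create a scratch directory and remove it again right away.
        os.rmdir(tempfile.mkdtemp(prefix="probe-", dir=path))
    except OSError as exc:
        # Typical values: EACCES (permission denied), ENOSPC (no space or inodes left).
        report["mkdir_error"] = errno.errorcode.get(exc.errno, str(exc.errno))
    return report


def parse_stat(text):
    """Extract (command name, parent PID) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or ')'.
    start, end = text.index("("), text.rindex(")")
    fields = text[end + 1:].split()  # fields[0] is the state, fields[1] the parent PID
    return text[start + 1:end], int(fields[1])


def ancestors(pid):
    """Return the command names from `pid` upwards, following parent PIDs (Linux only)."""
    chain = []
    while pid > 0:
        with open(f"/proc/{pid}/stat") as fh:
            comm, ppid = parse_stat(fh.read())
        chain.append(comm)
        pid = ppid
    return chain
```

Calling probe_dir("/tmp") and probe_dir(os.environ["TMPDIR"]) on a failing
node, and checking whether "pbs_mom" appears in ancestors(os.getpid()) when
the script is started the same way as your MPI processes, should tell a) and
b) apart.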

-- Reuti


> Thanks,
> 
> Brandon
> 
> PS I've copied the full error output below:
> [bc11bl08.deac.wfu.edu:31532] opal_os_dirpath_create: Error: Unable to create 
> the sub-directory (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0) of 
> (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0/2243/0/7), mkdir 
> failed [1]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../orte/util/session_dir.c at line 106
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../orte/util/session_dir.c at line 399
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 283
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to 
> [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../../../../orte/mca/ess/tm/ess_tm_module.c at line 112
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to 
> [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../orte/runtime/orte_init.c at line 128
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to 
> [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is 
> attempting to be sent to a process whose contact information is unknown in 
> file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file 
> ../../orte/orted/orted_main.c at line 357
> =>> PBS: job killed: walltime 3626 exceeded limit 3600
> Terminated
> mpiexec: killing job...