Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-15 Thread tmishima
Further information: I first encountered this problem in openmpi-1.7.4.x, while openmpi-1.7.3 and 1.6.x work fine. My directory below is "testbed-openmpi-1.7.3", but it is really 1.7.4a1r29646; sorry if that is confusing. [mishima@manage testbed-openmpi-1.7.3]$ ompi_info | grep "Open MPI:
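For context, a minimal sketch of how to confirm which Open MPI build is actually being picked up (the install paths are whatever your environment uses):
  ompi_info | grep "Open MPI:"
  which mpirun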

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-15 Thread Ralph Castain
Indeed it should - most puzzling. I'll try playing with it on slurm using sbatch and see if I get the same behavior. Offhand, I can't see why the difference would exist unless somehow the script itself is taking one of the execution slots, and somehow Torque is accounting for it. Will have to e

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread tmishima
Hi Ralph, it's no problem to let it lie until it becomes serious again, so this is just for your information. I agree with your view that the problem lies in the modified hostfile. Strictly speaking, though, it is related to simply adding the -hostfile option to mpirun in a Torque s

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread Ralph Castain
On Nov 14, 2013, at 3:25 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > I checked -cpus-per-proc in openmpi-1.7.4a1r29646. > It works as I want: it adjusts the number of procs > on each node by dividing by the number of threads. > > I think my problem is solved so far using -cpus-per

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread tmishima
Hi Ralph, I checked -cpus-per-proc in openmpi-1.7.4a1r29646. It works as I want: it adjusts the number of procs on each node by dividing by the number of threads. I think my problem is solved so far using -cpus-per-proc, thank you very much. Regarding the oversubscription problem, I checked NPROCS was

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread Ralph Castain
FWIW: I verified that this works fine under a Slurm allocation of 2 nodes, each with 12 slots. I filled the node without getting an "oversubscribed" error message: [rhc@bend001 svn-trunk]$ mpirun -n 3 --bind-to core --cpus-per-proc 4 --report-bindings -hostfile hosts hostname [bend001:24318] MCW

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread Ralph Castain
Also, you need to tell mpirun that the nodes aren't the same - add --hetero-nodes to your cmd line. On Nov 13, 2013, at 10:14 PM, tmish...@jcity.maeda.co.jp wrote: > > > Thank you, Ralph! > > I didn't know about that function of cpus-per-proc. > As far as I know, it didn't work in openmpi-1.6.x lik
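For illustration, a sketch of the kind of command line being suggested here (the process count, hostfile name, and executable are placeholders, not taken from the thread):
  mpirun -np 12 --hetero-nodes --cpus-per-proc 4 --bind-to core --report-bindings -hostfile myhosts ./a.out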

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-14 Thread tmishima
Thank you, Ralph! I didn't know about that function of cpus-per-proc. As far as I know, it didn't work like that in openmpi-1.6.x; it was just binding to 4 cores... I don't have much time today, so I'll check it tomorrow. And thank you again for checking the oversubscription problem. tmishima > Guess I d

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
Guess I don't see why modifying the allocation is required - we have mapping options that should support such things. If you specify the total number of procs you want, and cpus-per-proc=4, it should do the same thing I would think. You'd get 2 procs on the 8 slot nodes, 8 on the 32 proc nodes,

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Our cluster consists of three types of nodes, with 8, 32 and 64 slots respectively. Since the performance of each core is almost the same, mixed use of these nodes is possible. Furthermore, in this case, for a hybrid application with OpenMPI+OpenMP, modification of the hostfile is necessary as fo
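A purely hypothetical illustration of such a hostfile modification (node names and slot counts are invented; the idea is slots divided by the number of OpenMP threads per rank, here 4):
  node01 slots=2    # 8-slot node
  node02 slots=8    # 32-slot node
  node03 slots=16   # 64-slot node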

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
Why do it the hard way? I'll look at the FAQ because that definitely isn't a recommended thing to do - better to use -host to specify the subset, or just specify the desired mapping using all the various mappers we provide. On Nov 13, 2013, at 6:39 PM, tmish...@jcity.maeda.co.jp wrote: > > >

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Sorry for the cross-post. The nodefile is very simple and consists of 8 lines: node08 node08 node08 node08 node08 node08 node08 node08 Therefore, NPROCS=8. My aim is to modify the allocation, as you pointed out. According to the Open MPI FAQ, a proper subset of the hosts allocated to the Torque / PBS Pro jo
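A common way a Torque script derives this value (an assumption about how the script is written, not quoted from the post):
  cp $PBS_NODEFILE pbs_hosts
  NPROCS=$(wc -l < pbs_hosts)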

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Sorry, I forgot to tell you. The nodefile is very simple and consists of 8 lines: node08 node08 node08 node08 node08 node08 node08 node08 tmishima On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > Yes, node08 has 8 slots but the number of processes I run is also 8. > > > > #

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
Please - can you answer my question on script2? What is the value of NPROCS? Why would you want to do it this way? Are you planning to modify the allocation? That generally is a bad idea, as it can confuse the system. On Nov 13, 2013, at 5:55 PM, tmish...@jcity.maeda.co.jp wrote: > > > Since

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Since what I really want is to run script2 correctly, please let us concentrate on script2. I'm not an expert on the internals of Open MPI; all I can do is observe from the outside. I suspect these lines are strange, especially the last one. [node08.cluster:26952] mca:rmaps:rr: mapping job [565

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
On Nov 13, 2013, at 4:43 PM, tmish...@jcity.maeda.co.jp wrote: > > > Yes, node08 has 8 slots but the number of processes I run is also 8. > > #PBS -l nodes=node08:ppn=8 > > Therefore, I think it should allow this allocation. Is that right? Correct > > My question is why script1 works and script2 d

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Yes, node08 has 8 slots, but the number of processes I run is also 8. #PBS -l nodes=node08:ppn=8 Therefore, I think it should allow this allocation. Is that right? My question is why script1 works and script2 does not. They are almost the same. #PBS -l nodes=node08:ppn=8 export OMP_NUM_THREADS=1 cd $PBS_
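A rough sketch of the two variants being compared, based on fragments elsewhere in the thread (the working directory, program name, and the exact script1 launch line are assumptions, not verbatim copies):
  # script1: mpirun takes the allocation directly from Torque
  #PBS -l nodes=node08:ppn=8
  export OMP_NUM_THREADS=1
  cd $PBS_O_WORKDIR
  mpirun -np 8 -report-bindings -bind-to core ./mPre

  # script2: same job, but the hosts are passed explicitly via a machinefile
  #PBS -l nodes=node08:ppn=8
  export OMP_NUM_THREADS=1
  cd $PBS_O_WORKDIR
  cp $PBS_NODEFILE pbs_hosts
  NPROCS=$(wc -l < pbs_hosts)
  mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core ./mPre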

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
I guess here's my confusion. If you are using only one node, and that node has 8 allocated slots, then we will not allow you to run more than 8 processes on that node unless you specifically provide the --oversubscribe flag. This is because you are operating in a managed environment (in this cas
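For reference, the flag mentioned above would be added directly to the launch line; a minimal sketch (the process count and program are illustrative only):
  mpirun -np 16 --oversubscribe ./a.out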

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
It has nothing to do with LAMA as you aren't using that mapper. How many nodes are in this allocation? On Nov 13, 2013, at 4:06 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, this is an additional information. > > Here is the main part of output by adding "-mca rmaps_base_verbose 50".

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Hi Ralph, this is additional information: the main part of the output after adding "-mca rmaps_base_verbose 50". [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm creating map [node08.cluster:26952] [[56581,0],0] plm:base:setup_vm

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Hi Ralph, This is the result of adding -mca ras_base_verbose 50. SCRIPT: mpirun -machinefile pbs_hosts -np ${NPROCS} -report-bindings -bind-to core \ -mca ras_base_verbose 50 -mca plm_base_verbose 5 ./mPre OUTPUT: [node08.cluster:26770] mca:base:select:( plm) Querying component [rsh] [

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
Hmmm...looks like we aren't getting your allocation. Can you rerun and add -mca ras_base_verbose 50? On Nov 12, 2013, at 11:30 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > > Here is the output of "-mca plm_base_verbose 5". > > [node08.cluster:23573] mca:base:select:( plm) Queryin

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Hi Ralph, Here is the output of "-mca plm_base_verbose 5". [node08.cluster:23573] mca:base:select:( plm) Querying component [rsh] [node08.cluster:23573] [[INVALID],INVALID] plm:rsh_lookup on agent /usr/bin/rsh path NULL [node08.cluster:23573] mca:base:select:( plm) Query of component [rsh] se

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Hi Ralph, okay, I can help you. Please give me some time to report the output. Tetsuya Mishima > I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build --enable-debug and add -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing t

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread Ralph Castain
I can try, but I have no way of testing Torque any more - so all I can do is a code review. If you can build --enable-debug and add -mca plm_base_verbose 5 to your cmd line, I'd appreciate seeing the output. On Nov 12, 2013, at 9:58 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, > >
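A minimal sketch of the requested debug build and verbose run (the install prefix, build options beyond --enable-debug, and job size are placeholders):
  ./configure --prefix=$HOME/ompi-debug --enable-debug
  make -j4 && make install
  mpirun -mca plm_base_verbose 5 -np 8 ./mPre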

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-13 Thread tmishima
Hi Ralph, thank you for your quick response. I'd like to report one more regression in the Torque support of openmpi-1.7.4a1r29646, which might be related to "#3893: LAMA mapper has problems", which I reported a few days ago. The script below does not work with openmpi-1.7.4a1r29646, although it

Re: [OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-12 Thread Ralph Castain
Done - thanks! On Nov 12, 2013, at 7:35 PM, tmish...@jcity.maeda.co.jp wrote: > > > Dear openmpi developers, > > I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built with > PGI 13.10, as shown below: > > [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 > -r

[OMPI users] Segmentation fault in oob_tcp.c of openmpi-1.7.4a1r29646

2013-11-12 Thread tmishima
Dear openmpi developers, I got a segmentation fault in a trial run of openmpi-1.7.4a1r29646 built with PGI 13.10, as shown below: [mishima@manage testbed-openmpi-1.7.3]$ mpirun -np 4 -cpus-per-proc 2 -report-bindings mPre [manage.cluster:23082] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[c