[OMPI users] gpudirect p2p?

2011-10-14 Thread Chris Cooper
Hi,

Are the recent peer-to-peer capabilities of CUDA leveraged by Open MPI
when, for example, you're running one rank per GPU on a single workstation?

In my testing it seems I only get on the order of 1 GB/s, as reported in
http://www.open-mpi.org/community/lists/users/2011/03/15823.php,
whereas NVIDIA's simpleP2P test indicates ~6 GB/s.

Also, I ran into a problem just trying to test.  It seems you have to
call cudaSetDevice/cuCtxCreate with the appropriate GPU id, which I wanted
to derive from the rank.  However, you don't know the rank until after
MPI_Init(), and you need to initialize CUDA before that.  Is there a
standard way to do it?  I have a workaround at the moment.
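
For reference, one way to do this is to pick the device from the launcher's
local-rank environment variable before calling MPI_Init(), if your Open MPI
version exports OMPI_COMM_WORLD_LOCAL_RANK for each launched process. A
minimal, untested sketch:

#include <cuda_runtime.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Assumed to be set by mpirun in recent Open MPI versions; fall back
       to device 0 if it is not present. */
    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int dev = lrank ? atoi(lrank) : 0;
    int ngpu = 0;

    cudaGetDeviceCount(&ngpu);
    if (ngpu > 0) {
        /* Bind this rank to "its" GPU before MPI is initialized. */
        cudaSetDevice(dev % ngpu);
    }

    MPI_Init(&argc, &argv);
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}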

Thanks,
Chris


Re: [OMPI users] gpudirect p2p?

2011-10-14 Thread Rolf vandeVaart
>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Chris Cooper
>Sent: Friday, October 14, 2011 1:28 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] gpudirect p2p?
>
>Hi,
>
>Are the recent peer to peer capabilities of cuda leveraged by Open MPI when
>eg you're running a rank per gpu on the one workstation?

Currently, no.  I am actively working on adding that capability. 

>
>It seems in my testing that I only get in the order of about 1GB/s as per
>http://www.open-mpi.org/community/lists/users/2011/03/15823.php,
>whereas nvidia's simpleP2P test indicates ~6 GB/s.
>
>Also, I ran into a problem just trying to test.  It seems you have to do
>cudaSetDevice/cuCtxCreate with the appropriate gpu id which I was wanting
>to derive from the rank.  You don't however know the rank until after
>MPI_Init() and you need to initialize cuda before.  Not sure if there's a
>standard way to do it?  I have a workaround atm.
>

The recommended way is to put the GPUs in compute-exclusive mode first:

# nvidia-smi -c 1

Then have a snippet like this at the beginning of the program (this uses the
driver API; you would probably use the runtime API instead):

CUresult res;
CUdevice cuDev;
CUcontext ctx;
int device, cuDevCount;

res = cuInit(0);
if (CUDA_SUCCESS != res) {
    exit(1);
}

if (CUDA_SUCCESS != cuDeviceGetCount(&cuDevCount)) {
    exit(2);
}
for (device = 0; device < cuDevCount; device++) {
    if (CUDA_SUCCESS != (res = cuDeviceGet(&cuDev, device))) {
        exit(3);
    }
    if (CUDA_SUCCESS != cuCtxCreate(&ctx, 0, cuDev)) {
        /* Another process must have grabbed it.  Go to the next one. */
    } else {
        /* Got an exclusive context on this device; keep it. */
        break;
    }
}
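
For reference, a rough runtime-API equivalent is sketched below. It is untested
and makes two assumptions: that cudaFree(0) is enough to force context creation
(and thus to detect that another process already owns a device in exclusive
mode), and that cudaThreadExit() cleans up a failed attempt before moving on to
the next device.

#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch only: with the GPUs in exclusive mode, try each device in turn
   and keep the first one on which a context can actually be created. */
static int grab_free_gpu(void)
{
    int count, dev;

    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        exit(2);
    }
    for (dev = 0; dev < count; dev++) {
        if (cudaSetDevice(dev) != cudaSuccess) {
            continue;
        }
        /* The runtime creates contexts lazily; cudaFree(0) forces creation.
           In exclusive mode this fails if another process owns the device. */
        if (cudaFree(0) == cudaSuccess) {
            return dev;
        }
        cudaThreadExit();   /* tear down the failed attempt before retrying */
    }
    exit(3);                /* no free device found */
}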



>Thanks,
>Chris
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Waitany segfaults or (maybe) hangs

2011-10-14 Thread Francesco Salvadore
Dear MPI users,

Using Valgrind, I found that the possible error (which leads to the segfault or
hang) comes from:


==10334== Conditional jump or move depends on uninitialised value(s)
==10334==    at 0xB150740: btl_openib_handle_incoming (btl_openib_component.c:2888)
==10334==    by 0xB1525A2: handle_wc (btl_openib_component.c:3189)
==10334==    by 0xB150390: btl_openib_component_progress (btl_openib_component.c:3462)
==10334==    by 0x581DDD6: opal_progress (opal_progress.c:207)
==10334==    by 0x52A75DE: ompi_request_default_wait_any (req_wait.c:154)
==10334==    by 0x52ED449: PMPI_Waitany (pwaitany.c:70)
==10334==    by 0x50541BF: MPI_WAITANY (pwaitany_f.c:86)
==10334==    by 0x4ECCC1: mpiwaitany_ (parallelutils.f:1374)
==10334==    by 0x4ECB18: waitanymessages_ (parallelutils.f:1295)
==10334==    by 0x484249: cutman_v_ (grid.f:490)
==10334==    by 0x40DE62: MAIN__ (cosa.f:379)
==10334==    by 0x40BEFB: main (in /work/ady/fsalvado/CAMPOBASSO/CASPUR_MPI/4_MPI/crashtest-valgrind/cosa.mpi)
==10334==
==10334== Use of uninitialised value of size 8
==10334==    at 0xB150764: btl_openib_handle_incoming (btl_openib_component.c:2892)
==10334==    by 0xB1525A2: handle_wc (btl_openib_component.c:3189)
==10334==    by 0xB150390: btl_openib_component_progress (btl_openib_component.c:3462)
==10334==    by 0x581DDD6: opal_progress (opal_progress.c:207)
==10334==    by 0x52A75DE: ompi_request_default_wait_any (req_wait.c:154)
==10334==    by 0x52ED449: PMPI_Waitany (pwaitany.c:70)
==10334==    by 0x50541BF: MPI_WAITANY (pwaitany_f.c:86)
==10334==    by 0x4ECCC1: mpiwaitany_ (parallelutils.f:1374)
==10334==    by 0x4ECB18: waitanymessages_ (parallelutils.f:1295)
==10334==    by 0x484249: cutman_v_ (grid.f:490)
==10334==    by 0x40DE62: MAIN__ (cosa.f:379)
==10334==    by 0x40BEFB: main (in /work/ady/fsalvado/CAMPOBASSO/CASPUR_MPI/4_MPI/crashtest-valgrind/cosa.mpi)

Valgrind complains even without eager_rdma (although the code seems to work in
that case), and complains much less using tcp/ip. There are many other Valgrind
warnings after these; I can send the complete Valgrind output if needed.

The messages resemble those in another thread

http://www.open-mpi.org/community/lists/users/2010/09/14324.php

which, however, concluded without any direct solution.

Can anyone help me identify the source of the bug (a bug in the code or in MPI)?

thanks
Francesco

From: Francesco Salvadore 
To: "us...@open-mpi.org" 
Sent: Saturday, October 8, 2011 10:06 AM
Subject: [OMPI users] MPI_Waitany segfaults or (maybe) hangs


Dear MPI users, 

I am struggling with the bad behaviour of an MPI code. Here is the basic
information:

a) Fortran with Intel 11 or Intel 12 and Open MPI 1.4.1 or 1.4.3 gives the same
problem. Activating the -traceback compiler option, I see the program stops
at MPI_Waitany. MPI_Waitany waits for the completion of an array of
MPI_Irecv requests: looping over the number of array components, at the end
all receives should be completed.
The program stops at unpredictable points (after 1, 5, or 24 hours of
computation). Sometimes I get a sigsegv:

mca_btl_openib.so  2BA74D29D181  Unknown   Unknown  Unknown 
mca_btl_openib.so  2BA74D29C6FF  Unknown   Unknown  Unknown 
mca_btl_openib.so  2BA74D29C033  Unknown   Unknown  Unknown 
libopen-pal.so.0   2BA74835C3E6  Unknown   Unknown  Unknown 
libmpi.so.0    2BA747E485AD  Unknown   Unknown  Unknown 
libmpi.so.0    2BA747E7857D  Unknown   Unknown  Unknown 
libmpi_f77.so.0    2BA747C047C4  Unknown   Unknown  Unknown 
cosa.mpi   004F856B  waitanymessages_ 1292  parallelutils.f
cosa.mpi   004C8044  cutman_q_    2084  bc.f 
cosa.mpi   00413369  smooth_  2029  cosa.f 
cosa.mpi   00410782  mg_   810  cosa.f 
cosa.mpi   0040FB78  MAIN__    537  cosa.f 
cosa.mpi   0040C1FC  Unknown   Unknown  Unknown 
libc.so.6  2BA7490AE994  Unknown   Unknown  Unknown 
cosa.mpi   0040C109  Unknown   Unknown  Unknown 
-- 
mpirun has exited due to process rank 34 with PID 10335 on 
node neo251 exiting without calling "finalize". This may 
have caused other processes in the application to be 
terminated by signals sent by mpirun (as reported here). 
-- 

Waitanymessages is just a wrapper around MPI_Waitany. Sometimes the run
stops writing anything to the screen and I do not know what is happening
(probably MPI_Waitany hangs). Before reaching the segfault or hang, the
results are always correct, as checked against the serial version of the
code.
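
For reference, a minimal sketch (with illustrative names only, not taken from
the actual code) of the Irecv/Waitany pattern described above:

#include <mpi.h>

#define NREQ 4   /* illustrative number of neighbours/messages */

void wait_all_messages(double *buf[NREQ], int count, int src[NREQ], int tag)
{
    MPI_Request reqs[NREQ];
    MPI_Status  status;
    int i, idx;

    /* Post all non-blocking receives up front. */
    for (i = 0; i < NREQ; i++) {
        MPI_Irecv(buf[i], count, MPI_DOUBLE, src[i], tag,
                  MPI_COMM_WORLD, &reqs[i]);
    }

    /* One MPI_Waitany per posted request: each call returns the index of a
       completed receive, so after NREQ iterations all receives are done. */
    for (i = 0; i < NREQ; i++) {
        MPI_Waitany(NREQ, reqs, &idx, &status);
        /* buf[idx] is now safe to use */
    }
}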

b) The problem occurs only using ope

[OMPI users] Error when using more than 88 processors for a specific executable -Abyss

2011-10-14 Thread Ashwani Kumar Mishra
Hello,
When I try to run the following command to submit a job on the cluster (40
nodes, each with 8 processors and 8 GB RAM), I receive the errors below.

Both commands work well as long as I use up to 88 processors in the
cluster, but the moment I allocate more than 88 processors I get the
2 errors below.

I tried setting the ulimit to unlimited and setting the MCA parameter
opal_set_max_sys_limits to 1, but the problem still won't go away.


$ mpirun=/opt/psc/ompi/bin/mpirun abyss-pe np=100 name=cattle k=50 n=10 in=s_1_1_sequence.txt

/opt/mpi/openmpi/1.3.3/intel/bin/mpirun -np 100 ABYSS-P -k50 -q3 --coverage-hist=coverage.hist -s cattle-bubbles.fa -o cattle-1.fa s_1_1_sequence.txt
[coe:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file base/iof_base_setup.c at line 107
[coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 203
[coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
--
Error: system limit exceeded on number of network connections that can be
open

This can be resolved by setting the mca parameter opal_set_max_sys_limits to
1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--
make: *** [cattle-1.fa] Error 1



When I submit the same job through qsub, I receive the following error:
$ qsub  -cwd -pe  orte 100 -o qsub.out -e qsub.err -b y -N  abyss `which
mpirun` /home/genome/abyss/bin/ABYSS-P -k 50 s_1_1_sequence.txt -o av


[compute-0-19.local][[28273,1],125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
[compute-0-19.local][[28273,1],127][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)
[compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
[compute-0-23.local][[28273,1],133][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.228 failed: Connection refused (111)
[compute-0-4.local][[28273,1],113][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to 173.16.255.231 failed: Connection refused (111)



Best Regards,
Ashwani


Re: [OMPI users] Error when using more than 88 processors for a specific executable -Abyss

2011-10-14 Thread Ralph Castain
Can't offer much about the qsub job. On the first one, what is your limit on 
the number of file descriptors? Could be your sys admin has it too low.
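
As a side note, the limit that typically matters for these ORTE pipe and
connection errors is the per-process open-file limit (what "ulimit -n"
reports) rather than the system-wide kernel limit. A small illustrative
check of the per-process limit, assuming a POSIX system:

#include <stdio.h>
#include <sys/resource.h>

/* Print the per-process open-file-descriptor limit (RLIMIT_NOFILE). */
int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}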


On Oct 14, 2011, at 12:07 PM, Ashwani Kumar Mishra wrote:

> Hello,
> When i try to run the following command i receive the following error when i 
> try to submit this job on the cluster having 40 nodes with each node having 8 
> processor & 8 GB RAM:
> 
> Both the command work well, as long as i use only upto 88 processors in the 
> cluster, but the moment i allocate more than 88 processors it gives me the 
> below 2 errors:
> 
> I tried to set the ulimit to unlimited & setting mca parameter 
> opal_set_max_sys_limits to 1 but still the problem wont go.
> 
> 
> $ mpirun=/opt/psc/ompi/bin/mpirun abyss-pe np=100 name=cattle k=50 n=10  
> in=s_1_1_sequence.txt
> 
> /opt/mpi/openmpi/1.3.3/intel/
> bin/mpirun -np 100 ABYSS-P -k50 -q3  --coverage-hist=coverage.hist -s 
> cattle-bubbles.fa  -o cattle-1.fa s_1_1_sequence.txt
> [coe:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of pipes 
> a process can open was reached in file base/iof_base_setup.c at line 107
> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of 
> pipes a process can open was reached in file odls_default_module.c at line 203
> [coe.:19807] [[62863,0],0] ORTE_ERROR_LOG: The system limit on number of 
> network connections a process can open was reached in file oob_tcp.c at line 
> 447
> --
> Error: system limit exceeded on number of network connections that can be open
> 
> This can be resolved by setting the mca parameter opal_set_max_sys_limits to 
> 1,
> increasing your limit descriptor setting (using limit or ulimit commands),
> or asking the system administrator to increase the system limit.
> --
> make: *** [cattle-1.fa] Error 1
> 
> 
> 
> 
> When i submit the same job through qsub, i receive the following error:
> $ qsub  -cwd -pe  orte 100 -o qsub.out -e qsub.err -b y -N  abyss `which 
> mpirun` /home/genome/abyss/bin/ABYSS-P -k 50 s_1_1_sequence.txt -o av
> 
> 
> [compute-0-19.local][[28273,1]
> ,125][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() 
> to 173.16.255.231 failed: Connection refused (111)
> [compute-0-19.local][[28273,1],127][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 173.16.255.231 failed: Connection refused (111)
> [compute-0-23.local][[28273,1],135][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 173.16.255.228 failed: Connection refused (111)
> [compute-0-23.local][[28273,1],133][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 173.16.255.228 failed: Connection refused (111)
> [compute-0-4.local][[28273,1],113][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 173.16.255.231 failed: Connection refused (111)
> 
> 
> 
> Best Regards,
> Ashwani
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI_Comm_accept - Busy wait

2011-10-14 Thread Thatyene Louise Alves de Souza Ramos
Does anyone have any idea?

---
Thatyene Ramos

On Fri, Oct 7, 2011 at 12:01 PM, Thatyene Louise Alves de Souza Ramos <
thaty...@gmail.com> wrote:

> Hi there!
>
> In my code I use MPI_Comm_accept in a server-client communication. I
> noticed that the server remains in a busy wait while waiting for client
> connections, using 100% of the CPU if there are no other processes running.
>
> I wonder if there is any way to prevent this from happening.
>
> Thanks in advance.
>
> Thatyene Ramos
>


Re: [OMPI users] Error when using more than 88 processors for a specific executable -Abyss

2011-10-14 Thread Ashwani Kumar Mishra
Hi Ralph,

fs.file-max = 10

is this OK, or is it too low?

Best Regards,
Ashwani


On Fri, Oct 14, 2011 at 11:45 PM, Ralph Castain  wrote:

> Can't offer much about the qsub job. On the first one, what is your limit
> on the number of file descriptors? Could be your sys admin has it too low.
>


Re: [OMPI users] Error when using more than 88 processors for a specific executable -Abyss

2011-10-14 Thread Ralph Castain
Should be plenty for us - does your program consume a lot?


On Oct 14, 2011, at 12:25 PM, Ashwani Kumar Mishra wrote:

> Hi Ralph,
> fs.file-max = 10
> is this ok or less?
> 
> Best Regards,
> Ashwani



Re: [OMPI users] MPI_Comm_accept - Busy wait

2011-10-14 Thread Ralph Castain
Sorry - been occupied. This is normal behavior. As has been discussed on this 
list before, OMPI made a design decision to minimize latency. This means we 
aggressively poll for connections. Only thing you can do is tell it to yield 
the processor when idle so, if something else is trying to run, we will let it 
get in there a little earlier. Use -mca mpi_yield_when_idle 1

However, we have seen that if no other user processes are trying to run, then 
the scheduler hands the processor right back to you - and you'll still see that 
100% number. It doesn't mean we are being hogs - it just means that nothing 
else wants to run, so we happily accept the time.
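
For context, a minimal server-side sketch of the pattern being discussed
(illustrative only, using the standard MPI dynamic-process calls); the
aggressive polling described above happens inside MPI_Comm_accept:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char     port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);

    /* Open a port and print it so a client can MPI_Comm_connect to it. */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("server port: %s\n", port_name);

    /* Blocks (busy-polling) until a client connects. */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    MPI_Comm_disconnect(&client);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}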


On Oct 14, 2011, at 12:21 PM, Thatyene Louise Alves de Souza Ramos wrote:

> Does anyone have any idea?
> 
> ---
> Thatyene Ramos
> 



Re: [OMPI users] MPI_Comm_accept - Busy wait

2011-10-14 Thread Thatyene Louise Alves de Souza Ramos
Thank you for the explanation! I use "-mca mpi_yield_when_idle 1" already!

Thank you again!
---
Thatyene Ramos

On Fri, Oct 14, 2011 at 3:43 PM, Ralph Castain  wrote:

> Sorry - been occupied. This is normal behavior. As has been discussed on
> this list before, OMPI made a design decision to minimize latency. This
> means we aggressively poll for connections. Only thing you can do is tell it
> to yield the processor when idle so, if something else is trying to run, we
> will let it get in there a little earlier. Use -mca mpi_yield_when_idle 1
>
> However, we have seen that if no other user processes are trying to run,
> then the scheduler hands the processor right back to you - and you'll still
> see that 100% number. It doesn't mean we are being hogs - it just means that
> nothing else wants to run, so we happily accept the time.


Re: [OMPI users] Error when using more than 88 processors for a specific executable -Abyss

2011-10-14 Thread Ashwani Kumar Mishra
Hi Ralph,
I have no idea how many file descriptors this program consumes :(

Best Regards,
Ashwani

On Sat, Oct 15, 2011 at 12:08 AM, Ralph Castain  wrote:

> Should be plenty for us - does your program consume a lot?