Re: [OMPI users] another mpirun + xgrid question
If you are using a scheduler such as PBS or SGE with MPI, they provide prolog and epilog hooks where you can supply scripts that perform the copy operation. As the names suggest, these scripts run before and after job execution. Whether MPI itself can stage the executable is something I still have to check. The alternative is to keep a copy of the program at the same location on all compute nodes and then launch mpirun. If the executable location differs across the compute nodes, you have to specify that path on the mpirun command line.

On Mon, 2007-09-10 at 15:35 -0400, Lev Givon wrote:
> When launching an MPI program with mpirun on an xgrid cluster, is
> there a way to cause the program being run to be temporarily copied to
> the compute nodes in the cluster when executed (i.e., similar to what the
> xgrid command line tool does)? Or is it necessary to make the program
> being run available on every compute node (e.g., using NFS data
> partitions)?
>
> L.G.
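A rough sketch of what such a staging pair could look like (the file names, the destination path, and the assumption that the job's node list is visible to the script are all illustrative; the exact environment handed to prolog/epilog scripts depends on how the scheduler is configured):

#!/bin/sh
# prolog sketch: copy the executable from a shared location to every
# node assigned to the job before mpirun starts.
# $PBS_NODEFILE and the paths below are assumptions, not scheduler defaults.
for node in $(sort -u "$PBS_NODEFILE"); do
    scp /shared/bin/myprog "$node:/tmp/myprog"
done

#!/bin/sh
# epilog sketch: remove the staged copy once the job has finished.
for node in $(sort -u "$PBS_NODEFILE"); do
    ssh "$node" rm -f /tmp/myprog
done

The job script would then point mpirun at the staged path, e.g. mpirun -np 8 /tmp/myprog.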
[OMPI users] libnbc compilation
Hello Everyone,
I was checking the development version from SVN and found that support for LibNBC is going to come in the next release. I tried to compile it but failed. Could someone suggest how to get it compiled? When I made changes to the configure script (basically added some flags), the output says LibNBC can't be compiled. Any help would be appreciated.
Regards
Neeraj
[OMPI users] Query regarding GPR
Hi everybody,
I have a question regarding ORTE. One of the major functions of ORTE is to maintain the GPR (General Purpose Registry), which subscribes to and publishes information across the universe. My question: when we submit a job from a machine, where does the GPR get created? Is it on the submit machine (the HNP)? If yes, how do the compute nodes get that information during execution? Do they use the OOB for it?
-Neeraj
[OMPI users] Tuning Openmpi with IB Interconnect
Dear All,
Could anyone tell me the important tuning parameters in Open MPI with an IB interconnect? I tried setting the eager_rdma, min_rdma_size, and mpi_leave_pinned parameters from the mpirun command line on a 38-node cluster (38*2 processors), but in vain; a plain mpirun with no MCA parameters performed better. The test was point-to-point send/receive with a data size of 8 MB. Similarly, I patched the HPL Linpack code with LibNBC (non-blocking collectives) and found no performance benefit. Going through the patch, it appears it is probably not overlapping computation with communication. Any help in this direction would be appreciated.
-Neeraj
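For reference, a command line of the kind being described might look like the sketch below (the parameter names and values are illustrative for the 1.2-era openib BTL and should be checked against ompi_info --param btl openib for the installed build; the process count and hostfile are assumptions):

mpirun -np 76 -hostfile nodes \
    --mca btl_openib_use_eager_rdma 1 \
    --mca btl_openib_min_rdma_size 1048576 \
    --mca mpi_leave_pinned 1 \
    ./p2p_sendrecv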
[OMPI users] Re :Re: Tuning Openmpi with IB Interconnect
Hi,
The code was pretty simple. I was sending 8 MB of data from one rank to another in a loop (say 1000 iterations), then taking the average time and calculating the bandwidth. I tried this both with mpirun-with-MCA-parameters and without any parameters, and to my surprise the performance degraded whenever I tried to tune it.

Now I have another question. Is it possible to have an IB hardware multicast implementation in Open MPI? I have gone through the issues/challenges involved, but have also read of a couple of people who have successfully done it for Ethernet/Gigabit Ethernet and IPoIB, of course at an experimental stage. I would actually like to contribute this to Open MPI and need help with it.
-Neeraj

On Thu, 11 Oct 2007 12:01:39 +0200 Open MPI Users wrote:
Hi Neeraj,
> Could anyone tell me the important tuning parameters in openmpi with
> IB interconnect? I tried setting eager_rdma, min_rdma_size,
> mpi_leave_pinned parameters from the mpirun command line on 38 nodes
> cluster (38*2 processors) but in vain. I found simple mpirun with no mca
> parameters performing better. I conducted test on P2P send/receive with
> data size of 8MB.
The performance of the BTL with different parameters depends heavily on the code that you run. E.g., leave_pinned works very well with many microbenchmarks (e.g., bandwidth/overlap-wise) but may not perform well with real applications that use different memory regions. It's pretty much the same with the other parameters. The default values are considered best for many applications. Can you provide us any details about the code you're running to test performance?
> Similarly i patched HPL linpack code with libnbc(non blocking
> collectives) and found no performance benefits. I went through its patch
> and found that, its probably not overlapping computation with
> communication.
Ah, so there are two things. LibNBC provides overlap; most overlap is achieved if memory regions are reused and leave_pinned is activated. But again, this is highly application-dependent. However, the patch for the Linpack code (I guess you refer to the patch from the LibNBC webpage [1]) is in an experimental stage (as the website says) and has not been properly tested for performance benefit. The original HPL provides something like a broadcast start and broadcast end phase. I just replaced them with non-blocking calls to NBC_Ibcast() and did not find the time to do any performance/code analysis yet. Any input by HPL experts is appreciated!
Best,
Torsten
[1]: http://www.unixer.de/research/nbcoll/hpl/
--
bash$ :(){ :|:&};: - http://www.unixer.de/ - "Software Engineering is that part of Computer Science which is too difficult for the Computer Scientist." ~ F. L. Bauer
[OMPI users] Re :Re: Re :Re: Tuning Openmpi with IB Interconnect
Yes, the buffer was being re-used. No, we didn't try to benchmark it with NetPIPE and the other tools, but the program was pretty simple. Do you think I need to test it with bigger chunks (>8 MB) of communication? We also tried manipulating eager_limit and min_rdma_size, but with no success.
Neeraj

On Fri, 12 Oct 2007 13:00:10 +0200 Open MPI Users wrote:
Hello,
> The code was pretty simple. I was trying to send 8MB data from one
> rank to other in a loop(say 1000 iterations). And then i was taking the
> average of time taken and was calculating the bandwidth.
>
> The above logic i tried with both mpirun-with-mca-parameters and without
> any parameters. And to my surprise, the performance was degrading when i
> was trying to manipulate.
That sounds strange. So did you re-use the communication buffers? Did you try to run some existing benchmarks like NetPIPE [1], IMB or Netgauge [2]?
> Now I have another question in mind. Is it possible to have IB Hardware
> Multicast implementation in OpenMPI? I have gone through the
> issues/challenges for the same, but also read couple of people who have
> successfully done it for Ethernet/Giga-bit Ethernet and IPoIB ofcourse in
> experimental stage. Actually i want to contribute for it in OpenMPI and
> need the help for the same.
As far as I know, there are two groups/people working on this. Andy Friedley implements a "traditional" ACK-based approach (like the one that the OSU folks published about some time ago) and I implemented a new idea for extreme scale (see "A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast" [3]). I know that my version is still unstable and has some problems, but I'm working on this.
Best,
Torsten
[1]: http://www.scl.ameslab.gov/netpipe/
[2]: http://www.unixer.de/research/netgauge/
[3]: https://www.unixer.de/publications/#hoefler-cac07
--
bash$ :(){ :|:&};: - http://www.unixer.de/ - Computer scientists are the historians of computing. -- Gordon Bell
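For concreteness, the loop being described would look roughly like the sketch below (a minimal reconstruction with an assumed message size and iteration count, not the actual test code):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define MSG_BYTES (8*1024*1024)   /* 8 MB message, as in the test described */
#define ITERS     1000

int main(int argc, char *argv[])
{
    int rank, i;
    double t0, t1;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = (char *)malloc(MSG_BYTES);   /* same buffer re-used every iteration */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* 2x because each iteration moves the message both ways */
        printf("average bandwidth: %.1f MB/s\n",
               2.0 * ITERS * MSG_BYTES / (t1 - t0) / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Something along these lines, run once with and once without the MCA options, makes the two configurations easy to compare directly; benchmarks such as NetPIPE or IMB do essentially the same measurement with more care.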
[OMPI users] Compile test programs
Hi all,
Could someone suggest how to compile the programs given in the test directory of the source code? There are a couple of directories within test which contain sample programs showing the usage of the data structures used by Open MPI. I am able to compile some of the directories, since they have a Makefile created when running the configure script, but a few of them, such as runtime, don't have a Makefile. Please help me compile them.
-Neeraj
[OMPI users] OpenMPI 1.2.4 vs 1.2
Hello Guys,
I had Open MPI v1.2 installed on my cluster. A couple of days back I decided to upgrade it to v1.2.4 (the latest release, I suppose). Since I didn't want to take any risk, I first installed it in a temporary location and ran the bandwidth and bidirectional bandwidth tests provided by the OSU guys, and to my surprise the old version performs better in both scenarios. Could anyone give me the reason for this? I repeated the above point-to-point tests between several pairs of nodes, but the results were the same :(
-Neeraj
[OMPI users] Re :Re: Process 0 with different time executing the same code
Hi,
Please check whether the following things are correct:
1) The array bounds are equal, i.e., "my_x" and "size_y" have the same value on all nodes.
2) The nodes are homogeneous. To check that, you could pick a different node as root and run the program again.
-Neeraj

On Fri, 26 Oct 2007 10:13:15 +0500 (PKT) Open MPI Users wrote:
Thanks for your reply. I used MPI_Wtime for my application, but even then process 0 took longer executing the mentioned code segment. I might be wrong, but what I see is that process 0 takes more time to access the array elements than the other processes. Now I don't see what to do, because the mentioned code segment is creating a bottleneck for the timing of my application. Can anyone suggest something in this regard? I will be very thankful.
regards
Aftab Hussain

On Thu, October 25, 2007 9:38 pm, jody wrote:
> HI
> I'm not sure if that is the problem,
> but in MPI applications you should use MPI_Wtime() for time-measurements
>
> Jody
>
> On 10/25/07, 42af...@niit.edu.pk wrote:
>
>> Hi all,
>> I am a research assistant (RA) at NUST Pakistan in the High Performance
>> Scientific Computing Lab. I am working on the parallel implementation
>> of the Finite Difference Time Domain (FDTD) method using MPI. I am using
>> the OpenMPI environment on a cluster of 4 SunFire v890 nodes connected
>> through Myrinet. The problem I am having is that when I run my code with,
>> say, 4 processes, process 0 takes about 3 times more time than the other
>> three processes to execute a for loop, which is the main cause of load
>> imbalance in my code. I am including the code that is causing the problem.
>> The code is run by all the processes simultaneously and independently,
>> and I have timed it independently of other segments of code.
>>
>> start = gethrtime();
>> for (m = 1; m < my_x; m++) {
>>   for (n = 1; n < size_y-1; n++) {
>>     Ez(m,n) = Ez(m,n) + cezh*((Hy(m,n) - Hy(m-1,n)) - (Hx(m,n) - Hx(m,n-1)));
>>   }
>> }
>> stop = gethrtime();
>> time = (stop-start);
>>
>> In my implementation I used 1-D arrays to realize the 2-D arrays. I have
>> used the following macros for accessing the array elements:
>>
>> #define Hx(I,J) hx[(I)*(size_y) + (J)]
>> #define Hy(I,J) hy[(I)*(size_y) + (J)]
>> #define Ez(I,J) ez[(I)*(size_y) + (J)]
>>
>> Can anyone tell me what I am doing wrong here, whether the macros are
>> creating the problem, or whether it could be related to an OS issue.
>> I will be looking forward to help, because this problem has stopped my
>> progress for the last two weeks.
>>
>> regards aftab hussain
>>
>> RA High Performance Scientific Computing Lab
>> NUST Institute of Information Technology
>> National University of Sciences and Technology Pakistan
[OMPI users] MPI_Send issues with openib btl
Hi,
We are facing a problem when calling MPI_Send over IB. The problem looks similar to ticket https://svn.open-mpi.org/trac/ompi/ticket/232, but this time it is for the IB interface. When the program is forced to run with --mca btl tcp,self it runs fine. On IB, it gives error messages such as local protocol error, flush error, invalid request error, and local length error. Any help would be appreciated.
-Neeraj
[OMPI users] OpenMP and OpenMPI Issue
Hi folks,
I have been seeing some nasty behaviour in MPI_Send/Recv with a large dataset (8 MB) when OpenMP and Open MPI are used together over an IB interconnect. The program is attached below. The code first calls MPI_Init_thread(), followed by the OpenMP thread-creation API. The program works fine if we do single-sided communication [thread 0 of process 0 sending some data to any thread of process 1], but it hangs if both sides try to send data (8 MB) over the IB interconnect. Interestingly, the program works fine if we send short data (1 MB or below).

I see this with
  openmpi-1.2 or openmpi-1.2.4 (compiled with --enable-mpi-threads)
  ofed 1.2
  2.6.9-42.4sp.XCsmp
  icc (Intel Compiler)

compiled as
  mpicc -O3 -openmp temp.c
run as
  mpirun -np 2 -hostfile nodelist a.out

The error I am getting is
--------------------------------------------------------------------------
[0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 6391728 opcode 0
[0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 7058304 opcode 128
[0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 [0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 6854256 opcode 128
error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 6920112 opcode 0
--------------------------------------------------------------------------

Anyone else seeing something similar? Any ideas for workarounds? As a point of reference, the program works fine if we force Open MPI to select the TCP interconnect using --mca btl tcp,self.
-Neeraj

/* Note: the #include names were lost in the archived post; the headers
   below are the ones this code needs and are restored by the editor. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>
#include "time.h"

#define MAX 100

int main(int argc, char *argv[])
{
    int required = MPI_THREAD_MULTIPLE;
    int provided;
    int rank;
    int size;
    int id;
    int flag;
    MPI_Status status;
    double *buff1, *buff2;

    MPI_Init_thread(&argc, &argv, required, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buff1 = (double *)malloc(sizeof(double)*MAX);
    buff2 = (double *)malloc(sizeof(double)*MAX);

    omp_set_num_threads(2);
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        if (rank == 0) {
            /* thread 0 sends to rank 1, thread 1 receives from rank 1 */
            if (id == 0)
                MPI_Send(buff1, MAX, MPI_DOUBLE, 1, rank, MPI_COMM_WORLD);
            else
                MPI_Recv(buff2, MAX, MPI_DOUBLE, 1, 1234, MPI_COMM_WORLD, &status);
        }
        if (rank == 1) {
            /* thread 0 receives from rank 0, thread 1 sends to rank 0 */
            if (id == 0)
                MPI_Recv(buff1, MAX, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            else
                MPI_Send(buff2, MAX, MPI_DOUBLE, 0, 1234, MPI_COMM_WORLD);
        }
    }

    printf("rank = %d %d \n", rank, provided);
    free(buff1);
    free(buff2);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}
[OMPI users] Re :Re: OpenMP and OpenMPI Issue
Thanks for your reply, but the program runs on the TCP interconnect with the same data size, and also on IB with a small data size, say 1 MB. So I don't think the problem is in Open MPI; it has to do with the IB logic, which probably doesn't work well with threads. I also tried the program with MPI_THREAD_SERIALIZED, but in vain. When is version 1.3 scheduled to be released? Would it fix such issues? Correct me if I am wrong.
-Neeraj

On Wed, 31 Oct 2007 05:31:32 -0700 Open MPI Users wrote:
THREAD_MULTIPLE support does not work in the 1.2 series. Try turning it off.

On Oct 30, 2007, at 12:17 AM, Neeraj Chourasia wrote:
> Hi folks,
>
> I have been seeing some nasty behaviour in MPI_Send/Recv
> with large dataset (8 MB), when used with OpenMP and Openmpi
> together with IB Interconnect. Attached is a program. [...]
>
> As a point of reference, program works fine, if we force
> openmpi to select TCP interconnect using --mca btl tcp,self.
>
> -Neeraj

--
Jeff Squyres
Cisco Systems
[OMPI users] Adding new API
Hello Everyone,
I want to add an extra API to be used by the application folks. This API would be called from C applications and has to be compiled and linked with mpicc. But I am getting undefined references, even though I am exporting it in the source code. Could someone tell me the steps I should take care of?
-Neeraj
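In case it helps frame the question, here is a minimal sketch of the pieces that have to line up (the function name, header, and file names below are purely hypothetical, not Open MPI's actual layout): the new symbol needs a prototype visible to the application, its source file has to be compiled into the library that mpicc links against (i.e., added to the appropriate Makefile.am and the library rebuilt and reinstalled), and only then does the application's reference resolve.

/* my_extra.h -- hypothetical header installed alongside mpi.h */
int my_extra_call(int value);

/* my_extra.c -- hypothetical source compiled into the MPI library */
int my_extra_call(int value)
{
    return value * 2;   /* placeholder body */
}

/* app.c -- user code, built with:  mpicc app.c -o app
   The undefined-reference error appears exactly when my_extra.c was not
   compiled into (or exported from) the installed library. */
#include <stdio.h>
#include <mpi.h>
#include "my_extra.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("%d\n", my_extra_call(21));
    MPI_Finalize();
    return 0;
}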
[OMPI users] version 1.3
Hello Guys,
When is version 1.3 scheduled to be released? Since it would contain checkpointing, a library for non-blocking communication, and ConnectX support for QPs, it would be great to have it ASAP. I am evaluating MVAPICH against Open MPI and found that MVAPICH still has the upper hand in terms of checkpointing, but I am pretty sure that once v1.3 comes out it will help the HPC community a lot. I can find the development trunk version, but I am more interested in the production release.
-Neeraj
Re: [OMPI users] OpenIB problems
Hi Guys,
An alternative for the THREAD_MULTIPLE problem is to pass --mca mpi_leave_pinned 1 to mpirun. This results in a single RDMA operation instead of splitting the data into chunks of the maximum RDMA size (which defaults to 1 MB). If your data size is small, say below 1 MB, the program will run well with THREAD_MULTIPLE; the problem comes when the data size increases and Open MPI starts splitting it. I think that even with bigger sizes the program works if the interconnect is TCP, but it fails on IB. So on IB you can run your program if you set the MCA parameter mpi_leave_pinned to 1.
Cheers
Neeraj

On Thu, 29 Nov 2007 Brock Palen wrote:
>Jeff thanks for all the replies,
>
>Hate to admit it, but at the moment we can't log onto the switch.
>
>But the ibcheckerrors command returns nothing out of bounds, and i
>think that command also checks the switch ports.
>
>Thanks, we will do some tests
>
>Brock Palen
>Center for Advanced Computing
>bro...@umich.edu
>(734)936-1985
>
>On Nov 27, 2007, at 4:50 PM, Jeff Squyres wrote:
>
> > Sorry for jumping in late; the holiday and other travel prevented me
> > from getting to all my mail recently... :-\
> >
> > Have you checked the counters on the subnet manager to see if any
> > other errors are occurring? It might be good to clear all the
> > counters, run the job, and see if the counters are increasing faster
> > than they should (i.e., any particular counter should advance very
> > very slowly -- perhaps 1 per day or so).
> >
> > I'll ask around the kernel-level guys (i.e., Roland) to see what else
> > could cause this kind of error.
> >
> > On Nov 27, 2007, at 3:35 PM, Brock Palen wrote:
> >
> >> Ok i will open a case with cisco,
> >>
> >> Brock Palen
> >> Center for Advanced Computing
> >> bro...@umich.edu
> >> (734)936-1985
> >>
> >> On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote:
> >>
> >>> Brock Palen wrote:
> >>>> What would be a place to look? Should this just be default then
> >>>> for OMPI? ompi_info shows the default as 10 seconds? Is that right
> >>>> 'seconds'?
> >>> The other IB guys can probably answer better than I can -- I'm not an
> >>> expert in this part of IB (or really any part I guess :). Not sure why
> >>> a larger value isn't the default. No, its not seconds -- check the
> >>> description of the MCA parameter:
> >>>
> >>> 4.096 microseconds * (2^btl_openib_ib_timeout)
> >>>
> >>>> You sure?
> >>>> ompi_info --param btl openib
> >>>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
> >>>> InfiniBand transmit timeout, in seconds
> >>>> (must be >= 1)
> >>>
> >>> Yeah:
> >>>
> >>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
> >>> InfiniBand transmit timeout, plugged into formula:
> >>> 4.096 microseconds * (2^btl_openib_ib_timeout)
> >>> (must be >= 0 and <= 31)
> >>>
> >>> Reading earlier in the thread you said OMPI v1.2.0; I got this
> >>> from a trunk checkout that's around 3 weeks old. A quick check shows
> >>> this description was changed between 1.2.0 and 1.2.1. However the use
> >>> of this parameter hasn't changed -- it's simply passed along to IB
> >>> verbs when creating a queue pair (aka a connection).
> >>>
> >>> Andrew
> >
> > --
> > Jeff Squyres
> > Cisco Systems
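As a quick arithmetic check of that formula (my own calculation, not from the thread): with the default btl_openib_ib_timeout of 10, the retry timeout is 4.096 microseconds * 2^10 = 4.096 us * 1024, roughly 4.2 milliseconds -- so the value is an exponent fed into the formula, not a count of seconds.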
[OMPI users] what is MPI_IN_PLACE
Hello everyone,
While going through the collective algorithms, I came across the preprocessor define MPI_IN_PLACE, which is (void *)1. It is always being compared against the source buffer (sbuf). My question is: when would the condition MPI_IN_PLACE == sbuf be true? As far as I understand, sbuf is the address of the source buffer, which every node has to transfer to the remaining nodes based on recursive doubling or, say, the Bruck algorithm, and it can never be equal to (void *)1. Any help is appreciated.
Regards
Neeraj
[OMPI users] Re :Re: what is MPI_IN_PLACE
Thanks George. But what is the need for the user to specify it? The API could check the addresses of the input and output buffers itself. Is there some extra advantage of MPI_IN_PLACE over automatically detecting it using the pointers?
-Neeraj

On Tue, 11 Dec 2007 06:10:06 -0500 Open MPI Users wrote:
Neeraj,
MPI_IN_PLACE is defined by the MPI standard in order to allow the users to specify that the input and output buffers for the collectives are the same. Moreover, not all collectives support MPI_IN_PLACE, and for those that support it some strict rules apply. Please read the collective section in the MPI standard to see all the restrictions.
Thanks,
george.

On Dec 11, 2007, at 5:56 AM, Neeraj Chourasia wrote:
> Hello everyone,
>
> While going through collective algorithms, I came across
> preprocessor directive MPI_IN_PLACE which is (void *)1. Its always
> being compared against source buffer(sbuf). My question is when
> MPI_IN_PLACE == sbuf condition would be true. As far as i
> understand, sbuf is the address of source buffer, which every node
> has to transfer to remaining nodes based on recursive doubling or
> say bruck algo. And it can never be equal to (void *)1. Any help is
> appreciated.
>
> Regards
> Neeraj
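To illustrate the semantics George describes, here is a small sketch (mine, not from the thread) of MPI_Allreduce with MPI_IN_PLACE: the caller's contribution already sits in the receive buffer, so there is no separate send buffer whose address the library could inspect -- the sentinel value is the only way to signal that intent.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, vals[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The input values already live in vals; the reduced result overwrites
       them in place, so no second buffer is needed. */
    MPI_Allreduce(MPI_IN_PLACE, vals, 4, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("vals[0] summed over all ranks = %d\n", vals[0]);

    MPI_Finalize();
    return 0;
}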
[OMPI users] orte in persistent mode
Dear All,
I am wondering whether ORTE can be run in persistent mode. This has already been raised on the mailing list (http://www.open-mpi.org/community/lists/users/2006/03/0939.php), where it was said that the problem was still there. I just want to know whether it has been fixed or is being fixed. The reason I am looking at this is that on large clusters mpirun takes a lot of time starting orted (via ssh) on the remote nodes; if ORTE were already running, we could hopefully save considerable time. Any comments are appreciated.
-Neeraj
[OMPI users] Openmpi with SGE
Hello everyone,
I am facing a problem when calling mpirun in a loop under SGE. My SGE version is SGE6.1AR_snapshot3. The script I am submitting via SGE is:

let i=0
while [ $i -lt 100 ]
do
    echo ""
    echo "Iteration :$i"
    /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
    let "i+=1"
    echo ""
done

The above script runs well for 15-20 iterations and then fails with the following message:

-------------------Error Message-------------------
error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task
[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
----------------------------------------------------

When I ssh to n101, there is no orted or qrsh_starter running. While checking its spool file, I came across the following message:

---------------Execd spool Error Message---------------
|execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)
--------------------------------------------------------

What could be the reason for this? While checking the mailing list I came across http://www.open-mpi.org/community/lists/users/2007/03/2771.php, but I don't think it is the same problem. Any help is appreciated.
Regards
Neeraj
[OMPI users] RDMA-CM
Hello everyone,
I downloaded the openmpi-1.3 version from the nightly tarballs to check the RDMA-CM support. I am able to compile and install it, but I don't know how to run it, as there is no documentation provided. Has anyone tried running it with Open MPI? My other question is: does Open MPI 1.3 have progress-threads support for IB? While compiling with that option it didn't give me any warnings or failures, unlike the openmpi-1.2.X series.
Regards
Neeraj
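For what it's worth, in the 1.3 series the openib connection manager was typically selected through an MCA parameter, so a run might look like the sketch below (treat the parameter name and value as an assumption to be verified with ompi_info --param btl openib against the actual build; the hostfile and executable are placeholders):

mpirun -np 2 -hostfile nodelist \
    --mca btl openib,self \
    --mca btl_openib_cpc_include rdmacm \
    ./a.out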
[OMPI users] Re :Re: Linpack Benchmark and File Descriptor Limits
Hello,
With openmpi-1.3 a new MCA feature is introduced, namely --mca routed binomial. This makes the out-of-band communication happen in a binomial-tree fashion, which reduces the total number of sockets opened and hence addresses the file-descriptor issue.
-Neeraj

On Thu, 18 Sep 2008 16:46:23 -0700 Open MPI Users wrote:
I'm just running it using mpirun from the command line. Thanks for the reply.

On Thu, Sep 18, 2008 at 4:35 PM, John Hearns wrote:
2008/9/18 Alex Wolfe:
Hello, I am trying to run the HPL benchmarking software on a new 1024-core cluster that we have set up. Unfortunately I'm hitting the "mca_oob_tcp_accept: accept() failed: Too many open files (24)" error known in version 1.2 of Open MPI. No matter what I set the file-descriptor limit for my account to, I am still limited to only 808 or so processes. Does anyone have any suggestions?

Are you running the Linpack via a batch system or just using mpirun from the command line? If via a batch system, look for FAQs on how to set the resource limits for that batch system.
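A sketch of the kind of invocation being suggested here (the hostfile name and process count are illustrative):

mpirun --mca routed binomial -np 1024 -hostfile nodes ./xhpl

With the binomial routed component, mpirun no longer needs a direct out-of-band connection to every daemon, which is why the number of file descriptors it has to hold open drops.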