1024 is not the problem: changing it to 2048 hasn't changed anything.
Following your advice I ran my process under gdb. Unfortunately I didn't get anything more than:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7e4c6c0 (LWP 20246)]
0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
(gdb) bt
#0  0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
#1  0xf7e3ba95 in connect_accept () from /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
#2  0xf7f62013 in PMPI_Comm_connect () from /home/gmaj/openmpi/lib/libmpi.so.0
#3  0x080489ed in main (argc=825832753, argv=0x34393638) at client.c:43

What's more: when I added a breakpoint on ompi_comm_set in the 66th process and stepped through a couple of instructions, one of the other processes crashed (as usual in ompi_comm_set) before the 66th did.

Finally I decided to recompile Open MPI with the -g flag for gcc. With that build the 66-process issue is gone! I ran my applications exactly the same way as before (without even recompiling them) and successfully started over 130 processes. When I switch back to the Open MPI build without -g, it segfaults again.

Any ideas? I'm really confused.

2010/7/7 Ralph Castain <r...@open-mpi.org>:
> I would guess the #files limit of 1024. However, if it behaves the same way when spread across multiple machines, I would suspect it is somewhere in your program itself. Given that the segfault is in your process, can you use gdb to look at the core file and see where and why it fails?
>
> On Jul 7, 2010, at 10:17 AM, Grzegorz Maj wrote:
>
>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>
>>> On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
>>>
>>>> Hi Ralph,
>>>> sorry for the late response, but I couldn't find free time to play with this. I've finally applied the patch you prepared. I launched my processes the way you described and I think it's working as you expected: none of my processes runs the orted daemon and they can perform MPI operations. Unfortunately I'm still hitting the 65-process issue :(
>>>> Maybe I'm doing something wrong.
>>>> I attach my source code. If anybody could have a look at it, I would be grateful.
>>>>
>>>> When I run that code with clients_count <= 65 everything works fine: all the processes create a common grid, exchange some information and disconnect.
>>>> When I set clients_count > 65, the 66th process crashes on MPI_Comm_connect (segmentation fault).
>>>
>>> I didn't have time to check the code, but my guess is that you are still hitting some kind of file descriptor or other limit. Check to see what your limits are - usually "ulimit" will tell you.
>>
>> My limits are:
>> time(seconds)        unlimited
>> file(blocks)         unlimited
>> data(kb)             unlimited
>> stack(kb)            10240
>> coredump(blocks)     0
>> memory(kb)           unlimited
>> locked memory(kb)    64
>> process              200704
>> nofiles              1024
>> vmemory(kb)          unlimited
>> locks                unlimited
>>
>> Which one do you think could be responsible for that?
>>
>> I tried running all 66 processes on one machine as well as spreading them across several machines, and it always crashes the same way on the 66th process.
>>
>>>
>>>> Another thing I would like to know is whether it's normal that any of my processes calling MPI_Comm_connect or MPI_Comm_accept eats up a full CPU while the other side is not ready.
>>>
>>> Yes - the waiting process is polling in a tight loop waiting for the connection to be made.
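[As an aside, not from the original thread: the "nofiles 1024" value quoted above is the per-process descriptor limit Ralph suspects. A process can also check that limit itself and raise the soft limit up to the hard cap before MPI_Init; a minimal POSIX sketch with a hypothetical helper name:]

/* Sketch only: inspect and raise RLIMIT_NOFILE before calling MPI_Init.
 * The soft limit can only be raised up to the hard limit; anything
 * beyond that needs "ulimit -n" / limits.conf from the administrator. */
#include <stdio.h>
#include <sys/resource.h>

static void bump_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return;
    }
    printf("fd limit: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;   /* raise the soft limit to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
}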
>>>
>>>> Any help would be appreciated,
>>>> Grzegorz Maj
>>>>
>>>> 2010/4/24 Ralph Castain <r...@open-mpi.org>:
>>>>> Actually, OMPI is distributed with a daemon that does pretty much what you want. Check out "man ompi-server". I originally wrote that code to support cross-application MPI publish/subscribe operations, but we can utilize it here too. You'll have to blame me for not making it more publicly known.
>>>>> The attached patch upgrades ompi-server and modifies the singleton startup to provide your desired support. The solution works in the following manner:
>>>>>
>>>>> 1. Launch "ompi-server -report-uri <filename>". This starts a persistent daemon called "ompi-server" that acts as a rendezvous point for independently started applications. The problem with starting different applications and wanting them to MPI connect/accept lies in the need to have the applications find each other. If they can't discover contact info for the other app, then they can't wire up their interconnects. The "ompi-server" tool provides that rendezvous point. I don't like that comm_accept segfaulted - it should have just errored out.
>>>>>
>>>>> 2. Set OMPI_MCA_orte_server=file:<filename> in the environment where you will start your processes. This will allow your singleton processes to find the ompi-server. I also automatically set the envar that connects the MPI publish/subscribe system for you.
>>>>>
>>>>> 3. Run your processes. Since they think they are singletons, they will detect the presence of the above envar and automatically connect themselves to the "ompi-server" daemon. This gives each process the ability to perform any MPI-2 operation.
>>>>>
>>>>> I tested this on my machines and it worked, so hopefully it will meet your needs. You only need to run one "ompi-server", period, as long as you locate it where all of the processes can find the contact file and can open a TCP socket to the daemon. There is a way to knit multiple ompi-servers into a broader network (e.g., to connect processes that cannot directly access a server due to network segmentation), but it's a tad tricky - let me know if you require it and I'll try to help.
>>>>> If you have trouble wiring them all into a single communicator, you might ask separately about that and see if one of our MPI experts can provide advice (I'm just the RTE grunt).
>>>>> HTH - let me know how this works for you and I'll incorporate it into future OMPI releases.
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Apr 24, 2010, at 1:49 AM, Krzysztof Zarzycki wrote:
>>>>>
>>>>> Hi Ralph,
>>>>> I'm Krzysztof and I'm working with Grzegorz Maj on this small project/experiment of ours.
>>>>> We would definitely like to give your patch a try. But could you please explain your solution a little more?
>>>>> Would you still start one mpirun per MPI grid, and then have the processes we start join the MPI communicator?
>>>>> That is a good solution, of course. But it would be especially preferable to have one daemon running persistently on our "entry" machine that can handle several MPI grid starts.
>>>>> Can your patch help us this way too?
>>>>> Thanks for your help!
>>>>> Krzysztof
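[A sketch of what step 3 above can look like from the application side - not part of the original thread; the service name "my_grid" and the two roles are invented for illustration. Both processes are started without mpirun, with OMPI_MCA_orte_server=file:<filename> exported in their environment, and find each other through the standard MPI-2 name service that ompi-server backs:]

/* Illustrative sketch only (not from the thread). */
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1) {                      /* "server" role: accept one client */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my_grid", MPI_INFO_NULL, port);   /* via ompi-server */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("my_grid", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {                             /* "client" role: look up and connect */
        MPI_Lookup_name("my_grid", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

[Run the "server" role with any extra argument and the "client" role with none; whether publish/lookup succeeds of course depends on the singletons actually reaching the ompi-server daemon.]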
>>>>>
>>>>> On 24 April 2010 03:51, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> In thinking about this, my proposed solution won't entirely fix the problem - you'll still wind up with all those daemons. I believe I can resolve that one as well, but it would require a patch.
>>>>>>
>>>>>> Would you like me to send you something you could try? It might take a couple of iterations to get it right...
>>>>>>
>>>>>> On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Hmmm....I -think- this will work, but I cannot guarantee it:
>>>>>>>
>>>>>>> 1. Launch one process (it can just be a spinner) using mpirun with the following option:
>>>>>>>
>>>>>>> mpirun -report-uri file
>>>>>>>
>>>>>>> where "file" is some filename that mpirun can create and write its contact info into. This can be a relative or absolute path. This process must remain alive throughout your application - it doesn't matter what it does. Its purpose is solely to keep mpirun alive.
>>>>>>>
>>>>>>> 2. Set OMPI_MCA_dpm_orte_server=FILE:file in your environment, where "file" is the filename given above. This will tell your processes how to find mpirun, which acts as a meeting place to handle the connect/accept operations.
>>>>>>>
>>>>>>> Now run your processes and have them connect/accept to each other.
>>>>>>>
>>>>>>> The reason I cannot guarantee this will work is that these processes will all have the same rank and name, since they all start as singletons. Hence, connect/accept is likely to fail.
>>>>>>>
>>>>>>> But it -might- work, so you might want to give it a try.
>>>>>>>
>>>>>>> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
>>>>>>>
>>>>>>>> To be more precise: by 'server process' I mean some process that I could run once on my system and that could help in creating those groups.
>>>>>>>> My typical scenario is:
>>>>>>>> 1. run N separate processes, each without mpirun
>>>>>>>> 2. connect them into an MPI group
>>>>>>>> 3. do some job
>>>>>>>> 4. exit all N processes
>>>>>>>> 5. goto 1
>>>>>>>>
>>>>>>>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>>>>>>>> Thank you Ralph for your explanation.
>>>>>>>>> Apart from that descriptor issue, is there any other way to solve my problem, i.e. to separately run a number of processes without mpirun and then collect them into an MPI intracomm group? If, for example, I needed to run some 'server process' (even using mpirun) for this task, that's OK. Any ideas?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Grzegorz Maj
>>>>>>>>>
>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>> Okay, but here is the problem. If you don't use mpirun, and are not operating in an environment we support for "direct" launch (i.e., starting processes outside of mpirun), then every one of those processes thinks it is a singleton - yes?
>>>>>>>>>>
>>>>>>>>>> What you may not realize is that each singleton immediately fork/exec's an orted daemon that is configured to behave just like mpirun. This is required in order to support MPI-2 operations such as MPI_Comm_spawn, MPI_Comm_connect/accept, etc.
>>>>>>>>>>
>>>>>>>>>> So if you launch 64 processes that think they are singletons, then you have 64 copies of orted running as well. This eats up a lot of file descriptors, which is probably why you are hitting this 65-process limit - your system is probably running out of file descriptors. You might check your system limits and see if you can get them revised upward.
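[For completeness, not from the thread: the "spinner" in step 1 of the mpirun workaround quoted above is just a trivial keep-alive program launched once under mpirun with the -report-uri option; the program name below is illustrative, and it does not even need to call MPI itself.]

/* Hypothetical spinner.c: does no useful work; it only keeps mpirun
 * alive so mpirun can act as the connect/accept rendezvous point
 * described in the workaround above. */
#include <unistd.h>

int main(void)
{
    for (;;)                 /* kill the mpirun job when the experiment is done */
        sleep(60);
}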
>>>>>>>>>>
>>>>>>>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, I know. The problem is that I need to use a special way of running my processes provided by the environment in which I'm working, and unfortunately I can't use mpirun.
>>>>>>>>>>>
>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>> Guess I don't understand why you can't use mpirun - all it does is start things, provide a means to forward io, etc. It mainly sits there quietly without using any cpu unless required to support the job.
>>>>>>>>>>>>
>>>>>>>>>>>> Sounds like it would solve your problem. Otherwise, I know of no way to get all these processes into comm_world.
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I'd like to dynamically create a group of processes communicating via MPI. Those processes need to be run without mpirun and create an intracommunicator after startup. Any ideas how to do this efficiently?
>>>>>>>>>>>>> I came up with a solution in which the processes connect one by one using MPI_Comm_connect, but unfortunately all the processes that are already in the group need to call MPI_Comm_accept. This means that when the n-th process wants to connect, I need to collect all the n-1 processes on the MPI_Comm_accept call. After I run about 40 processes, every subsequent call takes more and more time, which I'd like to avoid.
>>>>>>>>>>>>> Another problem with this solution is that when I try to connect the 66th process, the root of the existing group segfaults on MPI_Comm_accept. Maybe it's my bug, but it's weird, as everything works fine for at most 65 processes. Is there any limitation I don't know about?
>>>>>>>>>>>>> My last question is about MPI_COMM_WORLD. When I run my processes without mpirun, their MPI_COMM_WORLD is the same as MPI_COMM_SELF. Is there any way to change MPI_COMM_WORLD and set it to the intracommunicator that I've created?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Grzegorz Maj
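[For reference, not from the thread: the one-by-one scheme described above - each newcomer connects, and the resulting intercommunicator is merged into a growing intracommunicator - can be sketched roughly as follows. How the newcomer learns the port name (file, name service, ...) is deliberately left out, and the function names are illustrative.]

/* Illustrative sketch of the incremental join scheme described above.
 * "group" starts as MPI_COMM_SELF in the first process. */
#include <mpi.h>

/* Existing members: everyone in "group" collectively accepts the
 * newcomer, then merges it into the intracommunicator. */
static MPI_Comm accept_one(MPI_Comm group, const char *port_name)
{
    MPI_Comm inter, merged;
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, group, &inter);
    MPI_Intercomm_merge(inter, /* high = */ 0, &merged);
    MPI_Comm_disconnect(&inter);
    return merged;
}

/* Newcomer: connect as a singleton and merge into the group. */
static MPI_Comm join_group(const char *port_name)
{
    MPI_Comm inter, merged;
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, /* high = */ 1, &merged);
    MPI_Comm_disconnect(&inter);
    return merged;
}

[Note that standard MPI provides no way to replace MPI_COMM_WORLD itself; the merged intracommunicator has to be carried around explicitly and used in its place.]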
>>>>
>>>> <client.c><server.c>