Ralph, I'm able to use the generic module when the processes are on different machines.
What would the values of the environment variables be when two processes are on the same machine (hopefully talking over SHM)? I've played with combinations of nodelist and ppn, but no luck. I get errors like:

[c0301b10e1:03172] [[0,9999],1] -> [[0,0],0] (node: c0301b10e1) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
[c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file grpcomm_hier_module.c at line 303
[c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file base/grpcomm_base_modex.c at line 470
[c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file grpcomm_hier_module.c at line 484
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[c0301b10e1:3172] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

Maybe a related question is how to assign the TCP port range, and how is it used? When the processes are on different machines, I use the same range, and that's OK as long as the range is free. But when the processes are on the same node, what value should the range be for each process? My range is 10000-12000 (for both processes), and I see that the process with rank #0 listens on port 10001 while the process with rank #1 tries to establish a connection to port 10000.

Thanks so much!
p. still here... still trying... ;-)

On Tue, Jul 27, 2010 at 12:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Use what hostname returns - don't worry about IP addresses as we'll discover
> them.
>
> On Jul 26, 2010, at 10:45 PM, Philippe wrote:
>
>> Thanks a lot!
>>
>> now, for the ev "OMPI_MCA_orte_nodes", what do I put exactly? our
>> nodes have a short/long name (it's rhel 5.x, so the command hostname
>> returns the long name) and at least 2 IP addresses.
>>
>> p.
>>
>> On Tue, Jul 27, 2010 at 12:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Okay, fixed in r23499. Thanks again...
>>>
>>>
>>> On Jul 26, 2010, at 9:47 PM, Ralph Castain wrote:
>>>
>>>> Doh - yes it should! I'll fix it right now.
>>>>
>>>> Thanks!
>>>>
>>>> On Jul 26, 2010, at 9:28 PM, Philippe wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> i was able to test the generic module and it seems to be working.
>>>>>
>>>>> one question tho, the function orte_ess_generic_component_query in
>>>>> "orte/mca/ess/generic/ess_generic_component.c" calls getenv with the
>>>>> argument "OMPI_MCA_enc", which seems to cause the module to fail to
>>>>> load. shouldnt it be "OMPI_MCA_ess" ?
>>>>>
>>>>> .....
>>>>>
>>>>>    /* only pick us if directed to do so */
>>>>>    if (NULL != (pick = getenv("OMPI_MCA_env")) &&
>>>>>                 0 == strcmp(pick, "generic")) {
>>>>>        *priority = 1000;
>>>>>        *module = (mca_base_module_t *)&orte_ess_generic_module;
>>>>>
>>>>> ...
>>>>>
>>>>> p.
>>>>>
>>>>> On Thu, Jul 22, 2010 at 5:53 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Dev trunk looks okay right now - I think you'll be fine using it. My new
>>>>>> component -might- work with 1.5, but probably not with 1.4. I haven't
>>>>>> checked either of them.
>>>>>>
>>>>>> Anything at r23478 or above will have the new module. Let me know how it
>>>>>> works for you. I haven't tested it myself, but am pretty sure it should
>>>>>> work.
>>>>>>
>>>>>>
>>>>>> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> Thank you so much!!
>>>>>>>
>>>>>>> I'll give it a try and let you know.
>>>>>>>
>>>>>>> I know it's a tough question, but how stable is the dev trunk? Can I
>>>>>>> just grab the latest and run, or am I better off taking your changes
>>>>>>> and copy them back in a stable release? (if so, which one? 1.4? 1.5?)
>>>>>>>
>>>>>>> p.
>>>>>>>
>>>>>>> On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>> It was easier for me to just construct this module than to explain how
>>>>>>>> to do so :-)
>>>>>>>>
>>>>>>>> I will commit it this evening (couple of hours from now) as that is
>>>>>>>> our standard practice. You'll need to use the developer's trunk,
>>>>>>>> though, to use it.
>>>>>>>>
>>>>>>>> Here are the envars you'll need to provide:
>>>>>>>>
>>>>>>>> Each process needs to get the same following values:
>>>>>>>>
>>>>>>>> * OMPI_MCA_ess=generic
>>>>>>>> * OMPI_MCA_orte_num_procs=<number of MPI procs>
>>>>>>>> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames where MPI
>>>>>>>> procs reside>
>>>>>>>> * OMPI_MCA_orte_ppn=<number of procs/node>
>>>>>>>>
>>>>>>>> Note that I have assumed this last value is a constant for simplicity.
>>>>>>>> If that isn't the case, let me know - you could instead provide it as
>>>>>>>> a comma-separated list of values with an entry for each node.
>>>>>>>>
>>>>>>>> In addition, you need to provide the following value that will be
>>>>>>>> unique to each process:
>>>>>>>>
>>>>>>>> * OMPI_MCA_orte_rank=<MPI rank>
>>>>>>>>
>>>>>>>> Finally, you have to provide a range of static TCP ports for use by
>>>>>>>> the processes. Pick any range that you know will be available across
>>>>>>>> all the nodes. You then need to ensure that each process sees the
>>>>>>>> following envar:
>>>>>>>>
>>>>>>>> * OMPI_MCA_oob_tcp_static_ports=6000-6010 <== obviously, replace this
>>>>>>>> with your range
>>>>>>>>
>>>>>>>> You will need a port range that is at least equal to the ppn for the
>>>>>>>> job (each proc on a node will take one of the provided ports).
>>>>>>>>
>>>>>>>> That should do it. I compute everything else I need from those values.
>>>>>>>>
>>>>>>>> Does that work for you?
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
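
For anyone launching the processes from their own script, the envars Ralph lists in the quoted message can be put together per process roughly as follows. This is only a sketch: `build_generic_env` and `port_count` are made-up helper names (not part of Open MPI), the node names and port range are placeholders, and the only check it makes is the one stated in the thread, namely that the static port range must hold at least ppn ports. It assumes ppn is a single constant, and it says nothing about how the ports are actually handed out to processes on the same node.

```python
def port_count(port_range):
    """Number of ports in an inclusive "low-high" string such as "6000-6010"."""
    low, high = (int(part) for part in port_range.split("-"))
    if high < low:
        raise ValueError("port range is reversed: " + port_range)
    return high - low + 1


def build_generic_env(rank, num_procs, nodes, ppn, port_range):
    """Return the envars for one process (hypothetical helper, not Open MPI API).

    Every value is identical across processes except OMPI_MCA_orte_rank.
    """
    if not 0 <= rank < num_procs:
        raise ValueError("rank must be in [0, num_procs)")
    if port_count(port_range) < ppn:
        # Each proc on a node takes one of the provided ports.
        raise ValueError("static port range is smaller than ppn")
    return {
        "OMPI_MCA_ess": "generic",
        "OMPI_MCA_orte_num_procs": str(num_procs),
        "OMPI_MCA_orte_nodes": ",".join(nodes),
        "OMPI_MCA_orte_ppn": str(ppn),
        "OMPI_MCA_oob_tcp_static_ports": port_range,
        "OMPI_MCA_orte_rank": str(rank),
    }


if __name__ == "__main__":
    # Placeholder values: two nodes, one process per node.
    env = build_generic_env(0, 2, ["nodeA", "nodeB"], 1, "10000-12000")
    for name in sorted(env):
        print(name + "=" + env[name])
```

To start a rank, the returned dictionary can be merged into a copy of `os.environ` and passed as `env=` to `subprocess.Popen`.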