Yes, that's fine. Thanks!

On Aug 24, 2010, at 9:02 AM, Philippe wrote:

> Awesome, I'll give it a spin! With the parameters as below?
> 
> p.
> 
> On Tue, Aug 24, 2010 at 10:47 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> I think I have this working now - try anything on or after r23647
>> 
>> 
>> On Aug 23, 2010, at 1:36 PM, Philippe wrote:
>> 
>>> Sure. I took a guess at ppn and nodes for the case where two processes
>>> are on the same node... I don't claim these are the right values ;-)
>>> 
>>> 
>>> 
>>> c0301b10e1 ~/mpi> env|grep OMPI
>>> OMPI_MCA_orte_nodes=c0301b10e1
>>> OMPI_MCA_orte_rank=0
>>> OMPI_MCA_orte_ppn=2
>>> OMPI_MCA_orte_num_procs=2
>>> OMPI_MCA_oob_tcp_static_ports_v6=10000-11000
>>> OMPI_MCA_ess=generic
>>> OMPI_MCA_orte_jobid=9999
>>> OMPI_MCA_oob_tcp_static_ports=10000-11000
>>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>>> [c0301b10e1:22827] [[0,9999],0] assigned port 10001
>>> [c0301b10e1:22827] [[0,9999],0] accepting connections via event library
>>> minsize=1 maxsize=1 delay=1.000000
>>> 
>>> <no more output after that>
>>> 
>>> 
>>> c0301b10e1 ~/mpi> env|grep OMPI
>>> OMPI_MCA_orte_nodes=c0301b10e1
>>> OMPI_MCA_orte_rank=1
>>> OMPI_MCA_orte_ppn=2
>>> OMPI_MCA_orte_num_procs=2
>>> OMPI_MCA_oob_tcp_static_ports_v6=10000-11000
>>> OMPI_MCA_ess=generic
>>> OMPI_MCA_orte_jobid=9999
>>> OMPI_MCA_oob_tcp_static_ports=10000-11000
>>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>>> [c0301b10e1:22830] [[0,9999],1] assigned port 10002
>>> [c0301b10e1:22830] [[0,9999],1] accepting connections via event library
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_send_nb: tag 15 size 189
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_try_connect: connecting port 10002 to: 10.4.72.110:10000
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_complete_connect: connection failed: Connection refused (111) - retrying
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_try_connect: connecting port 10002 to: 10.4.72.110:10000
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_complete_connect: connection failed: Connection refused (111) - retrying
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_try_connect: connecting port 10002 to: 10.4.72.110:10000
>>> [c0301b10e1:22830] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_complete_connect: connection failed: Connection refused (111) - retrying
>>> 
>>> <repeats..>
>>> 
>>> 
>>> Thanks!
>>> p.
>>> 
>>> 
>>> On Mon, Aug 23, 2010 at 3:24 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> Can you send me the values you are using for the relevant envars? That way 
>>>> I can try to replicate here
>>>> 
>>>> 
>>>> On Aug 23, 2010, at 1:15 PM, Philippe wrote:
>>>> 
>>>>> I took a look at the code, but I'm afraid I don't see anything wrong.
>>>>> 
>>>>> p.
>>>>> 
>>>>> On Thu, Aug 19, 2010 at 2:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Yes, that is correct - we reserve the first port in the range for a
>>>>>> daemon, should one exist. The problem is clearly that get_node_rank is
>>>>>> returning the wrong value for the second process (your rank=1). If you
>>>>>> want to dig deeper, look at the orte/mca/ess/generic code where it
>>>>>> generates the nidmap and pidmap. There is a bug down there somewhere
>>>>>> that gives the wrong answer when ppn > 1.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Aug 19, 2010 at 12:12 PM, Philippe <phil...@mytoaster.net> wrote:
>>>>>>> 
>>>>>>> Ralph,
>>>>>>> 
>>>>>>> Somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this code:
>>>>>>> 
>>>>>>>     orte_node_rank_t nrank;
>>>>>>>     /* do I know my node_local_rank yet? */
>>>>>>>     if (ORTE_NODE_RANK_INVALID != (nrank = orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
>>>>>>>         (nrank+1) < opal_argv_count(mca_oob_tcp_component.tcp4_static_ports)) {
>>>>>>>         /* any daemon takes the first entry, so we start with the second */
>>>>>>> 
>>>>>>> which seems consistent with process #0 listening on 10001. The question
>>>>>>> would be why process #1 attempts to connect to port 10000, then? Or
>>>>>>> maybe it's totally unrelated :-)
>>>>>>> 
>>>>>>> By the way, if I trick process #1 into opening the connection to 10001
>>>>>>> by shifting the range, I now get this error and the process terminates
>>>>>>> immediately:
>>>>>>> 
>>>>>>> [c0301b10e1:03919] [[0,9999],1]-[[0,0],0] mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[0,9999],0]
>>>>>>> 
>>>>>>> Good luck with the surgery, and wishing you a prompt recovery!
>>>>>>> 
>>>>>>> p.
>>>>>>> 
>>>>>>> On Thu, Aug 19, 2010 at 2:02 PM, Ralph Castain <r...@open-mpi.org> 
>>>>>>> wrote:
>>>>>>>> Something doesn't look right - here is what the algo attempts to do:
>>>>>>>> given a port range of 10000-12000, the lowest-ranked process on the
>>>>>>>> node should open port 10000. The next lowest rank on the node will
>>>>>>>> open 10001, etc.
>>>>>>>> 
>>>>>>>> So it looks to me like there is some confusion in the local rank algo.
>>>>>>>> I'll have to look at the generic module - must be a bug in it
>>>>>>>> somewhere. This might take a couple of days as I have surgery
>>>>>>>> tomorrow morning, so please forgive the delay.
>>>>>>>> 
>>>>>>>> On Thu, Aug 19, 2010 at 11:13 AM, Philippe <phil...@mytoaster.net>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Ralph,
>>>>>>>>> 
>>>>>>>>> I'm able to use the generic module when the processes are on different
>>>>>>>>> machines.
>>>>>>>>> 
>>>>>>>>> What would the values of the envars be when two processes are on the
>>>>>>>>> same machine (hopefully talking over SHM)?
>>>>>>>>> 
>>>>>>>>> I've played with combinations of nodelist and ppn, but no luck. I get
>>>>>>>>> errors like:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> [c0301b10e1:03172] [[0,9999],1] -> [[0,0],0] (node: c0301b10e1) oob-tcp: Number of attempts to create TCP connection has been exceeded.  Can not communicate with peer
>>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file grpcomm_hier_module.c at line 303
>>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file base/grpcomm_base_modex.c at line 470
>>>>>>>>> [c0301b10e1:03172] [[0,9999],1] ORTE_ERROR_LOG: Unreachable in file grpcomm_hier_module.c at line 484
>>>>>>>>> 
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>>>>>> environment problems.  This failure appears to be an internal failure;
>>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>>> Open MPI developer):
>>>>>>>>> 
>>>>>>>>>  orte_grpcomm_modex failed
>>>>>>>>>  --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>>>>>>> 
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>>>>>>>>> *** This is disallowed by the MPI standard.
>>>>>>>>> *** Your MPI job will now abort.
>>>>>>>>> [c0301b10e1:3172] Abort before MPI_INIT completed successfully; not
>>>>>>>>> able to guarantee that all other processes were killed!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Maybe a related question is how to assign the TCP port range and how
>>>>>>>>> it is used. When the processes are on different machines, I use the
>>>>>>>>> same range, and that's OK as long as the range is free. But when the
>>>>>>>>> processes are on the same node, what should the range be for each
>>>>>>>>> process? My range is 10000-12000 (for both processes), and I see that
>>>>>>>>> the process with rank #0 listens on port 10001 while the process with
>>>>>>>>> rank #1 tries to establish a connection to port 10000.
>>>>>>>>> 
>>>>>>>>> Thanks so much!
>>>>>>>>> p. still here... still trying... ;-)
>>>>>>>>> 
>>>>>>>>> On Tue, Jul 27, 2010 at 12:58 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>> Use what hostname returns - don't worry about IP addresses as we'll
>>>>>>>>>> discover them.
>>>>>>>>>> 
>>>>>>>>>> On Jul 26, 2010, at 10:45 PM, Philippe wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>> 
>>>>>>>>>>> Now, for the envar "OMPI_MCA_orte_nodes", what do I put exactly? Our
>>>>>>>>>>> nodes have both short and long names (it's RHEL 5.x, so the command
>>>>>>>>>>> hostname returns the long name) and at least two IP addresses.
>>>>>>>>>>> 
>>>>>>>>>>> p.
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Jul 27, 2010 at 12:06 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Okay, fixed in r23499. Thanks again...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 26, 2010, at 9:47 PM, Ralph Castain wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Doh - yes it should! I'll fix it right now.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 26, 2010, at 9:28 PM, Philippe wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I was able to test the generic module, and it seems to be working.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One question, though: the function orte_ess_generic_component_query
>>>>>>>>>>>>>> in "orte/mca/ess/generic/ess_generic_component.c" calls getenv with
>>>>>>>>>>>>>> the argument "OMPI_MCA_env", which seems to cause the module to
>>>>>>>>>>>>>> fail to load. Shouldn't it be "OMPI_MCA_ess"?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> .....
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   /* only pick us if directed to do so */
>>>>>>>>>>>>>>   if (NULL != (pick = getenv("OMPI_MCA_env")) &&
>>>>>>>>>>>>>>                0 == strcmp(pick, "generic")) {
>>>>>>>>>>>>>>       *priority = 1000;
>>>>>>>>>>>>>>       *module = (mca_base_module_t *)&orte_ess_generic_module;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> p.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Jul 22, 2010 at 5:53 PM, Ralph Castain 
>>>>>>>>>>>>>> <r...@open-mpi.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Dev trunk looks okay right now - I think you'll be fine using it.
>>>>>>>>>>>>>>> My new component -might- work with 1.5, but probably not with 1.4.
>>>>>>>>>>>>>>> I haven't checked either of them.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Anything at r23478 or above will have the new module. Let me know
>>>>>>>>>>>>>>> how it works for you. I haven't tested it myself, but am pretty
>>>>>>>>>>>>>>> sure it should work.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ralph,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thank you so much!!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'll give it a try and let you know.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I know it's a tough question, but how stable is the dev trunk?
>>>>>>>>>>>>>>>> Can I just grab the latest and run, or am I better off taking
>>>>>>>>>>>>>>>> your changes and copying them back into a stable release? (If
>>>>>>>>>>>>>>>> so, which one? 1.4? 1.5?)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> p.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain
>>>>>>>>>>>>>>>> <r...@open-mpi.org>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> It was easier for me to just construct this module than to
>>>>>>>>>>>>>>>>> explain how to do so :-)
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I will commit it this evening (a couple of hours from now), as
>>>>>>>>>>>>>>>>> that is our standard practice. You'll need to use the
>>>>>>>>>>>>>>>>> developer's trunk, though, to use it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Here are the envars you'll need to provide:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Each process needs to get the same following values:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> * OMPI_MCA_ess=generic
>>>>>>>>>>>>>>>>> * OMPI_MCA_orte_num_procs=<number of MPI procs>
>>>>>>>>>>>>>>>>> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames
>>>>>>>>>>>>>>>>> where MPI procs reside>
>>>>>>>>>>>>>>>>> * OMPI_MCA_orte_ppn=<number of procs/node>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Note that I have assumed this last value is a constant for
>>>>>>>>>>>>>>>>> simplicity. If that isn't the case, let me know - you could
>>>>>>>>>>>>>>>>> instead provide it as a comma-separated list of values with an
>>>>>>>>>>>>>>>>> entry for each node.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In addition, you need to provide the following value, which
>>>>>>>>>>>>>>>>> will be unique to each process:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> * OMPI_MCA_orte_rank=<MPI rank>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Finally, you have to provide a range of static TCP ports for
>>>>>>>>>>>>>>>>> use by the processes. Pick any range that you know will be
>>>>>>>>>>>>>>>>> available across all the nodes. You then need to ensure that
>>>>>>>>>>>>>>>>> each process sees the following envar:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> * OMPI_MCA_oob_tcp_static_ports=6000-6010  <== obviously,
>>>>>>>>>>>>>>>>> replace this with your range
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> You will need a port range that is at least equal to the ppn
>>>>>>>>>>>>>>>>> for the job (each proc on a node will take one of the provided
>>>>>>>>>>>>>>>>> ports).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> That should do it. I compute everything else I need from those
>>>>>>>>>>>>>>>>> values.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Does that work for you?
>>>>>>>>>>>>>>>>> Ralph
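Put together, the recipe above for two processes on a single node might look like the following sketch. The nodename `node01`, the port range, and the binary `./my_mpi_app` are placeholders, not values from the thread:

```shell
# Shared values -- identical for every process:
export OMPI_MCA_ess=generic
export OMPI_MCA_orte_num_procs=2
export OMPI_MCA_orte_nodes=node01   # comma-separated list for multi-node runs
export OMPI_MCA_orte_ppn=2
export OMPI_MCA_oob_tcp_static_ports=10000-11000

# Per-process value -- unique to each rank:
OMPI_MCA_orte_rank=0 ./my_mpi_app &
OMPI_MCA_orte_rank=1 ./my_mpi_app &
wait
```

As noted above, the port range must hold at least ppn usable ports on each node.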
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 

