Hmmm…okay, sorry to keep drilling down here, but let’s try adding “-mca sec_base_verbose 100” now
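For reference, that makes the full command something like this (keeping the earlier oob flag is optional, but it helps correlate the two sets of output; the mpirun path is just your install from below):

  /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca oob_base_verbose 100 -mca sec_base_verbose 100 ./a.out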
> On Mar 25, 2015, at 8:51 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
>
> marksant@nid25257:~> /u/sciteam/marksant/openmpi/installation/bin/mpirun -mca oob_base_verbose 100 ./a.out
> [nid25257:09350] mca: base: components_register: registering oob components
> [nid25257:09350] mca: base: components_register: found loaded component usock
> [nid25257:09350] mca: base: components_register: component usock register function successful
> [nid25257:09350] mca: base: components_register: found loaded component alps
> [nid25257:09350] mca: base: components_register: component alps register function successful
> [nid25257:09350] mca: base: components_register: found loaded component ud
> [nid25257:09350] mca: base: components_register: component ud register function successful
> [nid25257:09350] mca: base: components_register: found loaded component tcp
> [nid25257:09350] mca: base: components_register: component tcp register function successful
> [nid25257:09350] mca: base: components_open: opening oob components
> [nid25257:09350] mca: base: components_open: found loaded component usock
> [nid25257:09350] mca: base: components_open: component usock open function successful
> [nid25257:09350] mca: base: components_open: found loaded component alps
> [nid25257:09350] mca: base: components_open: component alps open function successful
> [nid25257:09350] mca: base: components_open: found loaded component ud
> [nid25257:09350] mca: base: components_open: component ud open function successful
> [nid25257:09350] mca: base: components_open: found loaded component tcp
> [nid25257:09350] mca: base: components_open: component tcp open function successful
> [nid25257:09350] mca:oob:select: checking available component usock
> [nid25257:09350] mca:oob:select: Querying component [usock]
> [nid25257:09350] oob:usock: component_available called
> [nid25257:09350] [[8913,0],0] USOCK STARTUP
> [nid25257:09350] SUNPATH: /var/tmp/openmpi-sessions-45504@nid25257_0/8913/0/usock
> [nid25257:09350] [[8913,0],0] START USOCK LISTENING ON /var/tmp/openmpi-sessions-45504@nid25257_0/8913/0/usock
> [nid25257:09350] mca:oob:select: Adding component to end
> [nid25257:09350] mca:oob:select: checking available component alps
> [nid25257:09350] mca:oob:select: Querying component [alps]
> [nid25257:09350] mca:oob:select: Skipping component [alps] - no available interfaces
> [nid25257:09350] mca:oob:select: checking available component ud
> [nid25257:09350] mca:oob:select: Querying component [ud]
> [nid25257:09350] oob:ud: component_available called
> [nid25257:09350] [[8913,0],0] oob:ud:component_init no devices found
> [nid25257:09350] mca:oob:select: Skipping component [ud] - failed to startup
> [nid25257:09350] mca:oob:select: checking available component tcp
> [nid25257:09350] mca:oob:select: Querying component [tcp]
> [nid25257:09350] oob:tcp: component_available called
> [nid25257:09350] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [nid25257:09350] [[8913,0],0] oob:tcp:init rejecting loopback interface lo
> [nid25257:09350] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [nid25257:09350] [[8913,0],0] oob:tcp:init rejecting loopback interface lo
> [nid25257:09350] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
> [nid25257:09350] [[8913,0],0] oob:tcp:init adding 10.128.99.112 to our list of V4 connections
> [nid25257:09350] [[8913,0],0] TCP STARTUP
> [nid25257:09350] [[8913,0],0] attempting to bind to IPv4 port 0
> [nid25257:09350] [[8913,0],0] assigned IPv4 port 35917
> [nid25257:09350] mca:oob:select: Adding component to end
> [nid25257:09350] mca:oob:select: Found 2 active transports
> [nid25257:09350] [[8913,0],0] mca_oob_tcp_listen_thread: new connection: (16, 0) 10.128.69.144:46745
> [nid25257:09350] [[8913,0],0] connection_handler: working connection (16, 2) 10.128.69.144:46745
> [nid25257:09350] [[8913,0],0] accept_connection: 10.128.69.144:46745
> [nid25257:09350] [[8913,0],0]:tcp:recv:handler called
> [nid25257:09350] [[8913,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 16
> [nid25257:09350] [[8913,0],0] waiting for connect ack from UNKNOWN
> [nid25257:09350] [[8913,0],0] connect ack received from UNKNOWN
> [nid25257:09350] [[8913,0],0] connect-ack recvd from UNKNOWN
> [nid25257:09350] [[8913,0],0] mca_oob_tcp_recv_connect: connection from new peer
> [nid25257:09350] [[8913,0],0] connect-ack header from [[8913,0],2] is okay
> [nid25257:09350] [[8913,0],0] waiting for connect ack from [[8913,0],2]
> [nid25257:09350] [[8913,0],0] connect ack received from [[8913,0],2]
> [nid25257:09350] [[8913,0],0] connect-ack version from [[8913,0],2] matches ours
> [nid25257:09350] [[8913,0],0] ORTE_ERROR_LOG: Authentication failed in file ../../../../../orte/mca/oob/tcp/oob_tcp_connection.c at line 803
> [nid25257:09350] [[8913,0],0] mca_oob_tcp_listen_thread: new connection: (17, 11) 10.128.69.143:33434
> [nid25257:09350] [[8913,0],0] connection_handler: working connection (17, 0) 10.128.69.143:33434
> [nid25257:09350] [[8913,0],0] accept_connection: 10.128.69.143:33434
> [nid25257:09350] [[8913,0],0]:tcp:recv:handler called
> [nid25257:09350] [[8913,0],0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 17
> [nid25257:09350] [[8913,0],0] waiting for connect ack from UNKNOWN
> [nid25257:09350] [[8913,0],0] connect ack received from UNKNOWN
> [nid25257:09350] [[8913,0],0] connect-ack recvd from UNKNOWN
> [nid25257:09350] [[8913,0],0] mca_oob_tcp_recv_connect: connection from new peer
> [nid25257:09350] [[8913,0],0] connect-ack header from [[8913,0],1] is okay
> [nid25257:09350] [[8913,0],0] waiting for connect ack from [[8913,0],1]
> [nid25257:09350] [[8913,0],0] connect ack received from [[8913,0],1]
> [nid25257:09350] [[8913,0],0] connect-ack version from [[8913,0],1] matches ours
> [nid25257:09350] [[8913,0],0] ORTE_ERROR_LOG: Authentication failed in file ../../../../../orte/mca/oob/tcp/oob_tcp_connection.c at line 803
>
>
>> On 25 Mar 2015, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm…well, it will generate some output, so keep the system down to two nodes if you can, just to minimize the chatter. Add “-mca oob_base_verbose 100” to your cmd line
>>
>>> On Mar 25, 2015, at 8:45 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
>>>
>>> Hi Ralph,
>>>
>>> There is no OMPI in system space, and PATH and LD_LIBRARY_PATH look good.
>>> Any suggestion on how to get more relevant debugging info on the table?
>>>
>>> Thanks
>>>
>>> Mark
>>>
>>>
>>>> On 25 Mar 2015, at 16:33 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hey Mark
>>>>
>>>> Your original error flag indicates that you are picking up a connection from some proc built against a different OMPI installation. It’s a very low-level check that looks for matching version numbers. Not sure who is trying to connect, but that is the problem.
>>>>
>>>> Check your LD_LIBRARY_PATH
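(A quick way to double-check that LD_LIBRARY_PATH point: confirm which mpirun is found first and which MPI libraries the binary actually resolves. These are standard Linux tools, nothing OMPI-specific, and the ldd check assumes a dynamically linked a.out:

  which mpirun
  echo $LD_LIBRARY_PATH
  ldd ./a.out | grep -i mpi

All of these should point into the same installation prefix, /u/sciteam/marksant/openmpi/installation in this case.)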
>>>>
>>>>> On Mar 25, 2015, at 7:46 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>>>>>
>>>>> turn off the disable getpwuid.
>>>>>
>>>>> On Mar 25, 2015 8:14 AM, "Mark Santcroos" <mark.santcr...@rutgers.edu> wrote:
>>>>> Hi Howard,
>>>>>
>>>>>> On 25 Mar 2015, at 14:58 , Howard Pritchard <hpprit...@gmail.com> wrote:
>>>>>> How are you building ompi?
>>>>>
>>>>> My configure is rather straightforward:
>>>>> ./configure --prefix=$OMPI_PREFIX --disable-getpwuid
>>>>>
>>>>> Maybe I got spoiled on Hopper/Edison and I need more explicit configuration on BW ...
>>>>>
>>>>>> Also, what happens if you use aprun?
>>>>>
>>>>> Not sure if you meant in combination with mpirun or not, so I'll provide both:
>>>>>
>>>>>> aprun -n2 ./a.out
>>>>> Hello from rank 1, thread 0, on nid16869. (core affinity = 0)
>>>>> Hello from rank 0, thread 0, on nid16868. (core affinity = 0)
>>>>> After sleep from rank 1, thread 0, on nid16869. (core affinity = 0)
>>>>> After sleep from rank 0, thread 0, on nid16868. (core affinity = 0)
>>>>> Application 23791589 resources: utime ~0s, stime ~2s, Rss ~27304, inblocks ~13229, outblocks ~66
>>>>>
>>>>>> aprun -n2 mpirun ./a.out
>>>>> apstat: error opening /ufs/alps_shared/reservations: No such file or directory
>>>>> apstat: error opening /ufs/alps_shared/reservations: No such file or directory
>>>>> [nid16868:17876] [[699,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 159
>>>>> [nid16868:17876] [[699,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 85
>>>>> [nid16868:17876] [[699,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 190
>>>>> [nid16869:17034] [[9344,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 159
>>>>> [nid16869:17034] [[9344,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../../orte/mca/ras/tm/ras_tm_module.c at line 85
>>>>> [nid16869:17034] [[9344,0],0] ORTE_ERROR_LOG: File open failure in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 190
>>>>> Application 23791590 exit codes: 1
>>>>> Application 23791590 resources: utime ~0s, stime ~2s, Rss ~27304, inblocks ~9596, outblocks ~478
>>>>>
>>>>>> I work with ompi on the nersc edison and hopper daily.
>>>>>
>>>>> I use Edison and Hopper too, and there it works for me indeed.
>>>>>
>>>>>> typically i use aprun though.
>>>>>
>>>>> I want to use orte-submit and friends, so I "explicitly" don't want to use aprun.
>>>>>
>>>>>> you definitely don't need to use ccm, and shouldn't.
>>>>>
>>>>> Depends on the use-case, but happy to leave that out of scope for now :-)
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>>
>>>>>> On Mar 25, 2015 6:00 AM, "Mark Santcroos" <mark.santcr...@rutgers.edu> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Any users of Open MPI on Blue Waters here?
>>>>>> And then I specifically mean in "native" mode, not inside CCM.
>>>>>>
>>>>>> After configuring and building as I do on other Crays, mpirun gives me the following:
>>>>>> [nid25263:31700] [[23896,0],0] ORTE_ERROR_LOG: Authentication failed in file ../../../../../orte/mca/oob/tcp/oob_tcp_connection.c at line 803
>>>>>> [nid25263:31700] [[23896,0],0] ORTE_ERROR_LOG: Authentication failed in file ../../../../../orte/mca/oob/tcp/oob_tcp_connection.c at line 803
>>>>>>
>>>>>> Version is the latest and greatest from git.
>>>>>>
>>>>>> So I'm interested to hear whether people have been successful on Blue Waters and/or whether the error rings a bell for people.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Mark
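(Re Howard's "turn off the disable getpwuid" above: presumably that just means rebuilding without that flag, roughly as follows, with $OMPI_PREFIX as in Mark's configure line:

  ./configure --prefix=$OMPI_PREFIX
  make && make install

Whether that relates to the authentication failure is a separate question; it simply removes one variable from the build.)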
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/03/26515.php