Greetings Ralph,
Thank you so much.  I look forward to seeing the new OpenMPI 1.5.x with Xgrid 
support.  There was a point, around OSX 10.6.2 and 10.5.8, where OpenMPI 1.4.1 
and 1.2.9 both worked with only a minor glitch.

Since I have a 10.6.3 installation, I can retry there and see what results I 
get.  Also, if you have any coaching to offer on OpenMPI, I may be able to help 
make some headway on OMPI 1.5.  It seems to be a large project, so this may be 
a major undertaking.  I am hoping that your experience will win out for an 
Xgrid solution.

Again, thank you,
Daniel Beatty
Computer Scientist, Detonation Sciences Branch
Code 474300D
2400 E. Pilot Plant Rd. M/S 1109
China Lake, CA 93555
daniel.bea...@navy.mil
(760)939-7097  

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, July 29, 2010 11:30
To: Open MPI Users
Subject: Re: [OMPI users] Trouble running OpenMPI compiled for x86_64(either 
m32 or m64)

I'm afraid we were unable to support xgrid after the 1.2 series as no developer 
had access to an xgrid server. I recently received a complimentary copy of 
OSX-server from Apple, and I expect to restore xgrid support at some point in 
the 1.5 series.

It looks like you are hitting some issue with 1.2 relating to a change in xgrid 
between OSX versions. I personally won't be going back that far to deal with 
xgrid issues, so I would suggest sticking with 10.5 if xgrid support is 
required.

Alternatively, you can just use OMPI's rsh support to do the launch. Get an 
xgrid allocation (I don't know enough about xgrid yet to tell you all the 
details), create a hostfile with that info, and then mpirun -hostfile <file> 
-mca plm rsh ...  (assuming you use OMPI 1.4.x).
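
As a rough illustration of the hostfile-plus-rsh route described above, here is a minimal Python sketch. It assumes the xgrid allocation has already given you a list of node names; the host names, the file name xgrid_hosts.txt, the slot count, and the helper functions are all hypothetical placeholders. It only writes the hostfile and prints the mpirun command line, it does not launch anything.

```python
import shlex
from pathlib import Path


def write_hostfile(path, hosts, slots=1):
    """Write one 'hostname slots=N' line per allocated node."""
    lines = [f"{host} slots={slots}" for host in hosts]
    Path(path).write_text("\n".join(lines) + "\n")
    return path


def build_mpirun_command(hostfile, nprocs, program):
    """Build an mpirun argv that selects the rsh launcher.

    Uses the 'plm' framework name, as in the OMPI 1.4.x command above.
    """
    return [
        "mpirun",
        "-hostfile", str(hostfile),
        "-mca", "plm", "rsh",
        "-np", str(nprocs),
        program,
    ]


if __name__ == "__main__":
    # Hypothetical node names standing in for whatever the allocation returns.
    hosts = ["node01.example.com", "node02.example.com"]
    hostfile = write_hostfile("xgrid_hosts.txt", hosts, slots=2)
    cmd = build_mpirun_command(hostfile, nprocs=4, program="./hello")
    # Print the command so it can be inspected before running it by hand.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The rsh launcher needs password-less rsh or ssh access to every host listed in the file, so that is worth checking before trying the printed command.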



On Thu, Jul 29, 2010 at 12:20 PM, Beatty, Daniel D CIV NAVAIR, 474300D 
<daniel.bea...@navy.mil> wrote:


        Greetings all,
        I am running into some trouble using OpenMPI with OSX 10.6.4 in a 
Kerberized XGrid environment.  Note, I did not have this trouble before in the 
OSX 10.5.8 Kerberized XGrid environment.

        The pattern of this trouble is as follows:
        1.  The user submits an MPI job by entering "mpirun -np 4 hello", a
            simple hello world MPI example.
        2.  mpirun submits the job to XGrid.
        3.  A set of orted jobs gets distributed to the machines under the
            Kerberized user's name.
        4.  With OpenMPI 1.2.8, 1.2.3 compiled for gfortran, 1.2.8 compiled
            for gfortran, and the 1.2.9 that comes with OSX 10.6.4, it does
            actually spawn the processes on the machine.

        It comes back with the following exception:
        2010-07-29 10:25:49.063 mpirun[949:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x100130f30> finalize]: called when collecting not enabled'
        *** Call stack at first throw:
        (
               0   CoreFoundation                      0x00007fff811f2cc4 __exceptionPreprocess + 180
               1   libobjc.A.dylib                     0x00007fff851820f3 objc_exception_throw + 45
               2   CoreFoundation                      0x00007fff8120d9f1 -[NSObject(NSObject) finalize] + 129
               3   mca_pls_xgrid.so                    0x0000000100297ce3 -[PlsXGridClient dealloc] + 419
               4   mca_pls_xgrid.so                    0x0000000100297837 orte_pls_xgrid_finalize + 40
               5   libopen-rte.0.dylib                 0x000000010002d0f9 orte_pls_base_close + 249
               6   libopen-rte.0.dylib                 0x0000000100012027 orte_system_finalize + 119
               7   libopen-rte.0.dylib                 0x000000010000e968 orte_finalize + 40
               8   mpirun                              0x00000001000011ff orterun + 2042
               9   mpirun                              0x0000000100000a03 main + 27
               10  mpirun                              0x00000001000009e0 start + 52
               11  ???                                 0x0000000000000004 0x0 + 4
        )
        terminate called after throwing an instance of 'NSException'
        [bigmac:00949] *** Process received signal ***
        [bigmac:00949] Signal: Abort trap (6)
        [bigmac:00949] Signal code:  (0)
        [bigmac:00949] [ 0] 2   libSystem.B.dylib                   0x00007fff833e435a _sigtramp + 26
        [bigmac:00949] [ 1] 3   ???                                 0x00007fff5fbff500 0x0 + 140734799803648
        [bigmac:00949] [ 2] 4   libstdc++.6.dylib                   0x00007fff80e525d2 __tcf_0 + 0
        [bigmac:00949] [ 3] 5   libobjc.A.dylib                     0x00007fff85185d29 _objc_terminate + 100
        [bigmac:00949] [ 4] 6   libstdc++.6.dylib                   0x00007fff80e50ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
        [bigmac:00949] [ 5] 7   libstdc++.6.dylib                   0x00007fff80e50b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
        [bigmac:00949] [ 6] 8   libstdc++.6.dylib                   0x00007fff80e50bfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
        [bigmac:00949] [ 7] 9   libobjc.A.dylib                     0x00007fff85182192 object_getIvar + 0
        [bigmac:00949] [ 8] 10  CoreFoundation                      0x00007fff8120d9f1 -[NSObject(NSObject) finalize] + 129
        [bigmac:00949] [ 9] 11  mca_pls_xgrid.so                    0x0000000100297ce3 -[PlsXGridClient dealloc] + 419
        [bigmac:00949] [10] 12  mca_pls_xgrid.so                    0x0000000100297837 orte_pls_xgrid_finalize + 40
        [bigmac:00949] [11] 13  libopen-rte.0.dylib                 0x000000010002d0f9 orte_pls_base_close + 249
        [bigmac:00949] [12] 14  libopen-rte.0.dylib                 0x0000000100012027 orte_system_finalize + 119
        [bigmac:00949] [13] 15  libopen-rte.0.dylib                 0x000000010000e968 orte_finalize + 40
        [bigmac:00949] [14] 16  mpirun                              0x00000001000011ff orterun + 2042
        [bigmac:00949] [15] 17  mpirun                              0x0000000100000a03 main + 27
        [bigmac:00949] [16] 18  mpirun                              0x00000001000009e0 start + 52
        [bigmac:00949] [17] 19  ???                                 0x0000000000000004 0x0 + 4
        [bigmac:00949] *** End of error message ***
        Abort trap


        In the case of OpenMPI 1.4.2, I get even worse errors.

        I do not know whether this is an XGrid problem or an OMPI problem, but
        it is definitely producing trouble.

        Some have suggested having XGrid drive OpenMPI, but if
        XGRID_CONTROLLER_HOSTNAME is set, how will OpenMPI avoid trying to use
        XGrid as the launcher?

        Any ideas as to how to fix this?
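
One possible way to keep mpirun away from the xgrid launcher when XGrid itself is driving the job is sketched below in Python. It rests on two assumptions that should be checked against your installation: that the xgrid component keys off the XGRID_CONTROLLER_HOSTNAME and XGRID_CONTROLLER_PASSWORD environment variables, and that the launcher framework is named "pls" in the 1.2 series (as the mca_pls_xgrid.so frames in the trace suggest) and "plm" in 1.3 and later. The helper name launch_settings is hypothetical. A leading "^" in an MCA value is the usual way to exclude a component.

```python
import os

# Environment variables assumed to trigger the xgrid launcher.
XGRID_VARS = ("XGRID_CONTROLLER_HOSTNAME", "XGRID_CONTROLLER_PASSWORD")


def launch_settings(base_env, framework="plm"):
    """Return (env, mca_args) that steer mpirun away from the xgrid launcher.

    framework: "plm" for OMPI 1.3 and later, "pls" for the 1.2 series
    (an assumption based on the component names seen in this thread).
    """
    # Copy the environment without the xgrid controller settings.
    env = {k: v for k, v in base_env.items() if k not in XGRID_VARS}
    # '^xgrid' asks MCA to consider every launcher component except xgrid.
    mca_args = ["-mca", framework, "^xgrid"]
    return env, mca_args


if __name__ == "__main__":
    env, mca_args = launch_settings(dict(os.environ))
    # Show the command that would be run with the scrubbed environment.
    print(["mpirun", *mca_args, "-np", "4", "hello"])
```

Either measure on its own may be enough; the sketch applies both so the result does not depend on which one the installed version honors.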




        Daniel Beatty
        Computer Scientist, Detonation Sciences Branch
        Code 4743000
        2400 E. Pilot Plant Rd.
        China Lake, CA 93555-6107
        daniel.bea...@navy.mil
        (760) 939-7097


        _______________________________________________
        users mailing list
        us...@open-mpi.org
        http://www.open-mpi.org/mailman/listinfo.cgi/users


