On 11/17/2010 10:48 AM, Ralph Castain wrote:
No problem at all. I confess that I am lost in all the sometimes disjointed emails in this thread. Frankly, now that I search, I can't find it either! :-(

I see one email that clearly shows the external binding report from mpirun, but not from any daemons. I see another email (after you asked if there was all the output) that states "yep", indicating that was all the output, and then proceeds to offer additional output that wasn't in the original email you asked about!

So I am now as thoroughly confused as you are...

That said, I am confident in the code in ORTE as it has worked correctly when I tested it against external bindings in other environments. So I really do believe this is an OGE issue where the orted isn't getting correctly bound against all allocated cores.

I am confused by your statement above because we don't even know what is being bound or not. We know that it looks like the HNP is bound to 2 cores, which is what we asked for, but we don't know what any of the processes themselves are bound to. So I personally cannot point to ORTE or OGE as the culprit, because I don't think we know whether there is an issue.
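
As a quick check on the two-core reading, here is a minimal sketch that decodes the "0028" from the HNP's "external process binding" line quoted further down. It assumes that value is a hexadecimal CPU mask in which bit N stands for core N; the thread itself does not spell out the encoding, and the helper name cores_from_mask is made up for illustration.

```python
def cores_from_mask(mask_hex):
    """Return the core indices whose bits are set in a hex CPU mask.

    Assumption (not confirmed in this thread): bit N of the mask
    corresponds to core N.
    """
    mask = int(mask_hex, 16)
    # Walk every bit position up to the highest set bit and keep the set ones.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(cores_from_mask("0028"))  # -> [3, 5]
```

Under that assumption the mask covers two cores, which matches the "linear:2" request. It says nothing about what the a.out processes themselves end up bound to.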

So, until we are able to get the -report-bindings output from the a.out code (note I did not say orted), it is kind of hard to claim there is even an issue. Which brings me back to the output question. After some thinking, the --report-bindings output I am expecting is not from the orted itself but from the a.out before it executes the user code. Which now makes me wonder whether there is some odd OGE/OMPI integration issue in which the -bind-to-core and -report-bindings options are not being propagated/recognized/honored when qsub is given the -binding option.

Perhaps if someone could run this test again with --report-bindings --leave-session-attached and provide -all- output we could verify that analysis and clear up the confusion?

Yeah, however I bet you we still won't see output.

--td


On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje <terry.don...@oracle.com> wrote:

    On 11/17/2010 10:00 AM, Ralph Castain wrote:
    --leave-session-attached is always required if you want to see
    output from the daemons. Otherwise, the launcher closes the ssh
    session (or qrsh session, in this case) as part of its normal
    operating procedure, thus terminating the stdout/err channel.


    I believe you, but isn't it weird that without the -binding option
    to qsub we saw -report-bindings output from the orteds?

    Do you have the date of the email that has the info you talked
    about below?  I really am not trying to be an a-hole about this,
    but there has been so much data and email flying around that it
    would be nice to actually see the output you mention.

    --td


    On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje
    <terry.don...@oracle.com> wrote:

        On 11/17/2010 09:32 AM, Ralph Castain wrote:
        Chris' output is coming solely from the HNP, which is correct
        given the way things were executed. My comment was from
        another email where he did what I asked, which was to
        include the flags:

        --report-bindings --leave-session-attached

        so we could see the output from each orted. In that email,
        it was clear that while mpirun was bound to multiple cores,
        the orteds are being bound to a -single- core.

        Hence the problem.

        Hmm, I see Ralph's comment on 11/15 but I don't see any
        output that shows what Ralph says above.  The only
        report-bindings output I see is when he runs without OGE
        binding.   Can someone give me the date and time of Chris'
        email with the --report-bindings and
        --leave-session-attached flags?  Or a rerun of the below with
        the --leave-session-attached option would also help.

        I find it confusing that --leave-session-attached is not
        required when the OGE binding argument is not given.

        --td

        HTH
        Ralph


        On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje
        <terry.don...@oracle.com> wrote:

            On 11/17/2010 07:41 AM, Chris Jewell wrote:
            On 17 Nov 2010, at 11:56, Terry Dontje wrote:
            You are absolutely correct, Terry, and the 1.4 release series does
            include the proper code. The point here, though, is that SGE binds
            the orted to a single core, even though other cores are also
            allocated. So the orted detects an external binding of one core,
            and binds all its children to that same core.

            I do not think you are right here.  Chris sent the following, which
            looks like OGE (fka SGE) actually did bind the HNP to multiple
            cores.  However, I believe that message is not coming from the
            processes themselves and is actually only shown by the HNP.  I
            wonder whether, if Chris adds a "-bind-to-core" option, we'll see
            more output from the a.outs before they exec unterm?
            As requested using

            $ qsub -pe mpi 8 -binding linear:2 myScript.com

            and

            'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'

            [exec5:06671] System has detected external process binding to cores 0028
            [exec5:06671] ras:gridengine: JOB_ID: 59434
            [exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
            [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
            [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
            [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
            [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
            [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
            [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1

            No more info.  I note that the external binding is slightly
            different to what I had before, but our cluster is busier today :-)

            I would have expected more output.

            --td

            Chris


            --
            Dr Chris Jewell
            Department of Statistics
            University of Warwick
            Coventry
            CV4 7AL
            UK
            Tel: +44 (0)24 7615 0778






            _______________________________________________
            users mailing list
            us...@open-mpi.org
            http://www.open-mpi.org/mailman/listinfo.cgi/users


-- Oracle
            Terry D. Dontje | Principal Software Engineer
            Developer Tools Engineering | +1.781.442.2631
            Oracle *- Performance Technologies*
            95 Network Drive, Burlington, MA 01803
            Email terry.don...@oracle.com



