Thanks Jeremiah; I filed the following ticket about this:
https://svn.open-mpi.org/trac/ompi/ticket/2723
On Feb 10, 2011, at 3:24 PM, Jeremiah Willcock wrote:
> I forgot to mention that this was tested with 3 or 4 ranks, connected via TCP.
>
> -- Jeremiah Willcock
>
> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>
>> Here is a small test case that hits the bug on 1.4.1:
>>
>> #include <mpi.h>
>>
>> int arr[1142];
>>
>> int main(int argc, char** argv) {
>>   int rank, my_size;
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   /* Deliberately inconsistent count: rank 1 passes 1142, everyone else 1088 */
>>   my_size = (rank == 1) ? 1142 : 1088;
>>   MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>>   MPI_Finalize();
>>   return 0;
>> }
>>
>> I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have
>> already been fixed.
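>>
>> In case it helps, here is a rough sketch of how I could surface the underlying
>> error code instead of aborting (this assumes #include <stdio.h> and replaces the
>> MPI_Bcast call in the test above; MPI_ERRORS_RETURN and MPI_Error_string are
>> standard MPI, but I haven't checked what string 1.4.1 actually produces here):
>>
>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN); /* don't abort on error */
>> int err = MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>> if (err != MPI_SUCCESS) {
>>   char msg[MPI_MAX_ERROR_STRING];
>>   int len;
>>   MPI_Error_string(err, msg, &len);  /* turn the code into a readable message */
>>   fprintf(stderr, "rank %d: MPI_Bcast failed: %s\n", rank, msg);
>> }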
>>
>> -- Jeremiah Willcock
>>
>>
>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>>
>>> FYI, I am having trouble finding a small test case that will trigger this
>>> on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have
>>> been fixed. What are the rules that determine which broadcast algorithm is
>>> used? It could be that only certain message sizes or only certain BTLs
>>> trigger the bug.
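>>> For what it's worth, being able to force the algorithm would make this easier to
>>> narrow down; if I have the parameter names right, the tuned collective component
>>> lets you pin the broadcast algorithm with something like (the algorithm number
>>> and binary name are just placeholders):
>>>
>>>   mpirun -np 4 --mca coll_tuned_use_dynamic_rules 1 \
>>>                --mca coll_tuned_bcast_algorithm 1 ./a.out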
>>> -- Jeremiah Willcock
>>> On Thu, 10 Feb 2011, Jeff Squyres wrote:
>>>> Nifty! Yes, I agree that that's a poor error message. It's probably
>>>> (unfortunately) being propagated up from the underlying point-to-point
>>>> system, where an ERR_IN_STATUS would actually make sense.
>>>> I'll file a ticket about this. Thanks for the heads up.
>>>> On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:
>>>>> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
>>>>>> I get the following Open MPI error from 1.4.1:
>>>>>> *** An error occurred in MPI_Bcast
>>>>>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>>>>> *** MPI_ERR_IN_STATUS: error code in status
>>>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>>>> (hostname and port removed from each line). Since MPI_Bcast does not return an
>>>>>> MPI_Status, I don't see how to find out what the underlying error actually is.
>>>>>> Is this something that people have seen before?
>>>>> For the record, this appears to be caused by specifying inconsistent data
>>>>> sizes on the different ranks in the broadcast operation. The error
>>>>> message could still be improved, though.
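>>>>>
>>>>> The usual fix, sketched here against the variables in the test case above, is to
>>>>> broadcast the count from the root first so that every rank passes the same value:
>>>>>
>>>>> int count = 0;
>>>>> if (rank == 0) count = 1142;                       /* only the root knows the size */
>>>>> MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* agree on the count first */
>>>>> MPI_Bcast(arr, count, MPI_INT, 0, MPI_COMM_WORLD); /* now the counts match */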
>>>>> -- Jeremiah Willcock
--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/