On Wed, May 28, 2014 at 12:32:35AM +0200, Alain Miniussi wrote:
> Unfortunately, the debug library works like a charm (which makes the
> uninitialized variable issue more likely).
>
> Still, the stack trace points to mca_btl_openib_add_procs in
> ompi/mca/btl/openib/btl_openib.c and there is only one division in that
> function (although not floating point) at the end:
>
>     openib_btl->local_procs += local_procs;
>     openib_btl->device->mem_reg_max = calculate_max_reg () /
>         openib_btl->local_procs;
>
> now, I'm not sure how much I would trust the local_procs initialization:
>
>     for (i = 0, local_procs = 0 ; i < (int) nprocs; i++) {
>
> I suspect that a compiler could (wrongly) decide to pass the init of
> local_proc if procs = 0 or in a few other corner cases.

If that is the case the compiler has a bug. From C99 § 6.8.5:

"
6.8.5.3 The for statement

1 The statement

      for ( clause-1 ; expression-2 ; expression-3 ) statement

  behaves as follows: The expression expression-2 is the controlling
  expression that is evaluated before each execution of the loop body.
  The expression expression-3 is evaluated as a void expression after
  each execution of the loop body. If clause-1 is a declaration, the
  scope of any variables it declares is the remainder of the declaration
  and the entire loop, including the other two expressions; it is
  reached in the order of execution before the first evaluation of the
  controlling expression. If clause-1 is an expression, it is evaluated
  as a void expression before the first evaluation of the controlling
  expression.134)

2 Both clause-1 and expression-3 can be omitted. An omitted expression-2
  is replaced by a nonzero constant.

If clause-1 is an expression, it is evaluated as a void expression
before the first evaluation of the controlling expression.
"

That final bit says that clause-1 will always get evaluated.

> Anyway, applying the attached patch on btl_openib.c seems to fix the issue
> on my small case (but I have no exhaustive test suite to run).

Not sure what is going on here. Is openib_btl->local_procs zero when
openib_btl->device->mem_reg_max is calculated? I don't see how that can
happen, but if it is possible, the correct fix is to check whether
openib_btl->local_procs is zero before doing the division.

-Nathan
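
For illustration only, here is a minimal standalone sketch of the two
points discussed above. It is not the Open MPI code: count_local() and
per_proc_limit() are hypothetical helpers, and returning the undivided
limit when the count is zero is an assumption, not the project's actual
fix. The first function uses the same clause-1 initialization as the
quoted loop; the second guards the integer division.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts flagged entries with the same loop shape
 * as the quoted code. clause-1 (i = 0, local_procs = 0) is evaluated
 * before the first test of the controlling expression, so local_procs
 * is 0 even when nprocs == 0 and the body never runs. */
static int count_local(const int *is_local, size_t nprocs)
{
    int i, local_procs;

    assert(nprocs == 0 || is_local != NULL);

    for (i = 0, local_procs = 0; i < (int) nprocs; i++) {
        if (is_local[i]) {
            local_procs++;
        }
    }
    return local_procs;
}

/* Hypothetical helper: splits a registration limit across local procs.
 * Integer division by zero is undefined behavior in C, so the zero case
 * is handled first. Returning max_reg unchanged is an assumed policy. */
static unsigned long long per_proc_limit(unsigned long long max_reg,
                                         int total_local_procs)
{
    if (total_local_procs <= 0) {
        return max_reg;
    }
    return max_reg / (unsigned long long) total_local_procs;
}
```

With these definitions, count_local(NULL, 0) yields 0 rather than an
indeterminate value, and per_proc_limit(limit, 0) returns limit instead
of dividing by zero.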