Sorry I didn't get back to you right away: 1) I'm on the digest, 2) I'm
not very familiar with git, and 3) I just learned the hard way how to
update the build to work with the latest versions of automake, autoconf,
and libtool. :)

Anyway, I believe the patch is an improvement.  Looking at it, I can tell
you are now checking the first three characters of the version string.  I
know the plan is to go to 1.9 and then 2.0, but if the numbering ever went
more like the Linux kernel into, say, a 2.10.0 release, then a fixed
three-character comparison would be off (2.10.0 and 2.1.0 would both
compare as "2.1").  Also, doesn't the current ABI promise allow 1.7 to be
compatible with 1.8?
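
To illustrate the concern, here is a minimal sketch in C that compares
the numeric major and minor fields instead of a fixed number of
characters.  The function names are hypothetical and this is not the
code from the patch; it also assumes "compatible" means "same
major.minor series", which may not be the rule you want.

```c
#include <stdio.h>
#include <string.h>

/* Parse the leading "major.minor" of a version string such as "1.8.5rc2".
 * Returns 0 on success, -1 if the string does not start with "N.N". */
int parse_major_minor(const char *version, int *major, int *minor)
{
    return (sscanf(version, "%d.%d", major, minor) == 2) ? 0 : -1;
}

/* Hypothetical check: returns 1 if both versions are in the same
 * major.minor series, 0 otherwise (including unparseable input). */
int same_release_series(const char *a, const char *b)
{
    int a_major, a_minor, b_major, b_minor;

    if (parse_major_minor(a, &a_major, &a_minor) != 0 ||
        parse_major_minor(b, &b_major, &b_minor) != 0) {
        return 0; /* treat unparseable versions as incompatible */
    }
    return a_major == b_major && a_minor == b_minor;
}

/* Fixed-width comparison of the first three characters, for contrast. */
int same_first_three_chars(const char *a, const char *b)
{
    return strncmp(a, b, 3) == 0;
}
```

With these, "1.8.5rc2" and "1.8.3" match under both checks, while
"2.10.0" and "2.1.4" match only under the three-character check, which
is the case I was worried about.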

Personally, I'm fine with the solution, but I wanted to point out the
potential shortcoming(s) should an issue arise again.

One other thought: maybe this is a case where the code should emit a
warning (that could be suppressed with a command line parameter) when the
versions aren't identical?  Certainly, if the versions are outside the
"allowed" range (whatever you determine that to be), that should be an
error and a refused connection.  But rather than silently accepting mixed
versions (which you indicated has caused problems in the past), the code
could warn of a potential issue, and users could then consciously suppress
the warning if they are fine with it.  Food for thought.
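
A rough sketch of the policy I have in mind, in C.  Everything here is
hypothetical: the enum, the function name, and the suppress flag are
made up for illustration, and the "allowed" range is assumed to be
"same major.minor" only as a placeholder for whatever you decide.

```c
#include <stdio.h>
#include <string.h>

typedef enum { VERSION_ACCEPT, VERSION_WARN, VERSION_REJECT } version_action_t;

/* Decide what to do with a peer's version string.
 * suppress_warning stands in for a user-supplied command line option. */
version_action_t check_peer_version(const char *mine, const char *peer,
                                    int suppress_warning)
{
    int my_major, my_minor, peer_major, peer_minor;

    /* Identical versions: nothing to report. */
    if (strcmp(mine, peer) == 0) {
        return VERSION_ACCEPT;
    }

    /* Outside the assumed "allowed" range: refuse the connection. */
    if (sscanf(mine, "%d.%d", &my_major, &my_minor) != 2 ||
        sscanf(peer, "%d.%d", &peer_major, &peer_minor) != 2 ||
        my_major != peer_major || my_minor != peer_minor) {
        return VERSION_REJECT;
    }

    /* Mixed but allowed: accept silently only if the user asked for that. */
    if (suppress_warning) {
        return VERSION_ACCEPT;
    }
    fprintf(stderr, "warning: peer version %s differs from local version %s\n",
            peer, mine);
    return VERSION_WARN;
}
```

Under these assumptions, 1.8.3 talking to 1.8.5rc2 would connect with a
warning (or silently if suppressed), and 1.8.3 talking to 1.9.0 would be
refused.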

Unfortunately, the patch didn't actually solve my particular problem (yet,
anyway) because the vendor application statically linked 1.8.3 into their
executable.  (I honestly didn't realize it when I made my previous post).
So the code on their side of the connection is still rejecting the
connection:

[arwild1@hplcslsp2 ~]$ mpirun -n 6 -H localhost vendor_mpi_app
[hplcslsp2:23064] [[44148,1],0] tcp_peer_recv_connect_ack: received
different version from [[44148,0],0]: 1.8.5rc2 instead of 1.8.3
[hplcslsp2:23065] [[44148,1],1] tcp_peer_recv_connect_ack: received
different version from [[44148,0],0]: 1.8.5rc2 instead of 1.8.3
[hplcslsp2:23067] [[44148,1],2] tcp_peer_recv_connect_ack: received
different version from [[44148,0],0]: 1.8.5rc2 instead of 1.8.3
[hplcslsp2:23069] [[44148,1],3] tcp_peer_recv_connect_ack: received
different version from [[44148,0],0]: 1.8.5rc2 instead of 1.8.3
[hplcslsp2:23071] [[44148,1],4] tcp_peer_recv_connect_ack: received
different version from [[44148,0],0]: 1.8.5rc2 instead of 1.8.3
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------


However, I believe that if I can get the vendor to adopt this patch (or at
least link dynamically), it should help alleviate the need to stay in
lock-step, version for version.

Thank you,

-Alan
