[OMPI users] problem when mpi_paffinity_alone is set to 1
Hello,

I am trying to run applications on a shared-memory machine. For the moment I am just running tests on point-to-point communications (a trivial token ring) and collective operations (from the SkaMPI test suite).

It runs smoothly if mpi_paffinity_alone is set to 0. With more than about 10 processes, collective communications just don't seem possible; point-to-point communications seem to be OK.

But when I specify --mca mpi_paffinity_alone 1 on my command line, I get the following error:

mbind: Invalid argument

I looked into the code of maffinity/libnuma and found that the error comes from

    numa_setlocal_memory(segments[i].mbs_start_addr,
                         segments[i].mbs_len);

in maffinity_libnuma_module.c.

The machine I am using is a Linux box running a 2.6.5-7 kernel.

Has anyone experienced a similar problem?

Camille
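For what it's worth, the failing call can be exercised outside of Open MPI with a few lines of C against libnuma. The following is only a minimal sketch of that idea, not the actual Open MPI code: the segment size and mmap flags are placeholders and do not match what the sm component really uses. libnuma itself prints the "mbind: ..." message when the underlying mbind(2) call fails.

    /* Hypothetical standalone exercise of numa_setlocal_memory(), the call
     * the error is reported from.  Build with:  gcc repro.c -lnuma */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA is not available on this system\n");
            return 1;
        }

        size_t len = 4 * 1024 * 1024;   /* placeholder segment size */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Ask libnuma to place these pages on the local NUMA node; on
         * failure libnuma's error handler prints "mbind: <reason>" itself. */
        numa_setlocal_memory(addr, len);

        munmap(addr, len);
        return 0;
    }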
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Hi Camille,

What OMPI version are you using? We just changed the paffinity module last night, but did nothing to maffinity. However, it is possible that the maffinity framework makes some calls into paffinity that need to adjust, so the version number would help a great deal in this case.

Thanks
Ralph
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Ralph,

I compiled a clean checkout from the trunk (r19392); the problem is still the same.

Camille
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Okay, I'll look into it. I suspect the problem is due to the redefinition of the paffinity API to clarify physical vs logical processors - more than likely, the maffinity interface suffers from the same problem we had to correct over there.

We'll report back later with an estimate of how quickly this can be fixed.

Thanks
Ralph
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
OK, thank you!

Camille
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
I believe I have found the problem, and it may indeed relate to the change in paffinity. By any chance, do you have unfilled sockets on that machine? Could you provide the output from something like "cat /proc/cpuinfo" (or the equivalent for your system) so we could see what physical processors and sockets are present?

If I'm correct as to the problem, here is the issue. OMPI has (until now) always assumed that the number of logical processors (or sockets, or cores) was the same as the number of physical processors (or sockets, or cores). As a result, several key subsystems were written without making any distinction as to which (logical vs physical) they were referring to. This was no problem until we recently encountered systems with "holes" - a processor turned "off", a socket left unpopulated, etc. In that case, the logical processor id no longer matches the physical processor id (ditto for sockets and cores). We adjusted the paffinity subsystem to deal with it - it took much more effort than we would have liked, and exposed lots of inconsistencies in how the base operating systems handle such situations.

Unfortunately, having gotten that straightened out, it is possible that you have uncovered a similar logical-vs-physical inconsistency in another subsystem. I have asked better eyes than mine to take a look at that now to confirm - if so, it could take us a little while to fix.

My request for info was aimed at helping us determine why your system is seeing this problem when our tests didn't. We have tested the revised paffinity both on completely filled systems and on at least one system with "holes", but differences in OS levels, processor types, etc. could have caused our tests to pass while your system fails. I'm particularly suspicious of the old kernel you are running and how our revised code will handle it.

For now, I would suggest you work with revisions lower than r19391 - could you please confirm that r19390 or earlier works?

Thanks
Ralph
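A quick illustration of the "holes" Ralph describes: on Linux the gaps show up directly in /proc/cpuinfo, where the Nth "processor" entry may not be numbered N. The sketch below is only that illustration - it is not the detection code Open MPI or PLPA actually uses.

    /* List the "processor :" ids from /proc/cpuinfo and report whether the
     * numbering is contiguous, i.e. whether logical index == kernel id. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) {
            perror("/proc/cpuinfo");
            return 1;
        }

        char line[256];
        int logical = 0, holes = 0;
        while (fgets(line, sizeof(line), f)) {
            int id;
            if (sscanf(line, "processor : %d", &id) == 1) {
                if (id != logical) {
                    printf("logical %d maps to processor id %d\n", logical, id);
                    holes = 1;
                }
                logical++;
            }
        }
        fclose(f);

        printf("%d processors, numbering %s\n",
               logical, holes ? "has holes" : "is contiguous");
        return 0;
    }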
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Back on Mon 1st Sept. If action is required before then, please contact Rob Giddings (Catia/VPM/HDMS issues). For Nastran/CAE technical S/W, Chris Catchpole can help. For Elecricad, Chris Toyne can help.
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Ralph,

How does OpenMPI pick up the map between physical vs. logical processors? Does OMPI look into "/sys/devices/system/node/node for the cpu topology?

Thanks,
Mi Yan
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Actually, I have tried with several versions, since you were working on the affinity thing. I tried revision 19103 a couple of weeks ago; the problem was already there.

Part of /proc/cpuinfo is below:

processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 900.00
itc MHz    : 900.00
BogoMIPS   : 1325.40
siblings   : 1

The machine is a 60-way Altix machine, so you get 60 copies of this information in /proc/cpuinfo (yes, 60, not 64).

Camille
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Short answer is: yes. Unfortunately, different systems store that info in different places. For Linux, we use the PLPA to help us discover the required info. Solaris, OSX, and Windows all have their own ways of providing it. The paffinity framework detects the type of system we are running on and "does the right thing" to get the info.

Where we simply cannot get it, we return an error and let you know that we cannot support processor affinity on this machine. You can still execute, of course - you just can't set mpi_paffinity_alone, since we can't meet that request on such a system.

Ralph
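On Linux kernels recent enough to export it, the kernel's view of the socket/core layout lives under sysfs. The sketch below only illustrates where that data sits; it is not the PLPA code path itself, and on kernels without topology support (such as the 2.6.5 machine in this thread) the files simply do not exist.

    /* Print the socket and core id the kernel exports for the first few
     * logical CPUs.  -1 means the kernel does not export that file. */
    #include <stdio.h>

    static int read_id(const char *path)
    {
        FILE *f = fopen(path, "r");
        int id = -1;
        if (f) {
            if (fscanf(f, "%d", &id) != 1)
                id = -1;
            fclose(f);
        }
        return id;
    }

    int main(void)
    {
        char path[128];
        for (int cpu = 0; cpu < 4; cpu++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                     cpu);
            int socket = read_id(path);

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            int core = read_id(path);

            printf("logical cpu %d -> socket %d, core %d\n", cpu, socket, core);
        }
        return 0;
    }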
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Thanks! Well, it -is- nice to know that we didn't -create- the problem with the paffinity change!

We'll have to think about this one a little to try and figure out why this is happening.

Ralph
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Camille --

Can you also send the output of "uname -a"?

Also, just to be absolutely sure, let's check that PLPA is doing the Right thing here (we don't think this is the problem, but it's worth checking). Grab the latest beta:

http://www.open-mpi.org/software/plpa/v1.2/

It's a very small package and easy to install under your $HOME (or whatever). Can you send the output of "plpa-info --topo"?
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
inria@behemoth:~$ uname -a
Linux behemoth 2.6.5-7.283-sn2 #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64 ia64 ia64 GNU/Linux

I am not sure the output of plpa-info --topo gives good news...

inria@behemoth:~$ plpa-info --topo
Kernel affinity support: yes
Kernel topology support: no
Number of processor sockets: unknown
Kernel topology not supported -- cannot show topology information

Camille
Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
Ah, this is a fairly old kernel -- it does not support the topology stuff. So in this case, logical and physical IDs should be the same.

Hmm. Need to think about that...
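For context on the distinction plpa-info is drawing above (kernel affinity support: yes, kernel topology support: no): the binding itself on Linux ultimately comes down to a sched_setaffinity() call with an OS processor id, which is why pinning can still work even when no topology information is available. The sketch below is just that generic call with a hypothetical target cpu number; it is not the Open MPI or PLPA code.

    /* Bind the calling process to one logical CPU.  With no kernel topology
     * support, this logical id is the only id there is. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        int cpu = 3;                 /* hypothetical target processor */
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* pid 0 == the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("bound to cpu %d\n", cpu);
        return 0;
    }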