Re: [OMPI users] Maximum message size for MPI_Send()/MPI_Recv() functions
Hi George,

> Allocating memory is one thing. Being able to use it is a completely different story. Once you allocate the 8 GB array, can you fill it with some random values? This will force the kernel to really give you the 8 GB of memory. If this segfaults, then that's the problem. If not ... then the problem comes from Open MPI, I guess.

Yes, I can fill the buffer entirely with dummy values to ensure that the memory allocated is actually used, so I don't think the problem is in the OS.

Cheers, Juan-Carlos.

> Thanks,
>   george.
>
> On Aug 2, 2007, at 6:59 PM, Juan Carlos Guzman wrote:
>
>> Jelena, George,
>>
>> Thanks for your replies.
>>
>>> It is possible that the problem is not in MPI - I've seen a similar problem on some of our workstations some time ago. Juan, are you sure you can allocate more than 2x 4 GB of data in a non-MPI program on your system?
>>
>> Yes, I did a small program that can allocate more than 8 GB of memory (using malloc()).
>>
>> Cheers, Juan-Carlos.
>>
>>> Thanks,
>>> Jelena
>>>
>>> On Wed, 1 Aug 2007, George Bosilca wrote:
>>>
>>>> Juan, I have to check to see what's wrong there. We build Open MPI with full support for data transfers up to sizeof(size_t) bytes, so your case should be covered. However, there are some known problems with the MPI interface for data larger than sizeof(int). As an example, the _count field in the MPI_Status structure will be truncated ...
>>>>
>>>> Thanks,
>>>>   george.
>>>>
>>>> On Jul 30, 2007, at 1:47 AM, Juan Carlos Guzman wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Does anyone know the maximum buffer size I can use in the MPI_Send() (MPI_Recv()) functions? I was doing some testing using two nodes on my cluster to measure the point-to-point MPI message rate depending on size. The test program exchanges MPI_FLOAT datatypes between two nodes. I was able to send up to 4 GB of data (500 million MPI_FLOATs) before the process crashed with a segmentation fault message. Is the maximum size of the message limited by sizeof(int) * sizeof(MPI datatype) in the MPI_Send()/MPI_Recv() functions?
>>>>>
>>>>> My cluster has Open MPI 1.2.3 installed. Each node has 2 x dual-core AMD Opteron and 12 GB RAM.
>>>>>
>>>>> Thanks in advance.
>>>>>   Juan-Carlos.
>>>
>>> --
>>> Jelena Pjesivac-Grbovic, Pjesa
>>> Graduate Research Assistant
>>> Innovative Computing Laboratory
>>> Computer Science Department, UTK
>>> Claxton Complex 350
>>> (865) 974 - 6722
>>> (865) 974 - 6321
>>> jpjes...@utk.edu
>>>
>>> "The only difference between a problem and a solution is that people understand the solution." -- Charles Kettering
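A rough, untested sketch of the kind of workaround the discussion above points at - touch the whole allocation so the kernel really commits it, then keep every individual MPI_Send()/MPI_Recv() count small enough to fit in an int - could look like the code below. The chunk size and the send_large()/recv_large() helpers are made-up names for illustration only, not anything Open MPI provides:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical helpers (not from the thread): move 'count' floats in pieces
   whose size always fits in the int 'count' argument of MPI_Send()/MPI_Recv(). */
#define CHUNK_FLOATS (256UL * 1024 * 1024)   /* 256M floats = 1 GB per call */

static void send_large(float *buf, size_t count, int dest, MPI_Comm comm)
{
    size_t sent = 0, n;
    while (sent < count) {
        n = count - sent;
        if (n > CHUNK_FLOATS) n = CHUNK_FLOATS;
        MPI_Send(buf + sent, (int)n, MPI_FLOAT, dest, 0, comm);
        sent += n;
    }
}

static void recv_large(float *buf, size_t count, int src, MPI_Comm comm)
{
    size_t recvd = 0, n;
    while (recvd < count) {
        n = count - recvd;
        if (n > CHUNK_FLOATS) n = CHUNK_FLOATS;
        MPI_Recv(buf + recvd, (int)n, MPI_FLOAT, src, 0, comm, MPI_STATUS_IGNORE);
        recvd += n;
    }
}

int main(int argc, char **argv)
{
    size_t count = 2UL * 1024 * 1024 * 1024;   /* 2G floats = 8 GB; assumes a 64-bit build */
    size_t i;
    int rank;
    float *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(count * sizeof(float));
    if (buf == NULL) {
        fprintf(stderr, "malloc of %lu bytes failed\n",
                (unsigned long)(count * sizeof(float)));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Touch every element so the kernel really commits the memory,
       as suggested above, before any MPI call sees the buffer. */
    for (i = 0; i < count; i++)
        buf[i] = (float)(i & 0xff);

    if (rank == 0)
        send_large(buf, count, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        recv_large(buf, count, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with two ranks on a 64-bit system, something along these lines keeps every count argument well below what an int can hold, so it should also sidestep the MPI_Status _count truncation George mentions for transfers larger than sizeof(int).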
Re: [OMPI users] torque and openmpi

I reran the configure script with the --with-tm flag this time. Thanks for the info. It was working before for clients with ssh properly configured (i.e. my account only). But now it is working without having to use ssh for all accounts (i.e. biologist and physicist users).

Sam Adams
General Dynamics Information Technology
Phone: 210.536.5945

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Friday, July 27, 2007 2:58 PM
To: Open MPI Users
Subject: Re: [OMPI users] torque and openmpi

On Jul 27, 2007, at 2:48 PM, Galen Shipman wrote:

> I set up ompi before I configured Torque. Do I need to recompile ompi with appropriate torque configure options to get better integration?

If libtorque wasn't present on the machine at configure time then yes, you need to run:

    ./configure --with-tm=

You don't *have* to do this, of course. If you've got it working with ssh, that's fine. But the integration with Torque can be better:

- you can disable ssh for non-root accounts (assuming no other services need rsh/ssh)
- users don't have to set up ssh keys to run MPI jobs (a small thing, but sometimes nice when the users aren't computer scientists)
- Torque knows about all processes on all nodes (not just the mother superior) and can therefore both track and kill them if necessary

Just my $0.02...

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] values of mca parameters whilst running program
Glenn,

If the error message is about "privileged" memory, i.e. locked or pinned memory, on Solaris you can increase the amount of available privileged memory by editing the /etc/project file on the nodes.

Amount available (a typical value is 900MB):

    % prctl -n project.max-device-locked-memory -i project default

Edit /etc/project. Default line of interest:

    default:3::::

Change to, for example 4GB:

    default:3::::project.max-device-locked-memory=(priv,4197152000,deny)

What to set ompi_free_list_max to? By default each connection will post 8 recvs, 7 sends, 32 rdma writes and possibly a few internal control messages. Since these are pulling from the same free list, I believe a sufficient value could be calculated as 50 * (np - 1). Memory will still be consumed, but this should lessen the amount of privileged memory required. Memory consumption is something Sun is actively investigating. What size job are you running?

Not sure if this is part of the issue, but another possibility: if the communication pattern of the MPI job is actually starving one connection out of memory, you could try setting "--mca mpi_preconnect_all 1" and "--mca btl_udapl_max_eager_rdma_peers X", where X is equal to np. This will establish a connection between all processes in the job as well as create a channel for short messages to use rdma functionality. By establishing this channel to all connections before the MPI job starts up, each peer connection will be guaranteed some amount of privileged memory over which it could potentially communicate. Of course, you do take the hit of wire-up time for all connections at MPI_Init.

-DON

Brian Barrett wrote:

> On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote:
>
>> Hopefully an easy question to answer... is it possible to get at the values of mca parameters whilst a program is running? What I had in mind was either an open-mpi function to call which would print the current values of mca parameters or a function to call for specific mca parameters. I don't want to interrupt the running of the application.
>>
>> Bit of background. I have a large F90 application running with OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network. We're seeing swap thrashing occurring on some of the nodes at times and, having searched the archives and read the FAQ, believe we may be seeing the problem described in:
>> http://www.open-mpi.org/community/lists/users/2007/01/2511.php
>> where the udapl free list is growing to a point where lockable memory runs out. Problem is, I have no feel for the kinds of numbers that "btl_udapl_free_list_max" might safely get up to? Hence the request to print mca parameter values whilst the program is running, to see if we can tie in high values of this parameter to when we're seeing swap thrashing.
>
> Good news, the answer is easy. Bad news is, it's not the one you want. btl_udapl_free_list_max is the *greatest* the list will ever be allowed to grow to, not its current size. So if you don't specify a value and use the default of -1, it will return -1 for the life of the application, regardless of how big those free lists actually get. If you specify value X, it'll return X for the life of the application, as well. There is not a good way for a user to find out the current size of a free list or the largest it got for the life of an application (currently those two will always be the same, but that's another story). Your best bet is to set the parameter to some value (say, 128 or 256) and see if that helps with the swapping.
>
> Brian
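To put rough numbers on the 50 * (np - 1) rule of thumb above (hypothetical job sizes, simply plugged into Don's formula): a 16-process job would give 50 * (16 - 1) = 750 free-list entries, and a 64-process job 50 * 63 = 3150.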
Re: [OMPI users] values of mca parameters whilst running program
Hi Don,

> If the error message is about "privileged" memory, i.e. locked or

We don't actually get an error message. What we see is the system gradually losing free memory whilst running batch jobs, until it gets to the point where it begins swapping like mad and performance plummets (this happens on all nodes). We are still investigating and I wouldn't want to bother this list until we have a clearer idea of what's going on. But oddly, when the job finishes, we don't seem to get all the memory back (but a reboot fixes it). We are running Fortran codes (not renowned for memory leaks) and haven't seen this problem before on other systems we use, nor did we experience it with Clustertools 6, only with CT7, which is why we currently suspect problems with the free list growing too large.

> pinned memory, on Solaris you can increase the amount of available privileged memory by editing the /etc/project file on the nodes.
>
> Amount available (a typical value is 900MB):
>
>     % prctl -n project.max-device-locked-memory -i project default

Apologies, I'm not familiar with projects in Solaris. If I run this command I get:

    # prctl -n project.max-device-locked-memory -i project default
    prctl: default: No controllable process found in task, project, or zone.

If I run it for one of the processes of the parallel job I get:

    # prctl -n project.max-device-locked-memory -i pid 6553
    process: 6553: ./tomcat
    NAME                              PRIVILEGE   VALUE   FLAG   ACTION   RECIPIENT
    project.max-device-locked-memory  privileged  217MB   -      deny

The nodes are X4100s, dual-CPU, dual-core Opterons with 3.5 GB RAM. Each node therefore runs 4 processes. All nodes are running Solaris 11/06 and are up to date with patches.

> Edit /etc/project. Default line of interest:
>
>     default:3::::
>
> Change to, for example 4GB:
>
>     default:3::::project.max-device-locked-memory=(priv,4197152000,deny)
>
> What to set ompi_free_list_max to? By default each connection will post 8 recvs, 7 sends, 32 rdma writes and possibly a few internal control messages. Since these are pulling from the same free list, I believe a sufficient value could be calculated as 50 * (np - 1). Memory will still be consumed, but this should lessen the amount of privileged memory required.

Thanks, I will give that a try. One question: is 'np' the no. of processes on each node or the total processes for the job?

> Memory consumption is something Sun is actively investigating. What size job are you running?

Each process has a SIZE of just under 800Mb (RES is typically about half, often less, never more).

> Not sure if this is part of the issue, but another possibility: if the communication pattern of the MPI job is actually starving one connection out of memory, you could try setting "--mca mpi_preconnect_all 1" and "--mca btl_udapl_max_eager_rdma_peers X", where X is equal to np. This will establish a connection between all processes in the job as well as create a channel for short messages to use rdma functionality. By establishing this channel to all connections before the MPI job starts up, each peer connection will be guaranteed some amount of privileged memory over which it could potentially communicate. Of course, you do take the hit of wire-up time for all connections at MPI_Init.

That's a useful tip and may apply in our case, as the code configuration giving us trouble writes a lot of data to process 0 for disk output.

Thanks,
   Glenn

> -DON
>
> Brian Barrett wrote:
>
>> On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote:
>>
>>> Hopefully an easy question to answer... is it possible to get at the values of mca parameters whilst a program is running?
>>> What I had in mind was either an open-mpi function to call which would print the current values of mca parameters or a function to call for specific mca parameters. I don't want to interrupt the running of the application.
>>>
>>> Bit of background. I have a large F90 application running with OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network. We're seeing swap thrashing occurring on some of the nodes at times and, having searched the archives and read the FAQ, believe we may be seeing the problem described in:
>>> http://www.open-mpi.org/community/lists/users/2007/01/2511.php
>>> where the udapl free list is growing to a point where lockable memory runs out. Problem is, I have no feel for the kinds of numbers that "btl_udapl_free_list_max" might safely get up to? Hence the request to print mca parameter values whilst the program is running, to see if we can tie in high values of this parameter to when we're seeing swap thrashing.
>>
>> Good news, the answer is easy. Bad news is, it's not the one you want. btl_udapl_free_list_max is the *greatest* the list will ever be allowed to grow to, not its current size. So if you don't specify a value and use the default of -1, it will return -1 for the life of the application, regardless of how big those free lists actually get. If you specify value X, it'll return X for the life of the application, as well.