I believe that yes, you have to compile with --enable-mpi-thread-multiple to
get anything other than MPI_THREAD_SINGLE.
Brian
On Tue, Jan 22, 2013 at 12:56 PM, Roland Schulz wrote:
> Hi,
>
> compiling 1.6.1 or 1.6.2 without --enable-mpi-thread-multiple returns
> MPI_THREAD_SINGLE as the provided level from MPI_Init_thread.
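An easy way to see this for yourself is to request MPI_THREAD_MULTIPLE and print whatever MPI_Init_thread hands back. A minimal sketch (file name and build/run lines are assumptions about your setup); on a 1.6.x build configured without --enable-mpi-thread-multiple the expectation from this thread is that it reports MPI_THREAD_SINGLE:

```c
/* thread_level.c - hypothetical example name.
 * Build (assumed): mpicc thread_level.c -o thread_level
 * Run   (assumed): mpirun -np 1 ./thread_level */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for the highest level; the library may provide less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    const char *name =
        provided == MPI_THREAD_SINGLE     ? "MPI_THREAD_SINGLE"     :
        provided == MPI_THREAD_FUNNELED   ? "MPI_THREAD_FUNNELED"   :
        provided == MPI_THREAD_SERIALIZED ? "MPI_THREAD_SERIALIZED" :
                                            "MPI_THREAD_MULTIPLE";
    printf("provided thread level: %s\n", name);

    MPI_Finalize();
    return 0;
}
```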
Based on the paper you linked, the answer is quite obvious. The
proposed CRFS mechanism supports all of the checkpoint-enabled MPI
implementations, so you just have to go with the one that provides
the services you care about.
George.
On Mon, Jan 28, 2013 at 3:46 PM, Maxime Boissonneault wrote:
Hi George,
The problem here is not the bandwidth, but the number of IOPS. I wrote
to the BLCR list, and they confirmed that:
"While ideally the checkpoint would be written in sizable chunks, the
current code in BLCR will issue a single write operation for each
contiguous range of user memory."
At the scale you describe you should have no trouble with the C/R if
the file system is correctly configured. We get more bandwidth per
node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
benchmarks on your cluster?
George.
PS: You can find some MPI I/O benchmarks at http://www.mcs.
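One way to quantify what the file system can sustain under checkpoint-like load is a short IOR run. A sketch only: the IOR binary, host file, and Lustre path below are assumptions about the setup:

```shell
# Each of 32 tasks writes and reads 16 MiB in 1 MiB transfers through
# the MPI-IO backend; -F (one file per process) mimics the per-rank
# files a checkpoint produces, stressing metadata as well as bandwidth.
mpirun -np 32 --hostfile hosts \
    ior -a MPIIO -w -r -b 16m -t 1m -F \
        -o /lustre/scratch/ior_testfile
```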
Have you run ibstat on every single node and made sure all links are up at
the correct speed?
Have you checked the output to make sure that you are not somehow running
over Ethernet?
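Running ibstat by hand on every node does not scale; a sweep with pdsh catches dead links, and forcing the openib BTL exposes a silent fallback to TCP. The hostname range and hostfile below are placeholders:

```shell
# Collect the IB port state and rate from all nodes; anything other
# than "State: Active" or an unexpected "Rate:" value needs a look.
pdsh -w 'node[01-32]' 'ibstat | grep -E "State:|Rate:"' | dshbak -c

# Force the openib BTL so mpirun fails loudly instead of quietly
# falling back to TCP over Ethernet when InfiniBand is unusable.
mpirun --mca btl openib,self -np 2 --hostfile hosts ./hello_mpi
```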
On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault wrote:
> On 2013-01-28 13:15, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault wrote:
>>> On 2013-01-28 12:46, Ralph Castain wrote:
>>>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault wrote:
Hello Ralph,
I agree that ideally, someone would implement checkpointing in the
application itself, but that is not always possible.
Also make sure that processes were not swapped out to a hard drive.
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Jan 27, 2013, at 6:39 AM, John Hearns <hear...@googlemail.com> wrote:
2 percent?
Have yo
Hello Ralph,
I agree that ideally, someone would implement checkpointing in the
application itself, but that is not always possible (commercial
applications, use of complicated libraries, algorithms with no clear
progression points at which you can interrupt the algorithm and start it
back from there).
Our c/r person has moved on to a different career path, so we may not have
anyone who can answer this question.
What we can say is that checkpointing at any significant scale will always be a
losing proposition. It just takes too long and hammers the file system. People
have been working on ext
Hello,
I am doing checkpointing tests (with BLCR) with an MPI application
compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite
strange.
First, some details about the tests :
- The only filesystems available on the nodes are 1) one tmpfs, and 2) one
Lustre shared filesystem (tested
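For context, the Open MPI 1.6 / BLCR checkpoint workflow being tested can be sketched as follows (the application name, process count, and the PID/snapshot placeholders are illustrative):

```shell
# Start the job with the checkpoint/restart framework enabled.
mpirun -np 16 -am ft-enable-cr ./my_mpi_app &

# Checkpoint the whole job by pointing ompi-checkpoint at the mpirun
# PID; -s reports progress while the snapshot is being taken.
ompi-checkpoint -s <mpirun_pid>

# Later, restart from the global snapshot it produced (written by
# default under $HOME as ompi_global_snapshot_<pid>.ckpt).
ompi-restart ompi_global_snapshot_<pid>.ckpt
```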
I obtained exactly the same error:
[NTU-2:24680] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
ess_hnp_module.c at line 194
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.