Re: [OMPI users] Error when attempting to run LAMMPS on Centos 6.2 with OpenMPI

2013-01-28 Thread #YEO JINGJIE#
I obtained exactly the same error:

[NTU-2:24680] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 194
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:24680] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 128
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[NTU-2:24680] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c 
at line 616

This does seem incredibly perplexing. I will attempt a proper 
(non-packaged) installation for my cluster once more and determine whether it 
works. Thank you so much for all the help!

Regards,
Jingjie Yeo
Ph.D. Student
School of Mechanical and Aerospace Engineering
Nanyang Technological University, Singapore


From: Ralph Castain [rhc.open...@gmail.com] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Monday, 28 January, 2013 12:24:23 AM
To: #YEO JINGJIE#; Open MPI Users
Subject: Re: [OMPI users] Error when attempting to run LAMMPS on Centos 6.2 
with OpenMPI

On Jan 26, 2013, at 11:18 PM, #YEO JINGJIE#  wrote:

> So I should run the job as:
>
> /usr/lib64/openmpi/bin/mpirun -mca mca_component_show_load_errors 1 -n 16 
> /opt/lammps-21Jan13/lmp_linux < zigzag.in
>
> Is that correct?

Yes, thanks - though for our purposes, why don't you simplify it to:


/usr/lib64/openmpi/bin/mpirun -mca mca_component_show_load_errors 1 -n 1 
hostname


>
> Regards,
> Jingjie Yeo
> Ph.D. Student
> School of Mechanical and Aerospace Engineering
> Nanyang Technological University, Singapore
>
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
> Ralph Castain [r...@open-mpi.org]
> Sent: Sunday, 27 January, 2013 11:58:51 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error when attempting to run LAMMPS on Centos 6.2   
>   with OpenMPI
>
> One thing you might try: add "-mca mca_component_show_load_errors 1" to your 
> mpirun cmd line. This will tell us if the libraries have some missing 
> dependencies.
>
> It's the main reason I dislike installing from a package - the package 
> assumes that your system is configured identically to that of the one used to 
> generate the package. This is rarely the case - much easier to just download 
> an OMPI tarball, configure and compile it yourself.
>
>
> On Jan 26, 2013, at 7:32 PM, #YEO JINGJIE#  wrote:
>
>> Hi Jeff,
>>
>> Sorry, the original error info was lost along the way. I'm terribly new to 
>> Linux, and I am trying to compile OMPI and run a program, LAMMPS, using the 
>> command:
>>
>> /usr/lib64/openmpi/bin/mpirun -n 16 /opt/lammps-21Jan13/lmp_linux < zigzag.in
>>
>> And I received the errors:
>>
>> [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at 
>> line 194
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> orte_plm_base_select failed
>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --
>> [NTU-2:24127] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
>> runtime/orte_init.c at line 128
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This fail

[OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Maxime Boissonneault

Hello,
I am doing checkpointing tests (with BLCR) with an MPI application 
compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite 
strange.


First, some details about the tests :
- The only filesystems available on the nodes are 1) one tmpfs, 2) one 
lustre shared filesystem (tested to be able to provide ~15GB/s for 
writing and to support ~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 
2 nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoints with ompi-checkpoint and restarting with 
ompi-restart.

- I was starting with mpirun -am ft-enable-cr
- The nodes are monitored by ganglia, which allows me to see the number 
of IOPs and the read/write speed on the filesystem.


I tried a few different mca settings, but I consistently observed that :
- The checkpoints lasted ~4-5 minutes each time
- During checkpoint, each node (8 ranks) was doing ~500 IOPs, and 
writing at ~15MB/s.


I am worried by the number of IOPs and the very slow writing speed. This 
was a very small test. We have jobs running with 128 or 256 MPI ranks, 
each using 1-2 GB of ram per rank. With such jobs, the overall number of 
IOPs would reach tens of thousands and would completely overload our 
lustre filesystem. Moreover, with 15MB/s per node, the checkpointing 
process would take hours.


How can I improve on that ? Is there an MCA setting that I am missing ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Ralph Castain
Our c/r person has moved on to a different career path, so we may not have 
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a 
losing proposition. It just takes too long and hammers the file system. People 
have been working on extending the capability with things like "burst buffers" 
(basically putting an SSD in front of the file system to absorb the checkpoint 
surge), but that hasn't become very common yet.

Frankly, what people have found to be the "best" solution is for your app to 
periodically write out its intermediate results, and then take a flag that 
indicates "read prior results" when it starts. This minimizes the amount of 
data being written to the disk. If done correctly, you would only lose whatever 
work was done since the last intermediate result was written - which is about 
equivalent to losing whatever work was done since the last checkpoint.
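
For illustration only, the pattern might look something like the sketch below (the 
file layout, the "-restart" flag and the dump interval are invented for the example - 
it is not anything Open MPI or BLCR provides):

/* Sketch of application-level checkpoint/restart as described above.
 * File names, the "-restart" flag and the data layout are illustrative
 * assumptions, not anything provided by Open MPI or BLCR. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1000000          /* doubles of state per rank (example size) */
#define DUMP_INTERVAL 100  /* iterations between result dumps */

int main(int argc, char **argv)
{
    int rank, start = 0;
    double *state = malloc(N * sizeof(double));
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(fname, sizeof(fname), "results.%04d.dat", rank);

    /* the "read prior results" flag: resume from the last dump if asked */
    if (argc > 1 && strcmp(argv[1], "-restart") == 0) {
        FILE *f = fopen(fname, "rb");
        if (f) {
            fread(&start, sizeof(int), 1, f);
            fread(state, sizeof(double), N, f);
            fclose(f);
        }
    }
    if (start == 0)
        memset(state, 0, N * sizeof(double));

    for (int it = start; it < 10000; it++) {
        /* ... one iteration of the real computation on state[] ... */

        if ((it + 1) % DUMP_INTERVAL == 0) {
            /* one large sequential write per rank, not many small ones */
            FILE *f = fopen(fname, "wb");
            if (f) {
                int next = it + 1;
                fwrite(&next, sizeof(int), 1, f);
                fwrite(state, sizeof(double), N, f);
                fclose(f);
            }
            /* make sure every rank has dumped before anyone moves on */
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    free(state);
    MPI_Finalize();
    return 0;
}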

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
 wrote:

> Hello,
> I am doing checkpointing tests (with BLCR) with an MPI application compiled 
> with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.
> 
> First, some details about the tests :
> - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre 
> shared filesystem (tested to be able to provide ~15GB/s for writing and 
> support ~40k IOPs).
> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
> nodes). Each MPI rank was using approximately 200MB of memory.
> - I was doing checkpoints with ompi-checkpoint and restarting with 
> ompi-restart.
> - I was starting with mpirun -am ft-enable-cr
> - The nodes are monitored by ganglia, which allows me to see the number of 
> IOPs and the read/write speed on the filesystem.
> 
> I tried a few different mca settings, but I consistently observed that :
> - The checkpoints lasted ~4-5 minutes each time
> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at 
> ~15MB/s.
> 
> I am worried by the number of IOPs and the very slow writing speed. This was 
> a very small test. We have jobs running with 128 or 256 MPI ranks, each using 
> 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would 
> reach tens of thousands and would completely overload our lustre filesystem. 
> Moreover, with 15MB/s per node, the checkpointing process would take hours.
> 
> How can I improve on that ? Is there an MCA setting that I am missing ?
> 
> Thanks,
> 
> -- 
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 




Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Maxime Boissonneault

Hello Ralph,
I agree that ideally, someone would implement checkpointing in the 
application itself, but that is not always possible (commercial 
applications, use of complicated libraries, algorithms with no clear 
progression points at which you can interrupt the algorithm and start it 
back from there).


There certainly must be a better way to write the information down on 
disc though. Doing 500 IOPs seems to be completely broken. Why isn't 
there buffering involved ?


Thanks,

Maxime


On 2013-01-28 10:58, Ralph Castain wrote:

Our c/r person has moved on to a different career path, so we may not have 
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a losing 
proposition. It just takes too long and hammers the file system. People have been working 
on extending the capability with things like "burst buffers" (basically putting 
an SSD in front of the file system to absorb the checkpoint surge), but that hasn't 
become very common yet.

Frankly, what people have found to be the "best" solution is for your app to periodically 
write out its intermediate results, and then take a flag that indicates "read prior 
results" when it starts. This minimizes the amount of data being written to the disk. If done 
correctly, you would only lose whatever work was done since the last intermediate result was 
written - which is about equivalent to losing whatever work was done since the last checkpoint.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
 wrote:


Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled 
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

First, some details about the tests :
- The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre 
shared filesystem (tested to be able to provide ~15GB/s for writing and support 
~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart.
- I was starting with mpirun -am ft-enable-cr
- The nodes are monitored by ganglia, which allows me to see the number of IOPs 
and the read/write speed on the filesystem.

I tried a few different mca settings, but I consistently observed that :
- The checkpoints lasted ~4-5 minutes each time
- During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at 
~15MB/s.

I am worried by the number of IOPs and the very slow writing speed. This was a 
very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 
GB of ram per rank. With such jobs, the overall number of IOPs would reach tens 
of thousands and would completely overload our lustre filesystem. Moreover, 
with 15MB/s per node, the checkpointing process would take hours.

How can I improve on that ? Is there an MCA setting that I am missing ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
 wrote:

> Hello Ralph,
> I agree that ideally, someone would implement checkpointing in the 
> application itself, but that is not always possible (commercial applications, 
> use of complicated libraries, algorithms with no clear progression points at 
> which you can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?

> 
> There certainly must be a better way to write the information down on disc 
> though. Doing 500 IOPs seems to be completely broken. Why isn't there 
> buffering involved ?

I don't know - that's all done in BLCR, I believe. Either way, it isn't 
something we can address due to the loss of our supporter for c/r.

Sorry we can't be of more help :-(
Ralph

> 
> Thanks,
> 
> Maxime
> 
> 
> Le 2013-01-28 10:58, Ralph Castain a écrit :
>> Our c/r person has moved on to a different career path, so we may not have 
>> anyone who can answer this question.
>> 
>> What we can say is that checkpointing at any significant scale will always 
>> be a losing proposition. It just takes too long and hammers the file system. 
>> People have been working on extending the capability with things like "burst 
>> buffers" (basically putting an SSD in front of the file system to absorb the 
>> checkpoint surge), but that hasn't become very common yet.
>> 
>> Frankly, what people have found to be the "best" solution is for your app to 
>> periodically write out its intermediate results, and then take a flag that 
>> indicates "read prior results" when it starts. This minimizes the amount of 
>> data being written to the disk. If done correctly, you would only lose 
>> whatever work was done since the last intermediate result was written - 
>> which is about equivalent to losing whatever works was done since the last 
>> checkpoint.
>> 
>> HTH
>> Ralph
>> 
>> On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Hello,
>>> I am doing checkpointing tests (with BLCR) with an MPI application compiled 
>>> with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.
>>> 
>>> First, some details about the tests :
>>> - The only filesystem available on the nodes are 1) one tmpfs, 2) one 
>>> lustre shared filesystem (tested to be able to provide ~15GB/s for writing 
>>> and support ~40k IOPs).
>>> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
>>> nodes). Each MPI rank was using approximately 200MB of memory.
>>> - I was doing checkpoints with ompi-checkpoint and restarting with 
>>> ompi-restart.
>>> - I was starting with mpirun -am ft-enable-cr
>>> - The nodes are monitored by ganglia, which allows me to see the number of 
>>> IOPs and the read/write speed on the filesystem.
>>> 
>>> I tried a few different mca settings, but I consistently observed that :
>>> - The checkpoints lasted ~4-5 minutes each time
>>> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing 
>>> at ~15MB/s.
>>> 
>>> I am worried by the number of IOPs and the very slow writing speed. This 
>>> was a very small test. We have jobs running with 128 or 256 MPI ranks, each 
>>> using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs 
>>> would reach tens of thousands and would completely overload our lustre 
>>> filesystem. Moreover, with 15MB/s per node, the checkpointing process would 
>>> take hours.
>>> 
>>> How can I improve on that ? Is there an MCA setting that I am missing ?
>>> 
>>> Thanks,
>>> 
>>> -- 
>>> -
>>> Maxime Boissonneault
>>> Analyste de calcul - Calcul Québec, Université Laval
>>> Ph. D. en physique
>>> 
> 
> 
> -- 
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
> 




Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Maxime Boissonneault

On 2013-01-28 12:46, Ralph Castain wrote:

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
 wrote:


Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application 
itself, but that is not always possible (commercial applications, use of 
complicated libraries, algorithms with no clear progression points at which you 
can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?
In this specific case, the jobs run for days. The risk of a hardware 
or power failure over that kind of duration becomes too high (we allow for 
no more than 48 hours of run time).
While it is true we can dig through the code of the library to make it 
checkpoint, BLCR checkpointing just seemed easier.



There certainly must be a better way to write the information down on disc 
though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering 
involved ?

I don't know - that's all done in BLCR, I believe. Either way, it isn't 
something we can address due to the loss of our supporter for c/r.

I suppose I should contact BLCR instead then.

Thank you,

Maxime


Sorry we can't be of more help :-(
Ralph


Thanks,

Maxime


Le 2013-01-28 10:58, Ralph Castain a écrit :

Our c/r person has moved on to a different career path, so we may not have 
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a losing 
proposition. It just takes too long and hammers the file system. People have been working 
on extending the capability with things like "burst buffers" (basically putting 
an SSD in front of the file system to absorb the checkpoint surge), but that hasn't 
become very common yet.

Frankly, what people have found to be the "best" solution is for your app to periodically 
write out its intermediate results, and then take a flag that indicates "read prior 
results" when it starts. This minimizes the amount of data being written to the disk. If done 
correctly, you would only lose whatever work was done since the last intermediate result was 
written - which is about equivalent to losing whatever work was done since the last checkpoint.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
 wrote:


Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled 
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

First, some details about the tests :
- The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre 
shared filesystem (tested to be able to provide ~15GB/s for writing and support 
~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart.
- I was starting with mpirun -am ft-enable-cr
- The nodes are monitored by ganglia, which allows me to see the number of IOPs 
and the read/write speed on the filesystem.

I tried a few different mca settings, but I consistently observed that :
- The checkpoints lasted ~4-5 minutes each time
- During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at 
~15MB/s.

I am worried by the number of IOPs and the very slow writing speed. This was a 
very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 
GB of ram per rank. With such jobs, the overall number of IOPs would reach tens 
of thousands and would completely overload our lustre filesystem. Moreover, 
with 15MB/s per node, the checkpointing process would take hours.

How can I improve on that ? Is there an MCA setting that I am missing ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
 wrote:

> Le 2013-01-28 12:46, Ralph Castain a écrit :
>> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Hello Ralph,
>>> I agree that ideally, someone would implement checkpointing in the 
>>> application itself, but that is not always possible (commercial 
>>> applications, use of complicated libraries, algorithms with no clear 
>>> progression points at which you can interrupt the algorithm and start it 
>>> back from there).
>> Hmmm...well, most apps can be adjusted to support it - we have some very 
>> complex apps that were updated that way. Commercial apps are another story, 
>> but we frankly don't find much call for checkpointing those as they 
>> typically just don't run long enough - especially if you are only running 
>> 256 ranks, so your cluster is small. Failure rates just don't justify it in 
>> such cases, in our experience.
>> 
>> Is there some particular reason why you feel you need checkpointing?
> This specific case is that the jobs run for days. The risk of a hardware or 
> power failure for that kind of duration goes too high (we allow for no more 
> than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a 
small one like you describe, we find the probability that a job will be 
interrupted even during a multi-week app is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that 
execute non-stop for about a month with no reported failures. Are you actually 
seeing failures, or are you anticipating them?

> While it is true we can dig through the code of the library to make it 
> checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint support in OMPI will disappear in v1.7 
and there is no clear timetable for restoring it.

>> 
>>> There certainly must be a better way to write the information down on disc 
>>> though. Doing 500 IOPs seems to be completely broken. Why isn't there 
>>> buffering involved ?
>> I don't know - that's all done in BLCR, I believe. Either way, it isn't 
>> something we can address due to the loss of our supporter for c/r.
> I suppose I should contact BLCR instead then.

For the disk op problem, I think that's the way to go - though like I said, I 
could be wrong and the disk writes could be something we do inside OMPI. I'm 
not familiar enough with the c/r code to state it with certainty.

> 
> Thank you,
> 
> Maxime
>> 
>> Sorry we can't be of more help :-(
>> Ralph
>> 
>>> Thanks,
>>> 
>>> Maxime
>>> 
>>> 
>>> Le 2013-01-28 10:58, Ralph Castain a écrit :
 Our c/r person has moved on to a different career path, so we may not have 
 anyone who can answer this question.
 
 What we can say is that checkpointing at any significant scale will always 
 be a losing proposition. It just takes too long and hammers the file 
 system. People have been working on extending the capability with things 
 like "burst buffers" (basically putting an SSD in front of the file system 
 to absorb the checkpoint surge), but that hasn't become very common yet.
 
 Frankly, what people have found to be the "best" solution is for your app 
 to periodically write out its intermediate results, and then take a flag 
 that indicates "read prior results" when it starts. This minimizes the 
 amount of data being written to the disk. If done correctly, you would 
 only lose whatever work was done since the last intermediate result was 
 written - which is about equivalent to losing whatever works was done 
 since the last checkpoint.
 
 HTH
 Ralph
 
 On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
  wrote:
 
> Hello,
> I am doing checkpointing tests (with BLCR) with an MPI application 
> compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite 
> strange.
> 
> First, some details about the tests :
> - The only filesystem available on the nodes are 1) one tmpfs, 2) one 
> lustre shared filesystem (tested to be able to provide ~15GB/s for 
> writing and support ~40k IOPs).
> - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 
> 2 nodes). Each MPI rank was using approximately 200MB of memory.
> - I was doing checkpoints with ompi-checkpoint and restarting with 
> ompi-restart.
> - I was starting with mpirun -am ft-enable-cr
> - The nodes are monitored by ganglia, which allows me to see the number 
> of IOPs and the read/write speed on the filesystem.
> 
> I tried a few different mca settings, but I consistently observed that :
> - The checkpoints lasted ~4-5 minutes each time
> - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing 
> at ~15MB/s.
> 
> I am worried by the number of IOPs and the very slow writing speed. This 
> was a very small test. We have jobs

Re: [OMPI users] very low performance over infiniband

2013-01-28 Thread Shamis, Pavel
Also make sure that processes were not swapped out to a hard drive.

Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jan 27, 2013, at 6:39 AM, John Hearns <hear...@googlemail.com> wrote:


2 percent?

Have you logged into a compute node and run a simple top when the job is 
running?
Are all the processes distributed across the CPU cores?
Are the processes being pinned properly to a core? Or are they hopping from 
core to core?

Also make SURE all nodes have booted with all cores online and all report the 
same amount of RAM.





Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Maxime Boissonneault

On 2013-01-28 13:15, Ralph Castain wrote:

On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
 wrote:


Le 2013-01-28 12:46, Ralph Castain a écrit :

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
 wrote:


Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application 
itself, but that is not always possible (commercial applications, use of 
complicated libraries, algorithms with no clear progression points at which you 
can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?

This specific case is that the jobs run for days. The risk of a hardware or 
power failure for that kind of duration goes too high (we allow for no more 
than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a 
small one like you describe, we find the probability that a job will be 
interrupted even during a multi-week app is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that 
execute non-stop for about a month with no reported failures. Are you actually 
seeing failures, or are you anticipating them?
While our filesystem and management nodes are on UPS, our compute nodes 
are not. With on average one generic (mostly power/cooling) failure every 
one or two months, running for weeks is just asking for trouble. Add to 
that typical DIMM/CPU/networking failures (I estimate about 1 node goes 
down per day because of some sort of hardware failure, for a cluster of 
960 nodes). With these numbers, a job running on 32 nodes for 7 days has 
a ~35% chance of failing before it is done.
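
For what it's worth, here is roughly how I get a number in that range (a 
back-of-the-envelope estimate assuming independent failures, about one node 
lost per day out of 960, and one site-wide power/cooling incident every ~45 days):

$$ P(\text{fail}) \approx 1 - \Bigl(1 - \tfrac{1}{960}\Bigr)^{32 \times 7}\Bigl(1 - \tfrac{7}{45}\Bigr) \approx 1 - 0.79 \times 0.84 \approx 0.34 $$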


Having 24GB of ram per node, even if a 32 nodes job takes close to 100% 
of the ram, that's merely 640 GB of data. Writing that on a lustre 
filesystem capable of reaching ~15GB/s should take no more than a few 
minutes if written correctly. Right now, I am getting a few minutes for 
a hundredth of this amount of data!



While it is true we can dig through the code of the library to make it 
checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint support in OMPI will disappear in v1.7 
and there is no clear timetable for restoring it.

That is very good to know. Thanks for the information. It is too bad though.



There certainly must be a better way to write the information down on disc 
though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering 
involved ?

I don't know - that's all done in BLCR, I believe. Either way, it isn't 
something we can address due to the loss of our supporter for c/r.

I suppose I should contact BLCR instead then.

For the disk op problem, I think that's the way to go - though like I said, I 
could be wrong and the disk writes could be something we do inside OMPI. I'm 
not familiar enough with the c/r code to state it with certainty.


Thank you,

Maxime

Sorry we can't be of more help :-(
Ralph


Thanks,

Maxime


Le 2013-01-28 10:58, Ralph Castain a écrit :

Our c/r person has moved on to a different career path, so we may not have 
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a losing 
proposition. It just takes too long and hammers the file system. People have been working 
on extending the capability with things like "burst buffers" (basically putting 
an SSD in front of the file system to absorb the checkpoint surge), but that hasn't 
become very common yet.

Frankly, what people have found to be the "best" solution is for your app to periodically 
write out its intermediate results, and then take a flag that indicates "read prior 
results" when it starts. This minimizes the amount of data being written to the disk. If done 
correctly, you would only lose whatever work was done since the last intermediate result was 
written - which is about equivalent to losing whatever work was done since the last checkpoint.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault 
 wrote:


Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled 
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

First, some details about the tests :
- The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre 
shared filesystem (tested to be able to provide ~15GB/s for writing and support 
~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 
nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoint

Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
 wrote:

> Le 2013-01-28 13:15, Ralph Castain a écrit :
>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
>>  wrote:
>> 
>>> Le 2013-01-28 12:46, Ralph Castain a écrit :
 On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
  wrote:
 
> Hello Ralph,
> I agree that ideally, someone would implement checkpointing in the 
> application itself, but that is not always possible (commercial 
> applications, use of complicated libraries, algorithms with no clear 
> progression points at which you can interrupt the algorithm and start it 
> back from there).
 Hmmm...well, most apps can be adjusted to support it - we have some very 
 complex apps that were updated that way. Commercial apps are another 
 story, but we frankly don't find much call for checkpointing those as they 
 typically just don't run long enough - especially if you are only running 
 256 ranks, so your cluster is small. Failure rates just don't justify it 
 in such cases, in our experience.
 
 Is there some particular reason why you feel you need checkpointing?
>>> This specific case is that the jobs run for days. The risk of a hardware or 
>>> power failure for that kind of duration goes too high (we allow for no more 
>>> than 48 hours of run time).
>> I'm surprised by that - we run with UPS support on the clusters, but for a 
>> small one like you describe, we find the probability that a job will be 
>> interrupted even during a multi-week app is vanishingly small.
>> 
>> FWIW: I do work with the financial industry where we regularly run apps that 
>> execute non-stop for about a month with no reported failures. Are you 
>> actually seeing failures, or are you anticipating them?
> While our filesystem and management nodes are on UPS, our compute nodes are 
> not. With one average generic (power/cooling mostly) failure every one or two 
> months, running for weeks is just asking for trouble.

Wow, that is high

> If you add to that typical dimm/cpu/networking failures (I estimated about 1 
> node goes down per day because of some sort hardware failure, for a cluster 
> of 960 nodes).

That is incredibly high

> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of 
> failing before it is done.

I've never seen anything like that behavior in practice - a 32 node cluster 
typically runs for quite a few months without a failure. It is a typical size 
for the financial sector, so we have a LOT of experience with such clusters.

I suspect you won't see anything like that behavior...

> 
> Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of 
> the ram, that's merely 640 GB of data. Writing that on a lustre filesystem 
> capable of reaching ~15GB/s should take no more than a few minutes if written 
> correctly. Right now, I am getting a few minutes for a hundredth of this 
> amount of data!


Agreed - but I don't think you'll get that bandwidth for checkpointing. I 
suspect you'll find that checkpointing really has troubles when scaling, which 
is why you don't see it used in production (at least, I haven't). Mostly used 
for research by a handful of organizations, which is why we haven't been too 
concerned about its loss.


> 
>>> While it is true we can dig through the code of the library to make it 
>>> checkpoint, BLCR checkpointing just seemed easier.
>> I see - just be aware that checkpoint support in OMPI will disappear in v1.7 
>> and there is no clear timetable for restoring it.
> That is very good to know. Thanks for the information. It is too bad though.
>> 
> There certainly must be a better way to write the information down on 
> disc though. Doing 500 IOPs seems to be completely broken. Why isn't 
> there buffering involved ?
 I don't know - that's all done in BLCR, I believe. Either way, it isn't 
 something we can address due to the loss of our supporter for c/r.
>>> I suppose I should contact BLCR instead then.
>> For the disk op problem, I think that's the way to go - though like I said, 
>> I could be wrong and the disk writes could be something we do inside OMPI. 
>> I'm not familiar enough with the c/r code to state it with certainty.
>> 
>>> Thank you,
>>> 
>>> Maxime
 Sorry we can't be of more help :-(
 Ralph
 
> Thanks,
> 
> Maxime
> 
> 
> Le 2013-01-28 10:58, Ralph Castain a écrit :
>> Our c/r person has moved on to a different career path, so we may not 
>> have anyone who can answer this question.
>> 
>> What we can say is that checkpointing at any significant scale will 
>> always be a losing proposition. It just takes too long and hammers the 
>> file system. People have been working on extending the capability with 
>> things like "burst buffers" (basically putting an SSD in front of the 
>> file system to absorb the checkpoint surge), but that hasn't become very 

Re: [OMPI users] very low performance over infiniband

2013-01-28 Thread John Hearns
Have you run ibstat on every single node and made sure all links are up at
the correct speed?

Have you checked the output to make sure that you are not somehow running
over Ethernet?


Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread George Bosilca
At the scale you address you should have no trouble with the C/R if
the file system is correctly configured. We get more bandwidth per
node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
benchmarks on your cluster ?

 George.

PS: You can find some MPI I/O benchmarks at
http://www.mcs.anl.gov/~thakur/pio-benchmarks.html
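
If you just want a quick sanity check before pulling in one of those suites, a
minimal MPI-IO test along these lines (a rough sketch; the file name and the
256 MB per-rank transfer size are arbitrary choices) shows what one large
collective write per rank achieves:

/* Rough aggregate write-bandwidth check using standard MPI-IO calls.
 * The file name and per-rank transfer size are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const size_t nbytes = 256UL * 1024 * 1024;   /* 256 MB per rank */
    int rank, size;
    char *buf;
    MPI_File fh;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(nbytes);
    memset(buf, rank, nbytes);

    MPI_File_open(MPI_COMM_WORLD, "bw_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    /* one large contiguous write per rank at a disjoint offset */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * nbytes, buf,
                          (int)nbytes, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    MPI_File_close(&fh);

    if (rank == 0)
        printf("aggregate write bandwidth: %.1f MB/s\n",
               (double)nbytes * size / (t1 - t0) / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}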



On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain  wrote:
>
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
>  wrote:
>
>> Le 2013-01-28 13:15, Ralph Castain a écrit :
>>> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
>>>  wrote:
>>>
 Le 2013-01-28 12:46, Ralph Castain a écrit :
> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
>  wrote:
>
>> Hello Ralph,
>> I agree that ideally, someone would implement checkpointing in the 
>> application itself, but that is not always possible (commercial 
>> applications, use of complicated libraries, algorithms with no clear 
>> progression points at which you can interrupt the algorithm and start it 
>> back from there).
> Hmmm...well, most apps can be adjusted to support it - we have some very 
> complex apps that were updated that way. Commercial apps are another 
> story, but we frankly don't find much call for checkpointing those as 
> they typically just don't run long enough - especially if you are only 
> running 256 ranks, so your cluster is small. Failure rates just don't 
> justify it in such cases, in our experience.
>
> Is there some particular reason why you feel you need checkpointing?
 This specific case is that the jobs run for days. The risk of a hardware 
 or power failure for that kind of duration goes too high (we allow for no 
 more than 48 hours of run time).
>>> I'm surprised by that - we run with UPS support on the clusters, but for a 
>>> small one like you describe, we find the probability that a job will be 
>>> interrupted even during a multi-week app is vanishingly small.
>>>
>>> FWIW: I do work with the financial industry where we regularly run apps 
>>> that execute non-stop for about a month with no reported failures. Are you 
>>> actually seeing failures, or are you anticipating them?
>> While our filesystem and management nodes are on UPS, our compute nodes are 
>> not. With one average generic (power/cooling mostly) failure every one or 
>> two months, running for weeks is just asking for trouble.
>
> Wow, that is high
>
>> If you add to that typical dimm/cpu/networking failures (I estimated about 1 
>> node goes down per day because of some sort hardware failure, for a cluster 
>> of 960 nodes).
>
> That is incredibly high
>
>> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance 
>> of failing before it is done.
>
> I've never seen anything like that behavior in practice - a 32 node cluster 
> typically runs for quite a few months without a failure. It is a typical size 
> for the financial sector, so we have a LOT of experience with such clusters.
>
> I suspect you won't see anything like that behavior...
>
>>
>> Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of 
>> the ram, that's merely 640 GB of data. Writing that on a lustre filesystem 
>> capable of reaching ~15GB/s should take no more than a few minutes if 
>> written correctly. Right now, I am getting a few minutes for a hundredth of 
>> this amount of data!
>
>
> Agreed - but I don't think you'll get that bandwidth for checkpointing. I 
> suspect you'll find that checkpointing really has troubles when scaling, 
> which is why you don't see it used in production (at least, I haven't). 
> Mostly used for research by a handful of organizations, which is why we 
> haven't been too concerned about its loss.
>
>
>>
 While it is true we can dig through the code of the library to make it 
 checkpoint, BLCR checkpointing just seemed easier.
>>> I see - just be aware that checkpoint support in OMPI will disappear in 
>>> v1.7 and there is no clear timetable for restoring it.
>> That is very good to know. Thanks for the information. It is too bad though.
>>>
>> There certainly must be a better way to write the information down on 
>> disc though. Doing 500 IOPs seems to be completely broken. Why isn't 
>> there buffering involved ?
> I don't know - that's all done in BLCR, I believe. Either way, it isn't 
> something we can address due to the loss of our supporter for c/r.
 I suppose I should contact BLCR instead then.
>>> For the disk op problem, I think that's the way to go - though like I said, 
>>> I could be wrong and the disk writes could be something we do inside OMPI. 
>>> I'm not familiar enough with the c/r code to state it with certainty.
>>>
 Thank you,

 Maxime
> Sorry we can't be of more help :-(
> Ralph
>
>> Thanks,
>>
>> Maxime
>>
>>
>> Le 2013-01-28 10:58, Ralph Castain a écrit :
>>> Our c/r person has moved on to

Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread Maxime Boissonneault

Hi George,
The problem here is not the bandwidth, but the number of IOPs. I wrote 
to the BLCR list, and they confirmed that :
"While ideally the checkpoint would be written in sizable chunks, the 
current code in BLCR will issue a single write operation for each 
contiguous range of user memory, and many quite small writes for various 
meta-data and non-memory state (registers, signal-handlers, etc.).  As 
shown in Table 1 of the paper cited above, the writes in the 10s of KB 
range will dominate performance."


(Reference being: X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. Wang, 
J. Huang and D. K. Panda, CRFS: A Lightweight User-Level Filesystem for 
Generic Checkpoint/Restart, Int'l Conference on Parallel Processing 
(ICPP '11), Sept. 2011. (PDF))
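
For reference, the kind of aggregation that would avoid this looks roughly like 
the sketch below - just the generic technique (stage the small records in one 
big buffer and flush it with a single large write), not actual code from BLCR 
or CRFS, and the 64 MB buffer size is arbitrary:

/* Generic write-aggregation sketch: small records are staged in a large
 * buffer and flushed with a single write(), so the filesystem sees a few
 * large operations instead of thousands of tiny ones.  Illustrative only,
 * not code from BLCR or CRFS. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STAGE_SIZE (64UL * 1024 * 1024)   /* 64 MB staging buffer */

struct staged_writer {
    int fd;          /* destination file descriptor */
    size_t used;     /* bytes currently staged */
    char *buf;       /* staging buffer */
};

static int staged_open(struct staged_writer *w, int fd)
{
    w->fd = fd;
    w->used = 0;
    w->buf = malloc(STAGE_SIZE);
    return w->buf ? 0 : -1;
}

static void staged_flush(struct staged_writer *w)
{
    if (w->used > 0) {
        write(w->fd, w->buf, w->used);   /* one large write */
        w->used = 0;
    }
}

static void staged_write(struct staged_writer *w, const void *p, size_t n)
{
    if (w->used + n > STAGE_SIZE)
        staged_flush(w);
    if (n >= STAGE_SIZE) {               /* oversized record: write directly */
        write(w->fd, p, n);
        return;
    }
    memcpy(w->buf + w->used, p, n);
    w->used += n;
}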


We did run parallel IO benchmarks. Our filesystem can reach a speed of 
~15GB/s, but only with large IO operations (at least bigger than 1MB, 
optimally in the 100MB-1GB range). For small (<1MB) operations, the 
filesystem is considerably slower. I believe this is precisely what is 
killing the performance here.


Not sure there is anything to be done about it.

Best regards,


Maxime

On 2013-01-28 15:40, George Bosilca wrote:

At the scale you address you should have no trouble with the C/R if
the file system is correctly configured. We get more bandwidth per
node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
benchmarks on your cluster ?

  George.

PS: You can find some MPI I/O benchmarks at
http://www.mcs.anl.gov/~thakur/pio-benchmarks.html



On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain  wrote:

On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
 wrote:


Le 2013-01-28 13:15, Ralph Castain a écrit :

On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
 wrote:


Le 2013-01-28 12:46, Ralph Castain a écrit :

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
 wrote:


Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application 
itself, but that is not always possible (commercial applications, use of 
complicated libraries, algorithms with no clear progression points at which you 
can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?

This specific case is that the jobs run for days. The risk of a hardware or 
power failure for that kind of duration goes too high (we allow for no more 
than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a 
small one like you describe, we find the probability that a job will be 
interrupted even during a multi-week app is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that 
execute non-stop for about a month with no reported failures. Are you actually 
seeing failures, or are you anticipating them?

While our filesystem and management nodes are on UPS, our compute nodes are 
not. With one average generic (power/cooling mostly) failure every one or two 
months, running for weeks is just asking for trouble.

Wow, that is high


If you add to that typical dimm/cpu/networking failures (I estimated about 1 
node goes down per day because of some sort hardware failure, for a cluster of 
960 nodes).

That is incredibly high


With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of 
failing before it is done.

I've never seen anything like that behavior in practice - a 32 node cluster 
typically runs for quite a few months without a failure. It is a typical size 
for the financial sector, so we have a LOT of experience with such clusters.

I suspect you won't see anything like that behavior...


Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of the 
ram, that's merely 640 GB of data. Writing that on a lustre filesystem capable 
of reaching ~15GB/s should take no more than a few minutes if written 
correctly. Right now, I am getting a few minutes for a hundredth of this amount 
of data!


Agreed - but I don't think you'll get that bandwidth for checkpointing. I 
suspect you'll find that checkpointing really has troubles when scaling, which 
is why you don't see it used in production (at least, I haven't). Mostly used 
for research by a handful of organizations, which is why we haven't been too 
concerned about its loss.



While it is true we can dig through the code of the library to make it 
checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint su

Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-28 Thread George Bosilca
Based on the paper you linked, the answer is quite obvious. The
proposed CRFS mechanism supports all of the checkpoint-enabled MPI
implementations, so you just have to go with the one that provides and
maintains the services you need.

  George.

On Mon, Jan 28, 2013 at 3:46 PM, Maxime Boissonneault
 wrote:
> Hi George,
> The problem here is not the bandwidth, but the number of IOPs. I wrote to
> the BLCR list, and they confirmed that :
> "While ideally the checkpoint would be written in sizable chunks, the
> current code in BLCR will issue a single write operation for each contiguous
> range of user memory, and many quite small writes for various meta-data and
> non-memory state (registers, signal-handlers,etc).  As show in Table 1 of
> the paper cited above, the writes in the 10s of KB range will dominate
> performance."
>
> (Reference being : X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. Wang, J.
> Huang and D. K. Panda, CRFS: A Lightweight User-Level Filesystem for Generic
> Checkpoint/Restart, Int'l Conference on Parallel Processing (ICPP '11),
> Sept. 2011. (PDF))
>
> We did run parallel IO benchmarks. Our filesystem can reach a speed of
> ~15GB/s, but only with large IO operations (at least bigger than 1MB,
> optimally in the 100MB-1GB range). For small (<1MB) operations, the
> filesystem is considerably slower. I believe this is precisely what is
> killing the performance here.
>
> Not sure there is anything to be done about it.
>
> Best regards,
>
>
> Maxime
>
> Le 2013-01-28 15:40, George Bosilca a écrit :
>
> At the scale you address you should have no trouble with the C/R if
> the file system is correctly configured. We get more bandwidth per
> node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
> benchmarks on your cluster ?
>
>  George.
>
> PS: You can some MPI I/O benchmarks at
> http://www.mcs.anl.gov/~thakur/pio-benchmarks.html
>
>
>
> On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain  wrote:
>
> On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault
>  wrote:
>
> Le 2013-01-28 13:15, Ralph Castain a écrit :
>
> On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault
>  wrote:
>
> Le 2013-01-28 12:46, Ralph Castain a écrit :
>
> On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
>  wrote:
>
> Hello Ralph,
> I agree that ideally, someone would implement checkpointing in the
> application itself, but that is not always possible (commercial
> applications, use of complicated libraries, algorithms with no clear
> progression points at which you can interrupt the algorithm and start it
> back from there).
>
> Hmmm...well, most apps can be adjusted to support it - we have some very
> complex apps that were updated that way. Commercial apps are another story,
> but we frankly don't find much call for checkpointing those as they
> typically just don't run long enough - especially if you are only running
> 256 ranks, so your cluster is small. Failure rates just don't justify it in
> such cases, in our experience.
>
> Is there some particular reason why you feel you need checkpointing?
>
> This specific case is that the jobs run for days. The risk of a hardware or
> power failure for that kind of duration goes too high (we allow for no more
> than 48 hours of run time).
>
> I'm surprised by that - we run with UPS support on the clusters, but for a
> small one like you describe, we find the probability that a job will be
> interrupted even during a multi-week app is vanishingly small.
>
> FWIW: I do work with the financial industry where we regularly run apps that
> execute non-stop for about a month with no reported failures. Are you
> actually seeing failures, or are you anticipating them?
>
> While our filesystem and management nodes are on UPS, our compute nodes are
> not. With one average generic (power/cooling mostly) failure every one or
> two months, running for weeks is just asking for trouble.
>
> Wow, that is high
>
> If you add to that typical dimm/cpu/networking failures (I estimated about 1
> node goes down per day because of some sort hardware failure, for a cluster
> of 960 nodes).
>
> That is incredibly high
>
> With these numbers, a job running on 32 nodes for 7 days has a ~35% chance
> of failing before it is done.
>
> I've never seen anything like that behavior in practice - a 32 node cluster
> typically runs for quite a few months without a failure. It is a typical
> size for the financial sector, so we have a LOT of experience with such
> clusters.
>
> I suspect you won't see anything like that behavior...
>
> Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of
> the ram, that's merely 640 GB of data. Writing that on a lustre filesystem
> capable of reaching ~15GB/s should take no more than a few minutes if
> written correctly. Right now, I am getting a few minutes for a hundredth of
> this amount of data!
>
> Agreed - but I don't think you'll get that bandwidth for checkpointing. I
> suspect you'll find that checkpointing really has troubles when sc

Re: [OMPI users] MPI_THREAD_FUNNELED and enable-mpi-thread-multiple

2013-01-28 Thread Brian Budge
I believe that yes, you have to configure with --enable-mpi-thread-multiple
to get anything other than SINGLE.
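
A quick way to check what a given build actually provides (plain MPI calls,
nothing Open MPI specific):

/* Minimal check of the thread level an MPI build actually provides. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int requested = MPI_THREAD_FUNNELED, provided;

    MPI_Init_thread(&argc, &argv, requested, &provided);
    printf("requested %d, provided %d (SINGLE=%d FUNNELED=%d "
           "SERIALIZED=%d MULTIPLE=%d)\n",
           requested, provided, MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
           MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);
    MPI_Finalize();
    return 0;
}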

  Brian

On Tue, Jan 22, 2013 at 12:56 PM, Roland Schulz  wrote:
> Hi,
>
> compiling 1.6.1 or 1.6.2 without enable-mpi-thread-multiple makes
> MPI_Init_thread return MPI_THREAD_SINGLE as the provided level. Is
> enable-mpi-thread-multiple required even for
> MPI_THREAD_FUNNELED/MPI_THREAD_SERIALIZED?
>
> This question has been asked before:
> http://www.open-mpi.org/community/lists/users/2011/05/16451.php but I
> couldn't find an answer.
>
> Roland
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309
>