n "large" problems, but I think I could export the data for a 512
processes reproducer with PARMetis call only...
Thanks for helping,
Eric
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42
--
Josh Hursey
IBM Spectrum MPI Developer
Is there a reason for not supporting this not-bound configuration when
a rankfile is specified?
[1]
https://stackoverflow.com/questions/32333785/how-to-provide-a-default-slot-list-in-openmpi-rankfile
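For context, a minimal rankfile looks something like this (hostnames and slot
lists purely illustrative):

  rank 0=nodeA slot=0:0
  rank 1=nodeA slot=0:1
  rank 2=nodeB slot=1:0-1

where "slot=socket:core(s)" pins each rank; the StackOverflow post above asks
what mpirun should do when the "slot=..." part is omitted.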
--
Josh Hursey
IBM Spectrum MPI Developer
On 18/12/2020 23:04, Josh Hursey wrote:
Vincent,
Thanks for the details on the bug. Indeed this is a case that seems to have
been a problem for a little while now when you use static ports with ORTE (-mca
oob_tcp_static_ipv4_ports option). It must have crept in w
        class_initialize(cls);
    }
    if (NULL != object) {
        object->obj_class = cls;
        object->obj_reference_count = 1;
        opal_obj_run_constructors(object);
    }
    return object;
}
Can you maybe (firstly
er, and want to make sure there are no
> known issues between the hardware and software before we make a purchase.
> >
> > Any feedback will be greatly appreciated.
> >
> > Thanks,
> >
> > Prentice
> >
--
Josh Hursey
IBM Spectrum MPI Developer
PID 0 on node cn15 exited on
> signal 7 (Bus error).
> ------
>
> Thanks in advance,
>
> Ender
>
--
Josh Hursey
IBM Spectrum MPI Developer
or target 'all-recursive' failed
> make: *** [all-recursive] Error 1
> loki openmpi-2.1.0rc1-Linux.x86_64.64_cc 129
>
>
> Gilles, I would be grateful, if you can fix the problem for
> openmpi-2.1.0rc1 as well. Thank you very much for your help
> in advance.
>
>
> K
: [././././B/B/./././././.][./././././././././././.]
> > [somehost:105601] MCW rank 5 bound to socket 1[core 16[hwt 0]], socket
> 1[core 17[hwt 0]]: [./././././././././././.][././././B/B/./././././.]
> >
> >
> > Any ideas, please?
> >
> > Thanks,
> >
> > Mark
--
Josh Hursey
IBM Spectrum MPI Developer
IBM will be helping to support the LSF functionality in Open MPI. We don't
have any detailed documentation just yet, other than the FAQ on the Open
MPI site. However, the LSF components in Open MPI should be functional in
the latest releases. I've tested recently with LSF 9.1.3 and 10.1.
I pushed
functionality (without the
LSB_PJL_TASK_GEOMETRY variable).
-- Josh
On Tue, Apr 19, 2016 at 8:57 AM, Josh Hursey wrote:
> Farid,
>
> I have access to the same cluster inside IBM. I can try to help you track
> this down and maybe work up a patch with the LSF folks. I'll contact you
>
Farid,
I have access to the same cluster inside IBM. I can try to help you track
this down and maybe work up a patch with the LSF folks. I'll contact you
off-list with my IBM address and we can work on this a bit.
I'll post back to the list with what we found.
-- Josh
On Tue, Apr 19, 2016 at 5
The C/R Debugging feature (the ability to do reversible debugging or
backward stepping with gdb and/or DDT) was added on 8/10/2010 in the commit
below:
https://svn.open-mpi.org/trac/ompi/changeset/23587
This feature never made it into a release so it was only ever available on
the trunk. However
This is a bit late in the thread, but I wanted to add one more note.
The functionality that made it to v1.6 is fairly basic in terms of C/R
support in Open MPI. It supported a global checkpoint write, and (for a
time) a simple staged option (I think that is now broken).
In the trunk (about 3 year
command not found
>
> Please assist.
>
> Regards - Ifeanyi
>
>
>
> On Wed, Dec 12, 2012 at 3:19 AM, Josh Hursey wrote:
>
>> Process migration was implemented in Open MPI and working in the trunk a
>> couple of years ago. It has not been well maintained for a few
With that configure string, Open MPI should fail in configure if it does
not find the BLCR libraries. Note that this does not check to make sure the
BLCR is loaded as a module in the kernel (you will need to check that
manually).
The ompi_info command will also show you if C/R is enabled and which
checkpoint/restart components are available.
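For example (the same check appears verbatim later in this archive):

  ompi_info | grep crs

will list the compiled-in checkpoint/restart services (e.g. the 'blcr' and
'self' components) when C/R support is enabled.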
Process migration was implemented in Open MPI and working in the trunk a
couple of years ago. It has not been well maintained for a few years though
(hopefully that will change one day). So you can try it, but your results
may vary.
Some details are at the link below:
http://osl.iu.edu/research/
The openib BTL and BLCR support in Open MPI were working about a year ago
(when I last checked). The psm BTL is not supported at the moment though.
From the error, I suspect that we are not fully closing the openib btl
driver before the checkpoint thus when we try to restart it is looking for
a r
Can you send the config.log and some of the other information described on:
http://www.open-mpi.org/community/help/
-- Josh
On Wed, Nov 14, 2012 at 6:01 PM, Ifeanyi wrote:
> Hi all,
>
> I got this message when I issued this command:
>
> root@node1:/home/abolap# ompi_info | grep crs
>
Pramoda,
That paper was exploring an application of a proposed extension to the MPI
standard for fault tolerance purposes. By default this proposed interface
is not provided by Open MPI. We have created a prototype version of Open
MPI that includes this extension, and it can be found at the follow
In your desired ordering you have rank 0 on (socket,core) (0,0) and
rank 1 on (0,2). Is there an architectural reason for that? Meaning
are cores 0 and 1 hardware threads in the same core, or is there a
cache level (say L2 or L3) connecting cores 0 and 1 separate from
cores 2 and 3?
hwloc's lstopo tool can show you this.
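For example, running

  lstopo

(or lstopo-no-graphics in a terminal-only environment) on a compute node
prints the socket/core/cache hierarchy, which answers exactly the question
above.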
Currently you have to do as Reuti mentioned (use the queuing system,
or create a script). We do have a feature request ticket open for this
feature if you are interested in following the progress:
https://svn.open-mpi.org/trac/ompi/ticket/1961
It has been open for a while, but the feature should
The official support page for the C/R features is hosted by Indiana
University (linked from the Open MPI FAQs):
http://osl.iu.edu/research/ft/ompi-cr/
The instructions probably need to be cleaned up (some of the release
references are not quite correct any longer). But the following should
give
Ifeanyi,
I am usually the one that responds to checkpoint/restart questions,
but unfortunately I do not have time to look into this issue at the
moment (and probably won't for at least a few more months). There are
a few other developers that work on the checkpoint/restart
functionality that might
You are correct that the Open MPI project combined the efforts of a
few preexisting MPI implementations towards building a single,
extensible MPI implementation with the best features of the prior MPI
implementations. From the beginning of the project the Open MPI
developer community has desired to
install openmpi in root, should I move to
> a general user account?
>
>
> From: Josh Hursey
> To: Open MPI Users
> Sent: 2012/4/24 (Tue) 10:58 PM
>
> Subject: Re: [OMPI users] Ompi-restart failed and process migration
>
> On Tue, Apr 24, 2012
RY_PATH -hostfile Hosts \
> ompi_global_snapshot_8873.ckpt/
> but it is Error.
Use quotes around the mpirun specific options:
ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" -hostfile Hosts
ompi_global_snapshot_8873.ckpt
or
ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH -
t Program contains DLL ?
I do not understand what you are trying to ask here. Please rephrase.
-- Josh
>
>
>
> From: Josh Hursey
> To: Open MPI Users
> Sent: 2012/4/23 (Mon) 10:51 PM
> Subject: Re: [OMPI users] Ompi-restart failed
I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.
A couple solutions:
- have the PATH and LD_LI
The 1.5 series does not support process migration, so there is no
ompi-migrate option there. This was only contributed to the trunk (1.7
series). However, changes to the runtime environment over the past few
months have broken this functionality. It is currently unclear when
this will be repaired.
This is a bit of a non-answer, but can you try the 1.5 series (1.5.5 is
the current release)? 1.4 is being phased out, and 1.5 will replace
it in the near future. 1.5 has a number of C/R related fixes that
might help.
-- Josh
On Thu, Mar 29, 2012 at 1:12 PM, Linton, Tom wrote:
> We have a legacy
When you receive that callback, the MPI library has been put in a quiescent
state. As such, it does not allow MPI communication until the checkpoint is
completely finished, so you cannot call barrier in the checkpoint callback.
Since Open MPI does a coordinated checkpoint, you can assume that all process
It looks like Jeff beat me to it. The problem was with a missing 'test' in
the configure script. I'm not sure how it crept in there, but the fix is
in the pipeline for the next 1.5 release. The progress of this patch is
tracked on the following ticket:
https://svn.open-mpi.org/trac
Well that is awfully insistent. I have been able to reproduce the problem.
Upon initial inspection I don't see the bug, but I'll dig into it today and
hopefully have a patch in a bit. Below is a ticket for this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2980
I'll let you know what I find out
> On 20-01-2012 15:26, Josh Hursey wrote:
>
> That behavior is permitted by the MPI 2.2 standard. It seems that our
> documentation is incorrect in this regard. I'll file a bug to fix it.
>
> Just to clarify, in the MPI 2.2 standard in Section 6.4.2 (Communicator
> Co
That behavior is permitted by the MPI 2.2 standard. It seems that our
documentation is incorrect in this regard. I'll file a bug to fix it.
Just to clarify, in the MPI 2.2 standard in Section 6.4.2 (Communicator
Constructors) under MPI_Comm_create it states:
"Each process must call with a group ar
what's important to
>> store, and how to do so, etc.). But if you're writing the application,
>> you're better off to handle it internally, than externally.
>>
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
Currently Open MPI only supports the checkpointing of the whole
application. There has been some work on uncoordinated checkpointing with
message logging, though I do not know the state of that work with regards
to availability. That work has been undertaken by the University of
Tennessee Knoxville
I have not tried to support a MTL with the checkpointing functionality, so
I do not have first hand experience with those - just the OB1/BML/BTL stack.
The difficulty in porting to a new transport is really a function of how
the transport interacts with the checkpointer (e.g., BLCR). The draining
Often this type of problem is due to the 'prelink' option in Linux.
BLCR has a FAQ item that discusses this issue and how to resolve it:
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
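(A sketch of the usual fix, from memory and worth verifying against that FAQ:
disable prelinking system-wide, e.g. set PRELINKING=no in
/etc/sysconfig/prelink on Red Hat style systems, then undo the prelinking
already applied to binaries with 'prelink -ua'.)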
I would give that a try. If that does not help then you might want to
try checkpointing a single (non-MPI) process with BLCR directly.
For MPI_Comm_split, all processes in the input communicator (oldcomm
or MPI_COMM_WORLD in your case) must call the operation since it is
collective over the input communicator. In your program rank 0 is not
calling the operation, so MPI_Comm_split is waiting for it to
participate.
If you want rank 0 excluded from the output communicators, have it call
MPI_Comm_split too, but with MPI_UNDEFINED as its color; it will get
MPI_COMM_NULL back.
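A minimal sketch of that pattern (the color assignment is illustrative):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Comm newcomm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* ALL ranks of the input communicator must call MPI_Comm_split.
       * A rank that should not join any output communicator passes
       * MPI_UNDEFINED as its color and gets MPI_COMM_NULL back. */
      int color = (0 == rank) ? MPI_UNDEFINED : rank % 2;
      MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);

      if (MPI_COMM_NULL != newcomm)
          MPI_Comm_free(&newcomm);
      MPI_Finalize();
      return 0;
  }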
Note that the "migrate me from my current node to node " scenario
is covered by the migration API exported by the C/R infrastructure, as
I noted earlier.
http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_migrate
The "move rank N to node " scenario could probably be added as an
extension of th
The MPI standard does not provide explicit support for process
migration. However, some MPI implementations (including Open MPI) have
integrated such support based on checkpoint/restart functionality. For
more information about the checkpoint/restart process migration
functionality in Open MPI see
I wonder if the try_compile step is failing. Can you send a compressed
copy of your config.log from this build?
-- Josh
On Mon, Oct 31, 2011 at 10:04 AM, ... wrote:
> Hi !
>
> I am trying to compile openmpi 1.4.4 with Torque, Infiniband and blcr
> checkpoint support on Puias Linux 6.x (free de
, Nguyen Toan wrote:
> Dear Josh,
> Thank you. I will test the 1.7 trunk as you suggested.
> Also I want to ask if we can add this interface to OpenMPI 1.4.2,
> because my applications are mainly involved in this version.
> Regards,
> Nguyen Toan
> On Wed, Oct 26, 2011 at 3:25 A
Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
interface to request a checkpoint of an application. This API is
defined on the following website:
http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_checkpoint
This will behave the same as if you requested the checkpoint of t
That option is only available on the trunk at the moment. I filed a
ticket to move the functionality to the 1.5 branch:
https://svn.open-mpi.org/trac/ompi/ticket/2890
The workaround would be to take the appfile generated from
"ompi-restart --apponly ompi_snapshot...", and then run mpirun with
that appfile plus the options you need.
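A sketch of that workaround (snapshot name and extra options hypothetical):

  ompi-restart --apponly ompi_global_snapshot_1234.ckpt
  mpirun -x LD_LIBRARY_PATH --app <appfile written by the previous command>

--app is mpirun's standard flag for launching from an appfile.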
That command line option may be only available on the trunk. What
version of Open MPI are you using?
-- Josh
On Tue, Oct 18, 2011 at 11:14 AM, Faisal Shahzad wrote:
> Hi,
> Thank you for your reply.
> I actually do not see option flag '--mpirun_opts' with 'ompi-restart
> --help'.
> Besides, I co
I'll preface my response with the note that I have not tried any of
those options with the C/R functionality. It should just work, but I
am not 100% certain. If it doesn't, let me know and I'll file a bug to
fix it.
You can pass any mpirun option through ompi-restart by using the
--mpirun_opts opt
It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.
I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any var
Though I do not share George's pessimism about acceptance to the Open
MPI community, it has been slightly difficult to add such a
non-standard feature to the code base for various reasons.
At ORNL, I have been developing a prototype for the MPI Forum Fault
Tolerance Working Group [1] of the Run-Through Stabilization proposal.
That seems like a bug to me.
What version of Open MPI are you using? How have you setup the C/R
functionality (what MCA options do you have set, what command line
options are you using)? Can you send a small reproducing application
that we can test against?
That should help us focus in on the pro
There are some great comments in this thread. Process migration (like
many topics in systems) can get complex fast.
The Open MPI process migration implementation is checkpoint/restart
based (currently using BLCR), and uses an 'eager' style of migration.
This style of migration stops a process comp
There should not be any issue in checkpointing a C++ vs. C program
using the 'self' checkpointer. The self checkpointer just looks for a
particular function name to be present in the compiled program binary.
Something to try is to run 'nm' on the compiled C++ program and make
sure that the 'self' checkpointer's callback symbols are present.
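Something like the following should list them (assuming the default callback
prefix, which is configurable through the crs_self_prefix MCA parameter):

  nm ./my_app | grep opal_crs_self

Keep in mind that C++ name mangling will hide these symbols unless the
callbacks are declared extern "C".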
I wonder if this is related to memory pinning. Can you try turning off
the leave pinned, and see if the problem persists (this may affect
performance, but should avoid the crash):
mpirun ... --mca mpi_leave_pinned 0 ...
Also it looks like Smoky has a slightly newer version of the 1.4
branch that
When we started adding Checkpoint/Restart functionality to Open MPI,
we were hoping to provide a LAM/MPI-like interface to the C/R
functionality. So we added a configure option as a placeholder. The
'LAM' option was intended to help those transitioning from LAM/MPI to
Open MPI. However we never got
(Sorry for the late reply)
On Jun 7, 2010, at 4:48 AM, Nguyen Kim Son wrote:
> Hello,
>
> I'm trying to get functions like orte-checkpoint, orte-restart, ... to work, but
> there are some errors that I don't have any clue about.
>
> Blcr (0.8.2) works fine apparently and I have installed openmpi
Open MPI can restart multi-threaded applications on any number of nodes (I do
this routinely in testing).
If you are still experiencing this problem (sorry for the late reply), can you
send me the MCA parameters that you are using, command line, and a backtrace
from the corefile generated by th
On Jun 14, 2010, at 5:26 AM, Nguyen Toan wrote:
> Hi all,
> I have a MPI program as follows:
> ---
> int main(int argc, char **argv){
>     MPI_Init(&argc, &argv);
>     ...
>     for (int i = 0; i < 1; i++) {
>         my_atomic_func();
>     }
>     ...
>     MPI_Finalize();
>     return 0;
> }
>
>
The amount of checkpoint overhead is application and system configuration
specific. So it is impossible to give you a good answer to how much checkpoint
overhead to expect for your application and system setup.
BLCR is only used to capture the single process image. The coordination of the
distr
(Sorry for the delay, I missed the C/R question in the mail)
On May 25, 2010, at 9:35 AM, Jeff Squyres wrote:
On May 24, 2010, at 2:02 PM, Michael E. Thomadakis wrote:
| > 2) I have installed blcr V0.8.2 but when I try to build OMPI
| > and I point to the full installation it complains it ca
(Sorry for the delay in replying, more below)
On Apr 8, 2010, at 1:34 PM, Fernando Lemos wrote:
Hello,
I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?
Suppose you have a cluster w
(Sorry for the delay in replying, more below)
On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
Hi Members,
I tried to use checkpoint/restart with openmpi,
but I cannot get correct checkpoint data.
I prepared the execution environment as follows; the strings in () are the
names of the output files whic
The functionality of checkpoint operation is not tied to CPU
utilization. Are you running with the C/R thread enabled? If not then
the checkpoint might be waiting until the process enters the MPI
library.
Does the system emit an error message describing the error that it
encountered?
Th
When you defined them in your environment did you prefix them with
'OMPI_MCA_'? Open MPI looks for this prefix to identify which
parameters are intended for it specifically.
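For example, setting

  export OMPI_MCA_opal_cr_use_thread=1

in your environment is equivalent to passing "-mca opal_cr_use_thread 1" on
the mpirun command line (parameter name borrowed from elsewhere in this
archive, purely for illustration).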
-- Josh
On May 12, 2010, at 11:09 PM, ... wrote:
Ralph
Defining these parameters in my environment also did not res
So I recently hit this same problem while doing some scalability
testing. I experimented with adding the --no-restore-pid option, but
found the same problem as you mention. Unfortunately, the problem is
with BLCR, not Open MPI.
BLCR will restart the process with a new PID, but the value ret
So what you are looking for is checkpoint/restart support, which you
can find some details about at the link below:
http://osl.iu.edu/research/ft/ompi-cr/
Additionally, we relatively recently added the ability to checkpoint
and 'stop' the application. This generates a usable checkpoint of t
I wonder if this is a bug with BLCR (since the segv stack is in the
BLCR thread). Can you try an non-MPI version of this application that
uses popen(), and see if BLCR properly checkpoints/restarts it?
If so, we can start to see what Open MPI might be doing to confuse
things, but I suspect
cheers
fengguang
On Mon, Mar 29, 2010 at 11:42 AM, Josh Hursey wrote:
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
wrote:
I use mpirun -np 50 -am ft-enable-cr --mca
snapc_base_global_snapshot_dir /mirror
--hostfile .mpihostfile x
Does this happen when you run without '-am ft-enable-cr' (so a no-C/R
run)?
This will help us determine if your problem is with the C/R work or
with the ORTE runtime. I suspect that there is something odd with your
system that is confusing the runtime (so not a C/R problem).
Have you made
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote:
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian
wrote:
I use mpirun -np 50 -am ft-enable-cr --mca
snapc_base_global_snapshot_dir /mirror
--hostfile .mpihostfile
to store the global checkpoint snapshot into the shared
directory /mirror, but
So the MCA parameter that you mention is explained at the link below:
http://osl.iu.edu/research/ft/ompi-cr/api.php#mca-opal_cr_use_thread
This enables/disables the C/R thread a runtime if Open MPI was
configured with C/R thread support:
http://osl.iu.edu/research/ft/ompi-cr/api.php#conf-e
On Mar 20, 2010, at 11:14 PM, ... wrote:
I am observing a very strange performance issue with my openmpi
program.
I have a compute-intensive openmpi based application that keeps the
data in memory, processes the data, and then dumps it to a GPFS parallel
file system. The GPFS parallel file system s
On Mar 22, 2010, at 4:41 PM, ... wrote:
Hi
If I run my compute-intensive openmpi based program using a regular
invocation of mpirun (i.e., mpirun -host <hosts> -np <number of cores>),
it completes in a few seconds, but if I run the same program with the
"-am ft-enable-cr" option, the program takes 10x the time to
On Mar 21, 2010, at 12:58 PM, Addepalli, Srirangam V wrote:
Yes, we have seen this behavior too.
Another behavior I have seen is that one MPI process starts to
show a different elapsed time than its peers. Is it because a
checkpoint happened on behalf of this process?
R
I have not been working with the integration of Open MPI and Torque
directly, so I cannot state how well this is supported. However, the
BLCR folks have been working on a Torque/Open MPI/BLCR project for a
while now, and have had some success. You might want to raise the
question on the BLC
This type of failure is usually due to prelink'ing being left enabled
on one or more of the systems. This has come up multiple times on the
Open MPI list, but is actually a problem between BLCR and the Linux
kernel. BLCR has a FAQ entry on this that you will want to check out:
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink
On Feb 10, 2010, at 9:45 AM, Addepalli, Srirangam V wrote:
> I am trying to test orte-checkpoint with a MPI JOB. It how ever hangs for all
> jobs. This is how i submit the job is started
> mpirun -np 8 -mca ft-enable cr /apps/nwchem-5.1.1/bin/LINUX64/nwchem
> siosi6.nw
This might be the prob
Anton,
I don't know if there usual or typical way of initiating a checkpoint amongst
various resource managers. I know that the BLCR folks (I believe Eric Roman is
heading this effort - CC'ed) have been investigating a tighter integration of
Open MPI, BLCR and Torque. He might be able to give y
Thanks for the bug report. There are a couple of places in the code
that, in a sense, hard code '/tmp' as the temporary directory. It
shouldn't be too hard to fix since there is a common function used in
the code to discover the 'true' temporary directory (which defaults
to /tmp). Of course
v1.5
series if possible.
-- Josh
On Jan 25, 2010, at 3:33 PM, Josh Hursey wrote:
So while working on the error message, I noticed that the global
coordinator was using the wrong path to investigate the checkpoint
metadata. This particular section of code is not often used (which
is
reproduce it. Can you try the
trunk (either SVN checkout or nightly tarball from tonight) and check
if this solves your problem?
Cheers,
Josh
On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:
I am not able to reproduce this problem with the 1.4 branch using a
hostfile, and node configuration
can be wrong? Please instruct me on how to resolve
this problem.
Thank you
Jean
--- On Mon, 11/1/10, Josh Hursey wrote:
From: Josh Hursey
Subject: Re: [OMPI users] checkpointing multi node and multi process
applications
To: "Open MPI Users"
Date: Monday, 11 January, 2010
I tested the 1.4.1 release, and everything worked fine for me (tested
a few different configurations of nodes/environments).
The ompi-checkpoint error you cited is usually caused by one of two
things:
- The PID specified is wrong (which I don't think that is the case
here)
- The session
On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:
Hi Everyone,
I am trying to checkpoint an mpi application
running on multiple nodes. However, I get some error messages when i
trigger the checkpointing process.
Error: expected_component: PID information unavailable!
.../openmpi-1.3.3/bin/orted
sdiaz  30574  0.0  0.0  52772  1188 ?  D  12:54  0:00  \_ /bin/bash /opt/cesga/openmpi-1.3.3/bin/orted
Josh Hursey escribió:
On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
Hi Josh,
You were right. The main problem was the
On Dec 13, 2009, at 3:57 PM, Kritiraj Sajadah wrote:
Dear All,
I am running a simple mpi application which looks as
follows:
##
#include
#include
#include
#include
#include
int main(int argc, char **argv)
{
int rank,size;
MPI_Init(&argc, &argv);
On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:
Dear All,
I am trying to checkpoint am MPI application which has two
processes each running on two seperate hosts.
I run the application as follows:
raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca
btl ^openib
this fixes the problem?
-- Josh
P.S. If you are interested, we have a slightly better version of the
documentation, hosted at the link below:
http://osl.iu.edu/research/ft/ompi-cr/
On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:
Josh Hursey wrote:
(Sorry for the excessiv
ERROR3
I verified that the preload functionality works on the trunk. It seems
to be broken on the v1.3/v1.4 branches. The version of this code has
changed significantly between the v1.3/v1.4 and the trunk/v1.5
versions. I filed a bug about this so it does not get lost:
https://svn.open-mpi.org/tr
of specifying the hostfile,
same result.
thanks,
Jonathan
Josh Hursey wrote:
Though I do not test this scenario (using hostfiles) very often, it
used to work. The ompi-restart command takes a --hostfile (or --
machinefile) argument that is passed directly to the mpirun
command. I wonder if some
Though I do not test this scenario (using hostfiles) very often, it
used to work. The ompi-restart command takes a --hostfile (or --
machinefile) argument that is passed directly to the mpirun command. I
wonder if something broke recently with this handoff. I can certainly
checkpoint with on
mpi process. I can change this protocol and use ssh. So, I'm going to
> test it this afternoon and I will report the results to you.
Try 'ssh' and see if that helps. I suspect the problem is with the session
directory location though.
>
> Regards,
> Sergio
>
On Nov 6, 2009, at 7:59 AM, Kritiraj Sajadah wrote:
> Hi Everyone,
> I have installed openmpi 1.3 and blcr 0.8.1 on my laptop (single
> processor).
>
> I am trying to checkpoint a small test application:
>
> ###
>
> #include
> #include
> #include
> #include
> #include
>
Though the --preload-binary option was created while building the
checkpoint/restart functionality, it does not depend on checkpoint/restart
functionality in any way (just a side effect of the initial development).
The problem you are seeing is a result of the computing environment setup of
password-
On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:
Dear Sergio,
Thank you for your reply. I've inserted the modules into the kernel
and it all worked fine. But there is still a weird issue. I use the
command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-
test" to start the an
On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
Hello,
I have managed to checkpoint a simple program without SGE. Now
I'm trying to do the openmpi+sge integration but I have some
problems... When I try to do checkpoint of the mpirun PID, I got an
error similar to the error gotten wh
On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:
Hi All,
I got a problem when trying to checkpoint an MPI job.
I would really appreciate it if you can help me fix the problem.
The blcr package was installed successfully on the cluster.
I configured openmpi with the flags:
./configure --with-ft=cr --enabl
(Sorry for the excessive delay in replying)
I do not have any experience with the DMTCP project, so I can only
speculate on what might be going on here. If you are using DMTCP to
transparently checkpoint Open MPI you will need to make sure that you
are not using any other interconnect other