To clarify: I can still run a hello world on all 16 threads, but after a few more repetitions of the example the kernel crashes :(
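For reference, a minimal MPI hello world that would produce the output below; this is a sketch, since the actual testMPI/hola source is not shown in this thread (build with "mpicc hola.c -o hola"):

    /* hola.c - minimal MPI hello world (assumed reconstruction) */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                 /* start the MPI runtime   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count     */
        MPI_Get_processor_name(name, &len);     /* host name, e.g. "agua"  */
        printf("Process %d on %s out of %d\n", rank, name, size);
        MPI_Finalize();
        return 0;
    }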
fcluster@agua:~$ mpirun --hostfile localhostfile -np 16 testMPI/hola
Process 0 on agua out of 16
Process 2 on agua out of 16
Process 14 on agua out of 16
Process 8 on agua out of 16
Process 1 on agua out of 16
Process 7 on agua out of 16
Process 9 on agua out of 16
Process 3 on agua out of 16
Process 4 on agua out of 16
Process 10 on agua out of 16
Process 15 on agua out of 16
Process 5 on agua out of 16
Process 6 on agua out of 16
Process 11 on agua out of 16
Process 13 on agua out of 16
Process 12 on agua out of 16
fcluster@agua:~$

On Wed, Jul 28, 2010 at 2:47 PM, Cristobal Navarro <axisch...@gmail.com> wrote:
>
> On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Cristobal
>>
>> In case you are not using the full path name for mpiexec/mpirun,
>> what does "which mpirun" say?
>
> --> $ which mpirun
> /opt/openmpi-1.4.2
>
>> Often times this is a source of confusion; old versions may
>> be first on the PATH.
>>
>> Gus
>
> The OpenMPI version problem is now gone; I can confirm that the version
> is consistent now :), thanks.
>
> However, I keep getting this kernel crash randomly when I execute with
> -np higher than 5.
> These are Xeons with Hyper-Threading on; is that a problem?
>
> I am trying to locate the kernel error in the logs, but after rebooting
> from a crash the error is not in kern.log (nor in kern.log.1).
> All I remember is that it starts with "Kernel BUG..." and at some point
> it mentions a certain CPU X, where that CPU can be any from 0 to 15
> (I am testing only on the main node). Does anyone know where the kernel
> error could be logged? (See the note at the end of this thread.)
>
>> Cristobal Navarro wrote:
>>>
>>> On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>
>>>     Hi Cristobal
>>>
>>>     Does it run only on the head node alone?
>>>     (Fuego? Agua? Acatenango?)
>>>     Try to put only the head node in the hostfile and execute with
>>>     mpiexec.
>>>
>>> --> I will try with only the head node, and post the results back.
>>>
>>>     This may help sort out what is going on.
>>>     Hopefully it will run on the head node.
>>>
>>>     Also, do you have InfiniBand connecting the nodes?
>>>     The error messages refer to the openib BTL (i.e. InfiniBand),
>>>     and complain of
>>>
>>> --> No, we are just using a normal 100 Mbit/s network, since I am only
>>> testing for now.
>>>
>>>     "perhaps a missing symbol, or compiled for a different
>>>     version of Open MPI?".
>>>     It sounds like a mixup of versions/builds.
>>>
>>> --> I agree; somewhere there must be remains of the older version.
>>>
>>>     Did you configure/build OpenMPI from source, or did you install
>>>     it with apt-get?
>>>     It may be easier/less confusing to install from source.
>>>     If you did, what configure options did you use?
>>>
>>> --> I installed from source:
>>>     ./configure --prefix=/opt/openmpi-1.4.2 --with-sge --without-xgrid
>>>     --disable-static
>>>
>>>     Also, as for the OpenMPI runtime environment,
>>>     it is not enough to set it on the command line,
>>>     because it will be effective only on the head node.
>>>     You need to either add them to the PATH and LD_LIBRARY_PATH
>>>     in your .bashrc/.cshrc files (assuming these files and your home
>>>     directory are *also* shared with the nodes via NFS),
>>>     or use the --prefix option of mpiexec to point to the OpenMPI main
>>>     directory.
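[A minimal sketch of the two alternatives Gus describes above, assuming the /opt/openmpi-1.4.2 prefix from the configure line quoted earlier:

    # Option 1: in ~/.bashrc (NFS-shared with the nodes), so every node
    # resolves the same Open MPI build:
    export PATH=/opt/openmpi-1.4.2/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-1.4.2/lib:$LD_LIBRARY_PATH

    # Option 2: point each run at the right build explicitly, so the
    # remote daemons are launched from the same installation:
    mpirun --prefix /opt/openmpi-1.4.2 --hostfile localhostfile -np 16 testMPI/hola
]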
>>> --> Yes, all nodes have their PATH and LD_LIBRARY_PATH set up properly
>>> in the login scripts (.bashrc in my case).
>>>
>>>     Needless to say, you need to check and ensure that the OpenMPI
>>>     directory (and maybe your home directory, and your work directory)
>>>     is (are) really mounted on the nodes.
>>>
>>> --> Yes, double-checked that they are.
>>>
>>>     I hope this helps,
>>>
>>> --> Thanks, really!
>>>
>>>     Gus Correa
>>>
>>> Update: I just reinstalled OpenMPI with the same parameters, and it
>>> seems the problem has gone. I couldn't test it entirely, but when I
>>> get back to the lab I'll confirm.
>>>
>>> best regards!
>>> Cristobal
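[On the earlier question of where the "Kernel BUG..." message ends up, a few places worth checking, as a sketch (log paths vary by distribution; the interface and IP below are placeholders):

    dmesg | grep -i -A 20 'bug'                        # ring buffer, current boot only
    grep -i -A 20 'kernel bug' /var/log/syslog* 2>/dev/null

    # If the machine locks up hard, the oops may never reach disk at all;
    # capturing it reliably needs netconsole (or kdump), e.g.:
    sudo modprobe netconsole netconsole=@/eth0,6666@192.168.0.2/

If the crash takes down the filesystem layer before syslog flushes, kern.log will simply end before the oops, which would match the symptom described above.]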