We're running OGE 2011.11 on an SGI UV 2000 (SMP) with 256 hyperthreaded cores (128 physical). When we run an OpenMP job directly on the system (outside of Grid Engine), it runs fine. Here's the job:

#include <iostream>
#include <cstring>
#include <cstdlib>
#include <new>      // std::nothrow
#include <math.h>
#include <omp.h>

using namespace std;

int main( int argc, char* argv[] )
{
#if _OPENMP
    // Show how many threads we have available
    int max_t = omp_get_max_threads();
    cout << "OpenMP using up to " << max_t << " threads" << endl;
#else
    cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
    return -1;
#endif

    const long N = 115166;
    const long bytesRequested = N * N * sizeof(double);
    cout << "Allocating " << bytesRequested << " bytes for matrix" << endl;

    // nothrow new returns NULL on failure instead of throwing bad_alloc,
    // so the check below actually fires if the allocation fails
    double* S = new (std::nothrow) double[ N * N ];
    if( NULL == S )
    {
        cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
        return -1;
    }

    cout << "Entering main loop" << endl;

#pragma omp parallel for schedule(static)
    for( long i = 0; i < N - 1; i++ )
    {
        for( long j = i + 1; j < N; j++ )
        {
#if _OPENMP
            // report the team size once, from the first iteration
            if( 0 == i && 1 == j )
            {
                int nThreads = omp_get_num_threads();
                cout << "OpenMP loop using " << nThreads << " threads" << endl;
            }
#endif
            S[ i * N + j ] = sqrt( (double)( i + j ) );
        }
    }

    cout << "Loop completed" << endl;

    delete [] S;   // array delete to match new[]
    return 0;
}
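For reference, there's nothing unusual about the build; with GCC, for example, it's just:

g++ -O2 -fopenmp OMPtest.cpp -o OMPtest

(-fopenmp is the GCC OpenMP flag; other compilers use their own equivalent.)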
And here it is being executed:

[c++]$ ./OMPtest
OpenMP using up to 256 threads
Allocating 106105660448 bytes for matrix
Entering main loop
OpenMP loop using 256 threads
Loop completed

However, when I submit it through the queue using the following parallel environment (or, so far, any other PE), the CPU load shoots through the roof (well over 256) and the system becomes completely unresponsive and has to be power cycled. Here's my PE:

[c++]$ qconf -sp threaded
pe_name            threaded
slots              10000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE

I've changed control_slaves, job_is_first_task, and slots (reduced to under 140; anything over 140 gives me the runaway load condition described above). I've even used different parallel environments that I've created, and I've also reduced the slot count in the queue to 140, yet the load still runs away to over 256 and locks the machine (requiring a hard reboot).

Lastly, I've tried numerous iterations of my qsub script; here's the current version:

#!/bin/sh
#$ -cwd
#$ -q sgi-test
## email on a - abort, b - begin, e - end
#$ -m abe
#$ -M <email address>
#source ~/.bash_profile
## for this job, specifying the threaded environment with a "-" ensures the max number of processors is used
#$ -pe threaded -
echo "slots = $NSLOTS"
export OMP_NUM_THREADS=$NSLOTS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "Running on host=$HOSTNAME"
## memory resource request per thread, max 24 for 32 threads
#$ -l h_vmem=4G
##$ -V
## this environment variable setting is needed only for OpenMP-parallelized applications
## finally! -- run your process
<path>/OMPtest

Since requesting unlimited processors/slots has always crashed the machine, I've specified:

#$ -pe threaded 139

Anything above 139 crashes the machine, yet there's no output in mcelog or /var/log/messages. Any insight into what could be happening would be greatly appreciated!
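In case it helps with the diagnosis, this is the sort of instrumentation I can drop into the job script to record what the job actually sees on the node (a rough sketch only; it assumes taskset from util-linux and nproc from coreutils are installed there):

echo "nproc sees:        $(nproc) CPUs"
echo "affinity:          $(taskset -cp $$)"
echo "Cpus_allowed_list: $(grep Cpus_allowed_list /proc/self/status)"
echo "ulimit -v:         $(ulimit -v)"

These lines only read the shell's own CPU affinity and memory limit, so they shouldn't add any load of their own.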
Scott Lucas
HPC Applications Support
208-776-0209
lucas.sco...@mayo.edu

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users