Hi,

On 30.03.2013 at 14:46, Patrick Bégou wrote:
> Ok, so your problem is identified as a stack size problem. I ran into these
> limitations using Intel Fortran compilers on large data problems.
>
> First, it seems you can increase your stack size, as "ulimit -s unlimited"
> works (no lower hard limit is enforced by the system). The best way is to put
> this setting in your .bashrc file so it will work on every node.
> But setting it to unlimited may not be really safe. For example, if a badly
> coded recursive function calls itself without a stop condition, it can
> request all the system memory and crash the node. So set a large but limited
> value; it's safer.
>
> I'm managing a cluster and I always set a maximum value for the stack size.
> I also limit the memory available per core, for system stability. If a user
> requests only one of the 12 cores of a node, he can only access 1/12 of the
> node's memory. If he needs more memory, he has to request 2 cores, even if
> he runs a sequential code. This avoids one job's memory demands crashing the
> jobs of other users on the same node. But this is not configured on your
> node.

This is one way to implement memory limits as a policy - it is then up to the
user to request the correct number of cores even though he only wants to run
a serial job. Personally, I prefer that the user specifies the requested
memory in such a case. It is then up to the queuing system to avoid
scheduling additional jobs to a machine unless the remaining memory is
sufficient for their execution (see the sketch at the end of this message).

-- Reuti

> Duke Nguyen wrote:
>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>> I do not know about your code, but:
>>>
>>> 1) Did you check stack limitations? Typically, Intel Fortran codes need
>>> a large amount of stack when the problem size increases.
>>> Check ulimit -a
>>
>> This is the first time I have heard of stack limitations. Anyway,
>> ulimit -a gives:
>>
>> $ ulimit -a
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 127368
>> max locked memory       (kbytes, -l) unlimited
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 1024
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) 10240
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) 1024
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>>
>> So the stack size is 10MB??? Does this cause the problem? How do I change
>> this?
>>
>>> 2) Does your node use cpusets and memory limitation (like fake NUMA) to
>>> set the maximum amount of memory available for a job?
>>
>> I don't really understand (this is also the first time I've heard of fake
>> NUMA), but I am pretty sure we do not have such things. The server I tried
>> was a dedicated server with two X5420s and 16GB of physical memory.
>>
>>> Patrick
>>>
>>> Duke Nguyen wrote:
>>>> Hi folks,
>>>>
>>>> I am sorry if this question has been asked before, but after ten days
>>>> of searching/working on the system, I surrender :(. We are trying to
>>>> use mpirun to run abinit (abinit.org), which in turn reads an input
>>>> file and runs a simulation. The command is pretty simple:
>>>>
>>>> $ mpirun -np 4 /opt/apps/abinit/bin/abinit < input.files >& output.log
>>>>
>>>> We ran this command on a server with two quad-core X5420s and 16GB of
>>>> memory. I used only 4 cores, and I guess in theory each core should be
>>>> able to take up to 2GB.
>>>>
>>>> In the output log, there is something about memory:
>>>>
>>>> P This job should need less than 717.175 Mbytes of memory.
>>>> Rough estimation (10% accuracy) of disk space for files :
>>>> WF disk file : 69.524 Mbytes ; DEN or POT disk file : 14.240 Mbytes.
>>>>
>>>> So basically it reported that the above job should not take more than
>>>> 718MB per core.
>>>>
>>>> But I still get the Segmentation Fault error:
>>>>
>>>> mpirun noticed that process rank 0 with PID 16099 on node biobos exited
>>>> on signal 11 (Segmentation fault).
>>>>
>>>> The system already has limits set to unlimited:
>>>>
>>>> $ cat /etc/security/limits.conf | grep -v '#'
>>>> * soft memlock unlimited
>>>> * hard memlock unlimited
>>>>
>>>> I also tried to run
>>>>
>>>> $ ulimit -l unlimited
>>>>
>>>> before the mpirun command above, but it did not help at all.
>>>>
>>>> If we adjust the parameters in input.files so that the reported memory
>>>> per core is less than 512MB, then the job runs fine.
>>>>
>>>> Please help.
>>>>
>>>> Thanks,
>>>>
>>>> D.
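To make Patrick's suggestion concrete: a minimal sketch of checking and
raising the stack limit, assuming bash on Linux (the 100 MB value below is
just an illustration, not a tuned recommendation for abinit):

$ ulimit -s            # soft stack limit in KB; 10240 here means 10 MB
$ ulimit -Hs           # hard limit; the soft limit can be raised up to this
$ ulimit -s 102400     # raise the soft limit to 100 MB for this shell
$ echo 'ulimit -s 102400' >> ~/.bashrc   # make it the default on this node

Note that the memlock lines in /etc/security/limits.conf above control locked
memory, not the stack; the corresponding pam_limits item would be, e.g.,

* soft stack 102400
* hard stack unlimited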
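Because mpirun starts the ranks through its own launcher (and, on remote
nodes, through whatever shell startup files are sourced there), the limit
that matters is the one in effect where the ranks actually run, not the one
in your interactive shell. A quick check, assuming the same 4-process setup
as above:

$ mpirun -np 4 bash -c 'ulimit -s'

If this prints 10240 for some ranks while your login shell reports a higher
value, the setting has not reached those nodes yet.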
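And as an illustration of the per-job memory request described above: with
Grid Engine, for example (this assumes a site where the h_vmem complex is
configured as a consumable resource; other schedulers use different syntax),
a serial job can state its memory need directly instead of requesting extra
cores:

$ qsub -l h_vmem=4G job.sh   # job.sh is a hypothetical job script; the
                             # scheduler only places the job on a machine
                             # with 4 GB still unclaimed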