> This line is probably what is limiting you to around 40gb.
> #SBATCH --mem=38GB
Yes. If I change that value, the "ulimit -v" also changes. See below:

[shams@hpc ~]$ cat slurm_blast.sh | grep mem
#SBATCH --mem=50GB
[shams@hpc ~]$ cat my_blast.log
virtual memory (kbytes, -v) 57671680
/var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory: cannot modify limit: Operation not permitted
virtual memory (kbytes, -v) 57671680
Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq openedFilesCount=168 threadID=0
Error: NCBI C++ Exception:

However, the solution is not to raise that parameter. There are two issues with that:

1) --mem is the physical memory requested by the job, which Slurm then reserves for it. So on a 64GB node, if a user requests --mem=50GB, no one else can run a job that needs more than the remaining 14GB, even if my job never uses its full reservation.

2) The virtual size of the process (according to top) is about 140GB. So if I set --mem=140GB, the job gets stuck in the queue, because the request can never be satisfied (the node has only 64GB of physical memory).

I really think there is a problem with Slurm, but I cannot find the root of it.
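For what it's worth, the reported limit is consistent with the VSizeFactor = 110 percent setting in the config below: assuming --mem=50GB is interpreted as 50 GiB, a quick check gives

[shams@hpc ~]$ echo $(( 50 * 1024 * 1024 ))             # --mem=50GB in kbytes
52428800
[shams@hpc ~]$ echo $(( 50 * 1024 * 1024 * 110 / 100 )) # times VSizeFactor (110%)
57671680

which is exactly the "ulimit -v" value in my_blast.log. So the limit itself seems to be derived from the --mem request, while BLAST's memory-mapped database needs far more address space than that.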
The Slurm config parameters are:

Configuration data as of 2020-01-28T08:04:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe,wckeys
AccountingStorageHost = hpc
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BOOT_TIME = 2020-01-27T09:53:58
BurstBufferType = (null)
CheckpointType = checkpoint/none
CliFilterPlugins = (null)
ClusterName = jupiter
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CredType = cred/munge
DebugFlags = Backfill,BackfillMap,NO_CONF_HASH,Priority
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 5
FastSchedule = 0
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 30 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/spool/slurm.checkpoint
JobCompHost = hpc
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 60 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 305
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 1-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 10
PriorityWeightAssoc = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 100
PriorityWeightPartition = 10000
PriorityWeightQOS = 0
PriorityWeightTRES = cpu=2000,mem=1,gres/gpu=400
PrivateData = none
ProctrackType = proctrack/linuxproc
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = /etc/slurm/resumehost.sh
ResumeRate = 4 nodes/min
ResumeTimeout = 450 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = (null)
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = root(0)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = hpc(10.1.1.1)
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 300 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 19.05.2
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurm.state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = /etc/slurm/suspendhost.sh
SuspendRate = 4 nodes/min
SuspendTime = NONE
SuspendTimeout = 45 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /state/partition1
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = Yes
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 110 percent
WaitTime = 60 sec
X11Parameters = (null)

Regards,
Mahmood