Hi Mahmood,

If you want the virtual memory size to be unrestricted by Slurm, set VSizeFactor to 0 in slurm.conf, which according to the documentation disables virtual memory limit enforcement:
https://slurm.schedmd.com/slurm.conf.html#OPT_VSizeFactor
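For what it's worth, the numbers in your log are consistent with your current setting rather than a bug: with --mem=50GB and VSizeFactor=110, Slurm sets the virtual memory limit to 110% of the requested real memory, i.e. 52428800 kbytes * 1.10 = 57671680 kbytes, which is exactly what "ulimit -v" reports in my_blast.log. With VSizeFactor=0 you can keep --mem at the job's real physical memory requirement while the large virtual size of BLAST is no longer capped.

As a rough sketch (exact file locations and the reconfigure/restart procedure depend on your site), the change would look like:

    # in slurm.conf, kept identical on the controller and all compute nodes
    # 0 disables enforcement of the virtual memory limit
    VSizeFactor=0

followed by "scontrol reconfigure" (or a restart of slurmctld/slurmd). You can confirm the active value with:

    scontrol show config | grep VSizeFactor

There is also a quick way to check the limit a job actually receives; see the note below your quoted mail.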
-Sean

On Mon, Jan 27, 2020 at 11:47 PM Mahmood Naderan <mahmood...@gmail.com> wrote:
>
> >This line is probably what is limiting you to around 40gb.
> >
> >#SBATCH --mem=38GB
>
> Yes. If I change that value, the "ulimit -v" also changes. See below.
>
> [shams@hpc ~]$ cat slurm_blast.sh | grep mem
> #SBATCH --mem=50GB
> [shams@hpc ~]$ cat my_blast.log
> virtual memory (kbytes, -v) 57671680
> /var/spool/slurmd/job00306/slurm_script: line 13: ulimit: virtual memory: cannot modify limit: Operation not permitted
> virtual memory (kbytes, -v) 57671680
> Error memory mapping:/home/shams/ncbi-blast-2.9.0+/bin/nr.69.psq openedFilesCount=168 threadID=0
> Error: NCBI C++ Exception:
>
> However, the solution is not to change that parameter. There are two issues with that:
>
> 1) --mem refers to the physical memory requested by the job, which is later reserved for the job by Slurm.
> So, on a 64GB node, if a user requests --mem=50GB, no one else can run a job that needs 10GB of memory.
>
> 2) The virtual size of the program (according to top) is about 140GB.
> So, if I set --mem=140GB, the job gets stuck in the queue because the requested resources are invalid (the node has only 64GB of memory).
>
> I really think there is a problem with Slurm but cannot find the root of the problem. The Slurm config parameters are:
>
> Configuration data as of 2020-01-28T08:04:55
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits,qos,safe,wckeys
> AccountingStorageHost = hpc
> AccountingStorageLoc = N/A
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = 0
> AuthAltTypes = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2020-01-27T09:53:58
> BurstBufferType = (null)
> CheckpointType = checkpoint/none
> CliFilterPlugins = (null)
> ClusterName = jupiter
> CommunicationParameters = (null)
> CompleteWait = 0 sec
> CoreSpecPlugin = core_spec/none
> CpuFreqDef = Unknown
> CpuFreqGovernors = Performance,OnDemand,UserSpace
> CredType = cred/munge
> DebugFlags = Backfill,BackfillMap,NO_CONF_HASH,Priority
> DefMemPerNode = UNLIMITED
> DisableRootJobs = No
> EioTimeout = 60
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FairShareDampeningFactor = 5
> FastSchedule = 0
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = gpu
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 30 sec
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCheckpointDir = /var/spool/slurm.checkpoint
> JobCompHost = hpc
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobContainerType = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 60 sec
> LaunchParameters = (null)
> LaunchType = launch/slurm
> Layouts =
> Licenses = (null)
> LicensesUsed = (null)
> LogTimeFormat = iso8601_ms
> MailDomain = (null)
> MailProg = /bin/mail
> MaxArraySize = 1001
> MaxJobCount = 10000
> MaxJobId = 67043328
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 512
> MCSPlugin = mcs/none
> MCSParameters = (null)
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> MsgAggregationParams = (null)
> NEXT_JOB_ID = 305
> NodeFeaturesPlugins = (null)
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PowerParameters = (null)
> PowerPlugin =
> PreemptMode = OFF
> PreemptType = preempt/none
> PreemptExemptTime = 00:00:00
> PriorityParameters = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityDecayHalfLife = 14-00:00:00
> PriorityCalcPeriod = 00:05:00
> PriorityFavorSmall = No
> PriorityFlags =
> PriorityMaxAge = 1-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType = priority/multifactor
> PriorityWeightAge = 10
> PriorityWeightAssoc = 0
> PriorityWeightFairShare = 10000
> PriorityWeightJobSize = 100
> PriorityWeightPartition = 10000
> PriorityWeightQOS = 0
> PriorityWeightTRES = cpu=2000,mem=1,gres/gpu=400
> PrivateData = none
> ProctrackType = proctrack/linuxproc
> Prolog = (null)
> PrologEpilogTimeout = 65534
> PrologSlurmctld = (null)
> PrologFlags = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram = (null)
> ReconfigFlags = (null)
> RequeueExit = (null)
> RequeueExitHold = (null)
> ResumeFailProgram = (null)
> ResumeProgram = /etc/slurm/resumehost.sh
> ResumeRate = 4 nodes/min
> ResumeTimeout = 450 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 2
> RoutePlugin = route/default
> SallocDefaultCommand = (null)
> SbcastParameters = (null)
> SchedulerParameters = (null)
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = root(0)
> SlurmctldAddr = (null)
> SlurmctldDebug = info
> SlurmctldHost[0] = hpc(10.1.1.1)
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmctldPort = 6817
> SlurmctldSyslogDebug = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg = (null)
> SlurmctldTimeout = 300 sec
> SlurmctldParameters = (null)
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdParameters = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurmd
> SlurmdSyslogDebug = unknown
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogFile = (null)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 19.05.2
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
> StateSaveLocation = /var/spool/slurm.state
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = /etc/slurm/suspendhost.sh
> SuspendRate = 4 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 45 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TCPTimeout = 2 sec
> TmpFS = /state/partition1
> TopologyParam = (null)
> TopologyPlugin = topology/none
> TrackWCKey = Yes
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 110 percent
> WaitTime = 60 sec
> X11Parameters = (null)
>
> Regards,
> Mahmood
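Regarding your points 1) and 2) above: once virtual memory enforcement is disabled, --mem only needs to cover the physical memory BLAST actually uses, so there should be no need to request 140GB. A quick way to see the limit a job step actually receives is something like the following (the --mem value is just a placeholder; add whatever partition or account options your site requires):

    srun --mem=4G bash -c 'ulimit -v'

With VSizeFactor=110 this should print roughly 110% of the requested memory in kbytes; with VSizeFactor=0, Slurm should no longer lower it, and the job should see the limit inherited from your login environment instead (your config has PropagateResourceLimits = ALL).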