I recently tried to start up a c5.12xlarge instance on AWS and ran into a case where the /mnt/shared/etc/slurm.conf file claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows the node reporting less memory than slurm.conf specifies, so Slurm rejects the node and puts it in the DRAIN state:
[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
This led me to the following calculation for expected RealMem for AWS:
python-citc/citc/aws.py, line 104 in c32b80a:
"memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
Contrast this to GCP memory calculation:
python-citc/citc/google.py, line 106 in c32b80a:
"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),
It appears that the AWS config is attempting to estimate how much memory will actually be available (versus what is advertised), but the GCP code drastically underestimates it: it only computes the small headroom term and never subtracts it from the advertised total.
Heavily under-estimating the amount of available memory allows Slurm to be more tolerant of nodes that don't quite meet their advertised claims; however, it can cause issues when jobs request a specific amount of memory. These two cloud formulas should probably be consistent, and I think the estimates need to be more conservative (lower) than what AWS currently calculates, as shown by the c5.12xlarge example above.
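As a purely illustrative sketch of what "consistent and more conservative" could look like (the helper name and the 2% extra margin are my assumptions, not something specified here), both aws.py and google.py could call a shared function that subtracts the current headroom term plus an extra safety margin; for the c5.12xlarge this would have landed below the 94256 MiB the node actually reported:

import math

def estimated_real_mem(advertised_mib: int, extra_margin: float = 0.02) -> int:
    # Hypothetical shared estimate: current headroom term plus an extra fractional margin.
    headroom = int(math.pow(advertised_mib, 0.7) * 0.9 + 500)
    return int((advertised_mib - headroom) * (1 - extra_margin))

print(estimated_real_mem(98304))  # 93092, comfortably below the observed 94256 MiB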