Feat/SlurmHighAvailability #4386
Open
Pull Request Summary: High Availability (HA) Support for Slurm on GCP
Overview
This PR introduces support for High Availability (HA). It enhances cluster resilience by enabling redundant Slurm controllers, robust state storage, and integrated health monitoring.
Key Features Introduced
1. High Availability for Slurm Controllers
- `nb_controllers` variable to configure single or dual controller deployment.
- [...] `google_compute_address`.
- `SlurmctldHost` now supports multiple hosts.
- `AccountingStorageBackupHost` and `DbdBackupHost` added for failover.
2. HA Architecture for Slurm State Storage
- [...] (`google_compute_region_disk`).
- [...] (`machine_slurm_state_storage`) to host the disk.
- `/var/spool/slurm`
- `/opt/apps`
- `/home`
- `/etc/munge`
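To make the relationship between items 1 and 2 concrete, here is a minimal sketch of how a list of controller hosts could be turned into `slurm.conf` lines. In Slurm, the first `SlurmctldHost` entry is the primary controller and later entries are backups, and a backup controller needs access to the same `StateSaveLocation` as the primary, which is why shared state storage matters. The helper name `render_controller_lines`, the host names, the use of `/var/spool/slurm` as the state directory, and the choice of the second controller as the accounting backup host are assumptions for illustration, not taken from this PR's `conf.py`.

```python
def render_controller_lines(control_hosts, state_dir="/var/spool/slurm"):
    """Render HA-related slurm.conf lines (hypothetical helper, not the PR's code)."""
    if not control_hosts:
        raise ValueError("at least one controller host is required")

    # Slurm treats the first SlurmctldHost as primary; later entries are backups.
    lines = [f"SlurmctldHost={host}" for host in control_hosts]

    if len(control_hosts) > 1:
        # Assumption: the backup slurmdbd runs on the second controller host.
        lines.append(f"AccountingStorageBackupHost={control_hosts[1]}")

    # Assumption: the state directory lives on storage shared by all controllers.
    lines.append(f"StateSaveLocation={state_dir}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical host names for a dual-controller deployment.
    print(render_controller_lines(["ctl-0", "ctl-1"]))
```

With a single host the helper emits only one `SlurmctldHost` line and no backup accounting host, which mirrors the single or dual controller choice that `nb_controllers` is described as controlling.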
3. Integrated Healthcheck Daemon (`slurmhcd`)
- `slurmhcd.py` monitors:
  - [...] (`slurmctld`, `slurmdbd`, `slurmrestd`)
- [...] `8080`.
- [...] (`slurmhcd.service`) for automatic startup and recovery.
4. Healthcheck for Slurm State Storage (NFS)
5. Secure and Resilient JWT Key Setup
- [...] `jwt_hs256.key` are validated before reuse.
Technical Enhancements
Terraform Modules
- `nb_controllers`
- `machine_slurm_state_storage`
- `slurm_state_ip`
- `slurm_state_storage_scopes`
- `slurm_control_hosts`
- `schedmd-slurm-gcp-v6-controller`
- `slurm_files`
- `slurm_controller_instance` now references the primary template.
Python Setup Scripts
- `setup.py`: [...]
- `conf.py`: [...]
- `setup_network_storage.py`: [...] `spool_mount_handler`.
New Dependencies
Added to `requirements.txt`:
- `python-daemon==3.0.1`
- `PyMySQL==1.1.0`
- `psutil==5.9.8`
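For readers who want a concrete picture of the healthcheck daemon described under Key Features, the following is a minimal sketch of a process check exposed over HTTP. It is not the PR's `slurmhcd.py`. It uses only the standard library rather than the dependencies listed above, reads process names from a Linux-style `/proc`, and assumes that `8080` is the listening port and that a 200/503 JSON response is an acceptable health signal. The names `running_process_names`, `health_report`, `HealthHandler`, and `serve` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Daemons named in the PR summary.
REQUIRED = ("slurmctld", "slurmdbd", "slurmrestd")


def running_process_names(proc_root="/proc"):
    """Collect command names of running processes from a Linux-style /proc."""
    names = set()
    for comm in Path(proc_root).glob("[0-9]*/comm"):
        try:
            names.add(comm.read_text().strip())
        except OSError:
            continue  # the process exited while we were scanning
    return names


def health_report(running, required=REQUIRED):
    """Return (http_status, body); 200 only if every required daemon is running."""
    services = {name: name in running for name in required}
    status = 200 if all(services.values()) else 503
    return status, {"healthy": status == 200, "services": services}


class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with the current health report as JSON."""

    def do_GET(self):
        status, body = health_report(running_process_names())
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Block forever serving health reports (assumed port, see lead-in)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping `health_report` free of I/O makes the pass/fail rule easy to test separately from process discovery and from the HTTP layer.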
Impact Assessment