
Feat/SlurmHighAvailability #4386

Open
rbekhtaoui wants to merge 1 commit into base: develop

Conversation

rbekhtaoui
Contributor

Pull Request Summary: High Availability (HA) Support for Slurm on GCP

Overview

This PR introduces support for High Availability (HA) for Slurm on GCP. It improves cluster resilience by enabling redundant Slurm controllers, shared state storage backed by a regional disk, and integrated health monitoring.


Key Features Introduced

1. High Availability for Slurm Controllers

  • Added nb_controllers variable to configure single or dual controller deployment.
  • Controllers are deployed via a regional Managed Instance Group (MIG).
  • Internal static IPs assigned using google_compute_address.
  • DNS resolution handled via a private DNS zone with per-controller records.
  • Updated Slurm configuration templates:
    • SlurmctldHost now supports multiple hosts.
    • AccountingStorageBackupHost and DbdBackupHost added for failover.
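
As a rough illustration of what the updated templates produce, the sketch below shows how the multi-controller entries could be rendered. The helper name and hostnames are placeholders, not the actual conf.py code.

```python
# Minimal sketch, assuming slurm_control_hosts lists controller hostnames in
# priority order; the helper name and structure are hypothetical.
def controller_conf_lines(slurm_control_hosts: list[str]) -> dict[str, list[str]]:
    """Render the HA-related entries for slurm.conf and slurmdbd.conf."""
    slurm_conf = [f"SlurmctldHost={h}" for h in slurm_control_hosts]
    slurmdbd_conf = [f"DbdHost={slurm_control_hosts[0]}"]

    # With a second controller, add the backup entries used for failover.
    if len(slurm_control_hosts) > 1:
        backup = slurm_control_hosts[1]
        slurm_conf.append(f"AccountingStorageBackupHost={backup}")
        slurmdbd_conf.append(f"DbdBackupHost={backup}")
    return {"slurm.conf": slurm_conf, "slurmdbd.conf": slurmdbd_conf}


print(controller_conf_lines(["ha-controller-0", "ha-controller-1"]))
```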

2. HA Architecture for Slurm State Storage

  • Replaced zonal disk with regional persistent disk (google_compute_region_disk).
  • Introduced a dedicated VM (machine_slurm_state_storage) to host the disk.
  • VM deployed via a regional MIG for availability.
  • NFS exports configured for:
    • /var/spool/slurm
    • /opt/apps
    • /home
    • /etc/munge
  • Mount logic updated to support primary/secondary controller roles.
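
Both controllers consume the same shared directories from the state-storage VM. A minimal client-side sketch is shown below; the helper name is hypothetical, and the real handling in the setup scripts adds role checks, retries, and longer timeouts.

```python
import subprocess

# Shared directories exported over NFS by the state-storage VM (per this PR).
NFS_EXPORTS = ["/var/spool/slurm", "/opt/apps", "/home", "/etc/munge"]

def mount_slurm_state(slurm_state_ip: str) -> None:
    """Mount each shared directory from the state-storage VM over NFS."""
    for path in NFS_EXPORTS:
        subprocess.run(
            ["mount", "-t", "nfs", f"{slurm_state_ip}:{path}", path],
            check=True,
        )
```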

3. Integrated Healthcheck Daemon (slurmhcd)

  • New Python daemon slurmhcd.py monitors:
    • Service status (slurmctld, slurmdbd, slurmrestd)
    • TCP port availability
    • Process health
    • SQL connectivity to SlurmDB
  • Exposed via HTTP on port 8080.
  • Managed via systemd (slurmhcd.service) for automatic startup and recovery.
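
A condensed sketch of the kind of checks slurmhcd performs is shown below. It is not the daemon shipped in this PR (which also verifies SQL connectivity and runs under python-daemon/systemd), and the slurmctld port is the Slurm default, assumed here for illustration.

```python
import socket
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICES = ["slurmctld", "slurmdbd", "slurmrestd"]
SLURMCTLD_PORT = 6817  # Slurm default; assumed for illustration

def service_active(name: str) -> bool:
    """Ask systemd whether a unit is currently active."""
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check TCP reachability of a service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Aggregate the individual checks into a single HTTP status.
        healthy = all(service_active(s) for s in SERVICES) and port_open("127.0.0.1", SLURMCTLD_PORT)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"OK\n" if healthy else b"UNHEALTHY\n")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```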

4. Healthcheck for Slurm State Storage (NFS)

  • Implemented a TCP health check on port 2049 to monitor NFS availability.
  • Healthcheck is attached to the state storage MIG for auto-healing.
  • Firewall rule added to allow traffic from Google health check IP ranges.
  • Ensures reliable access to shared Slurm state and configuration directories.
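
The health check itself is a Terraform-managed TCP check attached to the MIG; conceptually it amounts to the probe below (operator-side sketch, not code from this PR).

```python
import socket

def nfs_reachable(host: str, port: int = 2049, timeout: float = 2.0) -> bool:
    """Mimic the platform health check: a plain TCP connect to the NFS port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```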

5. Secure and Resilient JWT Key Setup

  • Permissions on jwt_hs256.key are validated before reuse.
  • Retry logic added (up to 3 attempts) for key generation.
  • Automatic cleanup and regeneration on failure.
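
A hedged sketch of that flow is shown below. The key path and helper names are assumptions for illustration; the actual logic lives in the setup scripts.

```python
import os
import secrets
import stat

JWT_KEY = "/var/spool/slurm/jwt_hs256.key"  # path assumed for illustration
MAX_ATTEMPTS = 3

def key_usable(path: str) -> bool:
    """Accept an existing key only if it is non-empty and not group/world accessible."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size > 0 and not (st.st_mode & (stat.S_IRWXG | stat.S_IRWXO))

def ensure_jwt_key(path: str = JWT_KEY) -> None:
    if key_usable(path):
        return
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            if os.path.exists(path):
                os.remove(path)              # clean up a bad or partial key
            with open(path, "wb") as f:
                f.write(secrets.token_bytes(32))
            os.chmod(path, 0o600)
            if key_usable(path):
                return
        except OSError:
            if attempt == MAX_ATTEMPTS:
                raise
    raise RuntimeError("could not provision a usable JWT key")
```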

Technical Enhancements

Terraform Modules

  • New variables added:
    • nb_controllers
    • machine_slurm_state_storage
    • slurm_state_ip
    • slurm_state_storage_scopes
    • slurm_control_hosts
  • Refactored:
    • schedmd-slurm-gcp-v6-controller
    • slurm_files
  • Updated outputs:
    • slurm_controller_instance now references the primary template.
    • SSH instructions adapted for MIG-based deployment.

Python Setup Scripts

  • setup.py:
    • Role-aware logic for controller setup.
    • Conditional mounting of the state disk.
    • Multi-host MySQL configuration (see the sketch after this list).
    • Healthcheck daemon installation and activation.
  • conf.py:
    • Dynamic generation of controller host entries for Slurm and SlurmDBD.
  • setup_network_storage.py:
    • Added spool_mount_handler.
    • Increased timeout for critical mounts.
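
For the multi-host MySQL piece referenced above, one way to picture the failover behaviour is the PyMySQL sketch below; the function name, credentials, and database name are placeholders, not the code added in setup.py.

```python
import pymysql

def first_reachable_db_host(hosts: list[str], user: str, password: str,
                            db: str = "slurm_acct_db") -> str | None:
    """Return the first MySQL host that accepts a connection, or None."""
    for host in hosts:
        try:
            conn = pymysql.connect(host=host, user=user, password=password,
                                   database=db, connect_timeout=3)
            conn.close()
            return host
        except pymysql.MySQLError:
            continue
    return None
```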

New Dependencies

Added to requirements.txt:

  • python-daemon==3.0.1
  • PyMySQL==1.1.0
  • psutil==5.9.8

Impact Assessment

Area          | Impact
------------- | -------------------------------------------------------
Availability  | Controllers and state storage are now HA-enabled
Resilience    | Failover supported for SlurmDB and controller services
Observability | Healthcheck daemon provides real-time service status

@rbekhtaoui requested review from samskillman and a team as code owners on July 11, 2025, 15:51