
Feat/SlurmHighAvailability #4386

Open
rbekhtaoui wants to merge 1 commit into base: develop

Conversation

rbekhtaoui
Contributor

Pull Request Summary: High Availability (HA) Support for Slurm on GCP

Overview

This PR introduces support for High Availability (HA) for Slurm on GCP. It improves cluster resilience by enabling redundant Slurm controllers, shared state storage backed by a regional disk, and integrated health monitoring.


Key Features Introduced

1. High Availability for Slurm Controllers

  • Added nb_controllers variable to configure single or dual controller deployment.
  • Controllers are deployed via a regional Managed Instance Group (MIG).
  • Internal static IPs assigned using google_compute_address.
  • DNS resolution handled via a private DNS zone with per-controller records.
  • Updated Slurm configuration templates:
    • SlurmctldHost now supports multiple hosts.
    • AccountingStorageBackupHost and DbdBackupHost added for failover.
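
As a rough illustration of what the updated templates produce, the sketch below shows how the multi-controller entries could be rendered. The helper name and hostnames are placeholders, not the actual conf.py code.

```python
# Minimal sketch, assuming slurm_control_hosts lists controller hostnames in
# priority order; the helper name and structure are hypothetical.
def controller_conf_lines(slurm_control_hosts: list[str]) -> dict[str, list[str]]:
    """Render the HA-related entries for slurm.conf and slurmdbd.conf."""
    slurm_conf = [f"SlurmctldHost={h}" for h in slurm_control_hosts]
    slurmdbd_conf = [f"DbdHost={slurm_control_hosts[0]}"]

    # With a second controller, add the backup entries used for failover.
    if len(slurm_control_hosts) > 1:
        backup = slurm_control_hosts[1]
        slurm_conf.append(f"AccountingStorageBackupHost={backup}")
        slurmdbd_conf.append(f"DbdBackupHost={backup}")
    return {"slurm.conf": slurm_conf, "slurmdbd.conf": slurmdbd_conf}


print(controller_conf_lines(["ha-controller-0", "ha-controller-1"]))
```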

2. HA Architecture for Slurm State Storage

  • Replaced zonal disk with regional persistent disk (google_compute_region_disk).
  • Introduced a dedicated VM (machine_slurm_state_storage) to host the disk.
  • VM deployed via a regional MIG for availability.
  • NFS exports configured for:
    • /var/spool/slurm
    • /opt/apps
    • /home
    • /etc/munge
  • Mount logic updated to support primary/secondary controller roles.
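
Both controllers consume the same shared directories from the state-storage VM. A minimal client-side sketch is shown below; the helper name is hypothetical, and the real handling in the setup scripts adds role checks, retries, and longer timeouts.

```python
import subprocess

# Shared directories exported over NFS by the state-storage VM (per this PR).
NFS_EXPORTS = ["/var/spool/slurm", "/opt/apps", "/home", "/etc/munge"]

def mount_slurm_state(slurm_state_ip: str) -> None:
    """Mount each shared directory from the state-storage VM over NFS."""
    for path in NFS_EXPORTS:
        subprocess.run(
            ["mount", "-t", "nfs", f"{slurm_state_ip}:{path}", path],
            check=True,
        )
```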

3. Integrated Healthcheck Daemon (slurmhcd)

  • New Python daemon slurmhcd.py monitors:
    • Service status (slurmctld, slurmdbd, slurmrestd)
    • TCP port availability
    • Process health
    • SQL connectivity to SlurmDB
  • Exposed via HTTP on port 8080.
  • Managed via systemd (slurmhcd.service) for automatic startup and recovery.
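
A condensed sketch of the kind of checks slurmhcd performs is shown below. It is not the daemon shipped in this PR (which also verifies SQL connectivity and runs under python-daemon/systemd), and the slurmctld port is the Slurm default, assumed here for illustration.

```python
import socket
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICES = ["slurmctld", "slurmdbd", "slurmrestd"]
SLURMCTLD_PORT = 6817  # Slurm default; assumed for illustration

def service_active(name: str) -> bool:
    """Ask systemd whether a unit is currently active."""
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check TCP reachability of a service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Aggregate the individual checks into a single HTTP status.
        healthy = all(service_active(s) for s in SERVICES) and port_open("127.0.0.1", SLURMCTLD_PORT)
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"OK\n" if healthy else b"UNHEALTHY\n")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```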

4. Healthcheck for Slurm State Storage (NFS)

  • Implemented a TCP health check on port 2049 to monitor NFS availability.
  • Healthcheck is attached to the state storage MIG for auto-healing.
  • Firewall rule added to allow traffic from Google health check IP ranges.
  • Ensures reliable access to shared Slurm state and configuration directories.
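
The health check itself is a Terraform-managed TCP check attached to the MIG; conceptually it amounts to the probe below (operator-side sketch, not code from this PR).

```python
import socket

def nfs_reachable(host: str, port: int = 2049, timeout: float = 2.0) -> bool:
    """Mimic the platform health check: a plain TCP connect to the NFS port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```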

5. Secure and Resilient JWT Key Setup

  • Permissions on jwt_hs256.key are validated before reuse.
  • Retry logic added (up to 3 attempts) for key generation.
  • Automatic cleanup and regeneration on failure.
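
A hedged sketch of that flow is shown below. The key path and helper names are assumptions for illustration; the actual logic lives in the setup scripts.

```python
import os
import secrets
import stat

JWT_KEY = "/var/spool/slurm/jwt_hs256.key"  # path assumed for illustration
MAX_ATTEMPTS = 3

def key_usable(path: str) -> bool:
    """Accept an existing key only if it is non-empty and not group/world accessible."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size > 0 and not (st.st_mode & (stat.S_IRWXG | stat.S_IRWXO))

def ensure_jwt_key(path: str = JWT_KEY) -> None:
    if key_usable(path):
        return
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            if os.path.exists(path):
                os.remove(path)              # clean up a bad or partial key
            with open(path, "wb") as f:
                f.write(secrets.token_bytes(32))
            os.chmod(path, 0o600)
            if key_usable(path):
                return
        except OSError:
            if attempt == MAX_ATTEMPTS:
                raise
    raise RuntimeError("could not provision a usable JWT key")
```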

Technical Enhancements

Terraform Modules

  • New variables added:
    • nb_controllers
    • machine_slurm_state_storage
    • slurm_state_ip
    • slurm_state_storage_scopes
    • slurm_control_hosts
  • Refactored:
    • schedmd-slurm-gcp-v6-controller
    • slurm_files
  • Updated outputs:
    • slurm_controller_instance now references the primary template.
    • SSH instructions adapted for MIG-based deployment.

Python Setup Scripts

  • setup.py:
    • Role-aware logic for controller setup.
    • Conditional mounting of the state disk.
    • Multi-host MySQL configuration (see the sketch after this list).
    • Healthcheck daemon installation and activation.
  • conf.py:
    • Dynamic generation of controller host entries for Slurm and SlurmDBD.
  • setup_network_storage.py:
    • Added spool_mount_handler.
    • Increased timeout for critical mounts.
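
For the multi-host MySQL piece referenced above, one way to picture the failover behaviour is the PyMySQL sketch below; the function name, credentials, and database name are placeholders, not the code added in setup.py.

```python
import pymysql

def first_reachable_db_host(hosts: list[str], user: str, password: str,
                            db: str = "slurm_acct_db") -> str | None:
    """Return the first MySQL host that accepts a connection, or None."""
    for host in hosts:
        try:
            conn = pymysql.connect(host=host, user=user, password=password,
                                   database=db, connect_timeout=3)
            conn.close()
            return host
        except pymysql.MySQLError:
            continue
    return None
```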

New Dependencies

Added to requirements.txt:

  • python-daemon==3.0.1
  • PyMySQL==1.1.0
  • psutil==5.9.8

Impact Assessment

Area          | Impact
------------- | -------------------------------------------------------
Availability  | Controllers and state storage are now HA-enabled
Resilience    | Failover supported for SlurmDB and controller services
Observability | Healthcheck daemon provides real-time service status

@rbekhtaoui requested review from samskillman and a team as code owners on July 11, 2025, 15:51