
Dask-CUDA Multi-GPU PCA Hanging Issue #9098

@NetZissou

Description

Describe the issue:

Multi-GPU PCA computation using cuml.dask.decomposition.PCA with Dask-CUDA hangs during the fit_transform call, despite the cluster initializing and the data loading successfully.

After inspection, both Dask workers appear to be assigned to the same GPU (GPU 0) instead of being distributed across the available GPUs (GPU 0 and GPU 1), causing resource contention.

Minimal Complete Verifiable Example:

The script follows the cuml.dask.decomposition.PCA code example:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import cupy as cp
from cuml.dask.decomposition import PCA
from cuml.dask.datasets import make_blobs

def main():
    # Step 1: Set up the Dask cluster and client with explicit GPU assignment
    cluster = LocalCUDACluster(
        #n_workers=2,
        threads_per_worker=1,
        CUDA_VISIBLE_DEVICES="0,1"
    )
    client = Client(cluster)
    print("Dask cluster started.")
    
    print(client)
    print(client.scheduler_info()["workers"])

    def get_gpu():
        import numba.cuda
        return numba.cuda.current_context().device.id

    print(client.run(get_gpu))

    # Step 2: Generate synthetic data
    nrows = 6
    ncols = 3
    n_parts = 2

    X_cudf, _ = make_blobs(
        n_samples=nrows,
        n_features=ncols,
        centers=1,
        n_parts=n_parts,
        cluster_std=0.01,
        random_state=10,
        dtype=cp.float32
    )

    # Step 3: Show raw data
    blobs = X_cudf.compute()
    print("Original data (blobs):")
    print(blobs)

    # Step 4: Run PCA
    cuml_model = PCA(n_components=1, whiten=False)
    XT = cuml_model.fit_transform(X_cudf)

    # Step 5: Print transformed result
    print("PCA result:")
    print(XT.compute())

    # Step 6: Cleanup
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()

Execute the script:

$ python run_pca_mnmg.py

Dask cluster started.

<Client: 'tcp://127.0.0.1:37969' processes=2 threads=2, memory=471.91 GiB>

=== CLUSTER WORKERS INFO ===
{
  "tcp://127.0.0.1:35293": {
    "type": "Worker",
    "id": 0,
    "host": "127.0.0.1",
    "resources": {},
    "local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-mtk36sv3",
    "name": 0,
    "nthreads": 1,
    "memory_limit": 253356933120,
    "last_seen": 1753817725.569607,
    "services": {
      "dashboard": 35883
    },
    "metrics": {
      "task_counts": {},
      "bandwidth": {
        "total": 100000000,
        "workers": {},
        "types": {}
      },
      "digests_total_since_heartbeat": {},
      "managed_bytes": 0,
      "spilled_bytes": {
        "memory": 0,
        "disk": 0
      },
      "transfer": {
        "incoming_bytes": 0,
        "incoming_count": 0,
        "incoming_count_total": 0,
        "outgoing_bytes": 0,
        "outgoing_count": 0,
        "outgoing_count_total": 0
      },
      "event_loop_interval": 0.02,
      "cpu": 0.0,
      "memory": 495284224,
      "time": 1753817725.397159,
      "host_net_io": {
        "read_bps": 0.0,
        "write_bps": 0.0
      },
      "host_disk_io": {
        "read_bps": 0.0,
        "write_bps": 0.0
      },
      "num_fds": 25,
      "gpu-memory-total": 42949672960,
      "gpu_utilization": 0,
      "gpu_memory_used": 668008448,
      "rmm": {
        "rmm-used": 0,
        "rmm-total": 0
      }
    },
    "status": "running",
    "nanny": "tcp://127.0.0.1:38941"
  },
  "tcp://127.0.0.1:35497": {
    "type": "Worker",
    "id": 1,
    "host": "127.0.0.1",
    "resources": {},
    "local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-cd3q7p5s",
    "name": 1,
    "nthreads": 1,
    "memory_limit": 253356933120,
    "last_seen": 1753817725.5525525,
    "services": {
      "dashboard": 42907
    },
    "metrics": {
      "task_counts": {},
      "bandwidth": {
        "total": 100000000,
        "workers": {},
        "types": {}
      },
      "digests_total_since_heartbeat": {},
      "managed_bytes": 0,
      "spilled_bytes": {
        "memory": 0,
        "disk": 0
      },
      "transfer": {
        "incoming_bytes": 0,
        "incoming_count": 0,
        "incoming_count_total": 0,
        "outgoing_bytes": 0,
        "outgoing_count": 0,
        "outgoing_count_total": 0
      },
      "event_loop_interval": 0.02,
      "cpu": 0.0,
      "memory": 494931968,
      "time": 1753817725.3971972,
      "host_net_io": {
        "read_bps": 0.0,
        "write_bps": 0.0
      },
      "host_disk_io": {
        "read_bps": 0.0,
        "write_bps": 0.0
      },
      "num_fds": 25,
      "gpu-memory-total": 42949672960,
      "gpu_utilization": 0,
      "gpu_memory_used": 668008448,
      "rmm": {
        "rmm-used": 0,
        "rmm-total": 0
      }
    },
    "status": "running",
    "nanny": "tcp://127.0.0.1:42639"
  }
}

=== GPU ASSIGNMENTS ===
{
  "tcp://127.0.0.1:35293": 0,
  "tcp://127.0.0.1:35497": 0
}

Original data (blobs):
[[8.699719  3.1151562 1.265141 ]
 [8.702976  3.1162481 1.257746 ]
 [8.705302  3.1097922 1.268311 ]
 [8.690783  3.119139  1.2682059]
 [8.692694  3.1102924 1.267426 ]
 [8.709211  3.090703  1.2538605]]

Diagnosis:

  • The program does not crash; it hangs indefinitely at fit_transform
  • Both workers report device_id: 0 instead of 0 and 1 (see the diagnostic sketch below)
  • The workers have different GPU UUIDs in their names but report the same CUDA context ID
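
For reference, here is a small diagnostic (a sketch, not part of the original script; gpu_report is a hypothetical helper) that can be run against the same client to report each worker's CUDA_VISIBLE_DEVICES and the UUID of its active device. CUDA device ids are numbered relative to each worker's CUDA_VISIBLE_DEVICES, which dask-cuda sets per worker, so device_id: 0 on both workers does not by itself prove that they share a physical GPU; matching UUIDs would.

def gpu_report():
    # Hypothetical helper: identify the physical GPU this worker is actually using.
    import os
    import cupy as cp

    dev_id = cp.cuda.runtime.getDevice()                 # id relative to this worker's CUDA_VISIBLE_DEVICES
    props = cp.cuda.runtime.getDeviceProperties(dev_id)  # includes the device UUID on recent CuPy/CUDA
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "device_id": dev_id,
        "uuid": props["uuid"].hex(),
    }

print(client.run(gpu_report))

If the two workers return different UUIDs, they are bound to distinct physical devices despite both reporting device_id 0.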

Attempted Solutions:

We tried several things but were not able to resolve the problem. We can, however, successfully execute the script with a single worker/single GPU; a minimal sketch of that configuration is shown below.
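
A minimal sketch of the single-GPU configuration that runs to completion for us (illustrative only; it mirrors the MCVE above with the cluster restricted to one device):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import cupy as cp
from cuml.dask.decomposition import PCA
from cuml.dask.datasets import make_blobs

# Restrict the cluster to a single GPU; this configuration completes without hanging.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0", threads_per_worker=1)
client = Client(cluster)

X, _ = make_blobs(n_samples=6, n_features=3, centers=1, n_parts=1,
                  cluster_std=0.01, random_state=10, dtype=cp.float32)

XT = PCA(n_components=1, whiten=False).fit_transform(X)
print(XT.compute())

client.close()
cluster.close()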

Environment:

Dask version: 2025.5.0
Python version: Python 3.10.13
Operating System: Red Hat Enterprise Linux 9.4 (Plow)
dask-cuda version: 25.06.00
cuML version: 25.06.00
CUDA version: 12.4.131
Install method: uv (pip-compatible)

Hardware: Dell PowerEdge R7545 with dual NVIDIA A100-PCIE-40GB GPUs
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-6bb39071-3763-ac4d-cae5-a5353c65567e)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-6f66494d-ba59-fab0-e923-fc7ed318d6e9)
Network: HDR100 InfiniBand (interface: ibp65s0)

Thanks in advance for your time!
