Describe the issue:
Multi-GPU PCA with cuml.dask.decomposition.PCA on a Dask-CUDA cluster freezes/hangs during the modeling phase (fit_transform), even though cluster initialization and data loading succeed.
After inspection, both Dask workers are being assigned to the same GPU (GPU 0) instead of being distributed across the available GPUs (GPU 0 and GPU 1), causing resource contention.
Minimal Complete Verifiable Example:
The script uses cuml.dask.decomposition.PCA.
Code example:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import cupy as cp
from cuml.dask.decomposition import PCA
from cuml.dask.datasets import make_blobs


def main():
    # Step 1: Set up Dask cluster and client with explicit GPU assignment
    cluster = LocalCUDACluster(
        # n_workers=2,
        threads_per_worker=1,
        CUDA_VISIBLE_DEVICES="0,1",
    )
    client = Client(cluster)
    print("Dask cluster started.")
    print(client)
    print(client.scheduler_info()["workers"])

    def get_gpu():
        import numba.cuda
        return numba.cuda.current_context().device.id

    print(client.run(get_gpu))

    # Step 2: Generate synthetic data
    nrows = 6
    ncols = 3
    n_parts = 2
    X_cudf, _ = make_blobs(
        n_samples=nrows,
        n_features=ncols,
        centers=1,
        n_parts=n_parts,
        cluster_std=0.01,
        random_state=10,
        dtype=cp.float32,
    )

    # Step 3: Show raw data
    blobs = X_cudf.compute()
    print("Original data (blobs):")
    print(blobs)

    # Step 4: Run PCA
    cuml_model = PCA(n_components=1, whiten=False)
    XT = cuml_model.fit_transform(X_cudf)

    # Step 5: Print transformed result
    print("PCA result:")
    print(XT.compute())

    # Step 6: Cleanup
    client.close()
    cluster.close()


if __name__ == "__main__":
    main()
Execute the script:
$ python run_pca_mnmg.py
Dask cluster started.
<Client: 'tcp://127.0.0.1:37969' processes=2 threads=2, memory=471.91 GiB>
=== CLUSTER WORKERS INFO ===
{
"tcp://127.0.0.1:35293": {
"type": "Worker",
"id": 0,
"host": "127.0.0.1",
"resources": {},
"local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-mtk36sv3",
"name": 0,
"nthreads": 1,
"memory_limit": 253356933120,
"last_seen": 1753817725.569607,
"services": {
"dashboard": 35883
},
"metrics": {
"task_counts": {},
"bandwidth": {
"total": 100000000,
"workers": {},
"types": {}
},
"digests_total_since_heartbeat": {},
"managed_bytes": 0,
"spilled_bytes": {
"memory": 0,
"disk": 0
},
"transfer": {
"incoming_bytes": 0,
"incoming_count": 0,
"incoming_count_total": 0,
"outgoing_bytes": 0,
"outgoing_count": 0,
"outgoing_count_total": 0
},
"event_loop_interval": 0.02,
"cpu": 0.0,
"memory": 495284224,
"time": 1753817725.397159,
"host_net_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"host_disk_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"num_fds": 25,
"gpu-memory-total": 42949672960,
"gpu_utilization": 0,
"gpu_memory_used": 668008448,
"rmm": {
"rmm-used": 0,
"rmm-total": 0
}
},
"status": "running",
"nanny": "tcp://127.0.0.1:38941"
},
"tcp://127.0.0.1:35497": {
"type": "Worker",
"id": 1,
"host": "127.0.0.1",
"resources": {},
"local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-cd3q7p5s",
"name": 1,
"nthreads": 1,
"memory_limit": 253356933120,
"last_seen": 1753817725.5525525,
"services": {
"dashboard": 42907
},
"metrics": {
"task_counts": {},
"bandwidth": {
"total": 100000000,
"workers": {},
"types": {}
},
"digests_total_since_heartbeat": {},
"managed_bytes": 0,
"spilled_bytes": {
"memory": 0,
"disk": 0
},
"transfer": {
"incoming_bytes": 0,
"incoming_count": 0,
"incoming_count_total": 0,
"outgoing_bytes": 0,
"outgoing_count": 0,
"outgoing_count_total": 0
},
"event_loop_interval": 0.02,
"cpu": 0.0,
"memory": 494931968,
"time": 1753817725.3971972,
"host_net_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"host_disk_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"num_fds": 25,
"gpu-memory-total": 42949672960,
"gpu_utilization": 0,
"gpu_memory_used": 668008448,
"rmm": {
"rmm-used": 0,
"rmm-total": 0
}
},
"status": "running",
"nanny": "tcp://127.0.0.1:42639"
}
}
=== GPU ASSIGNMENTS ===
{
"tcp://127.0.0.1:35293": 0,
"tcp://127.0.0.1:35497": 0
}
Original data (blobs):
[[8.699719 3.1151562 1.265141 ]
[8.702976 3.1162481 1.257746 ]
[8.705302 3.1097922 1.268311 ]
[8.690783 3.119139 1.2682059]
[8.692694 3.1102924 1.267426 ]
[8.709211 3.090703 1.2538605]]
Diagnosis:
- The program does not crash; it just hangs indefinitely at fit_transform.
- Both workers report device_id: 0 instead of 0 and 1.
- Workers have different GPU UUIDs in their names but the same CUDA context ID.
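A quick way to cross-check which physical GPU each worker is bound to is to inspect the per-worker CUDA_VISIBLE_DEVICES variable (a minimal sketch; it assumes dask-cuda's usual behavior of giving each worker its own rotated CUDA_VISIBLE_DEVICES, with the worker's assigned GPU listed first):

def visible_devices():
    # dask-cuda normally sets a distinct CUDA_VISIBLE_DEVICES per worker;
    # the first entry should be the GPU that worker is meant to use.
    import os
    return os.environ.get("CUDA_VISIBLE_DEVICES")

# Run on every worker of the client created in the script above
print(client.run(visible_devices))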
Attempted Solutions:
We tried the following, but none of it resolved the problem (see the sketch after this list):
- Using an external Dask scheduler with manual worker spawning
- Trying different network configurations
- Using GPU IDs for CUDA_VISIBLE_DEVICES instead of integer indexes
- Specifying the number of workers explicitly as 2
- Setting CUDA_DEVICE_ORDER=PCI_BUS_ID before the commands
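One of the configurations we tried looked roughly like this (a sketch of the explicit-worker-count attempt; the exact values are illustrative):

import os
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# CUDA_DEVICE_ORDER was exported in the shell before launching;
# it is set inline here only to keep the sketch self-contained.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

cluster = LocalCUDACluster(
    n_workers=2,                 # worker count given explicitly
    threads_per_worker=1,
    CUDA_VISIBLE_DEVICES="0,1",  # also tried GPU UUIDs here instead of indexes
)
client = Client(cluster)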
We can successfully execute the script using a single worker on a single GPU.
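For reference, the single-worker / single-GPU variant that completes successfully differs only in the cluster setup, roughly:

cluster = LocalCUDACluster(
    n_workers=1,
    threads_per_worker=1,
    CUDA_VISIBLE_DEVICES="0",  # a single visible GPU
)
client = Client(cluster)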
Environment:
Dask version: 2025.5.0
Python version: Python 3.10.13
Operating System: Red Hat Enterprise Linux 9.4 (Plow)
dask-cuda version: 25.06.00
cuML version: 25.06.00
CUDA version: 12.4.131
Install method: uv (pip-compatible)
Hardware: Dell PowerEdge R7545 with dual NVIDIA A100-PCIE-40GB GPUs
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-6bb39071-3763-ac4d-cae5-a5353c65567e)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-6f66494d-ba59-fab0-e923-fc7ed318d6e9)
Network: HDR100 InfiniBand (interface: ibp65s0)
Thanks in advance for your time!