Describe the issue:
Multi-GPU PCA with cuml.dask.decomposition.PCA on a Dask-CUDA cluster freezes/hangs during the modeling phase (fit_transform), even though cluster initialization and data loading succeed.
After inspection, both Dask workers are being assigned to the same GPU (GPU 0) instead of being distributed across the available GPUs (GPU 0 and GPU 1), causing resource contention.
Minimal Complete Verifiable Example:
The script uses cuml.dask.decomposition.PCA.
Code example:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import cupy as cp
from cuml.dask.decomposition import PCA
from cuml.dask.datasets import make_blobs


def main():
    # Step 1: Set up Dask cluster and client with explicit GPU assignment
    cluster = LocalCUDACluster(
        # n_workers=2,
        threads_per_worker=1,
        CUDA_VISIBLE_DEVICES="0,1",
    )
    client = Client(cluster)
    print("Dask cluster started.")
    print(client)
    print(client.scheduler_info()["workers"])

    def get_gpu():
        import numba.cuda
        return numba.cuda.current_context().device.id

    print(client.run(get_gpu))

    # Step 2: Generate synthetic data
    nrows = 6
    ncols = 3
    n_parts = 2
    X_cudf, _ = make_blobs(
        n_samples=nrows,
        n_features=ncols,
        centers=1,
        n_parts=n_parts,
        cluster_std=0.01,
        random_state=10,
        dtype=cp.float32,
    )

    # Step 3: Show raw data
    blobs = X_cudf.compute()
    print("Original data (blobs):")
    print(blobs)

    # Step 4: Run PCA
    cuml_model = PCA(n_components=1, whiten=False)
    XT = cuml_model.fit_transform(X_cudf)

    # Step 5: Print transformed result
    print("PCA result:")
    print(XT.compute())

    # Step 6: Cleanup
    client.close()
    cluster.close()


if __name__ == "__main__":
    main()
Execute the script:
$ python run_pca_mnmg.py
Dask cluster started.
<Client: 'tcp://127.0.0.1:37969' processes=2 threads=2, memory=471.91 GiB>
=== CLUSTER WORKERS INFO ===
{
"tcp://127.0.0.1:35293": {
"type": "Worker",
"id": 0,
"host": "127.0.0.1",
"resources": {},
"local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-mtk36sv3",
"name": 0,
"nthreads": 1,
"memory_limit": 253356933120,
"last_seen": 1753817725.569607,
"services": {
"dashboard": 35883
},
"metrics": {
"task_counts": {},
"bandwidth": {
"total": 100000000,
"workers": {},
"types": {}
},
"digests_total_since_heartbeat": {},
"managed_bytes": 0,
"spilled_bytes": {
"memory": 0,
"disk": 0
},
"transfer": {
"incoming_bytes": 0,
"incoming_count": 0,
"incoming_count_total": 0,
"outgoing_bytes": 0,
"outgoing_count": 0,
"outgoing_count_total": 0
},
"event_loop_interval": 0.02,
"cpu": 0.0,
"memory": 495284224,
"time": 1753817725.397159,
"host_net_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"host_disk_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"num_fds": 25,
"gpu-memory-total": 42949672960,
"gpu_utilization": 0,
"gpu_memory_used": 668008448,
"rmm": {
"rmm-used": 0,
"rmm-total": 0
}
},
"status": "running",
"nanny": "tcp://127.0.0.1:38941"
},
"tcp://127.0.0.1:35497": {
"type": "Worker",
"id": 1,
"host": "127.0.0.1",
"resources": {},
"local_directory": "/tmp/slurmtmp.1908664/dask-scratch-space/worker-cd3q7p5s",
"name": 1,
"nthreads": 1,
"memory_limit": 253356933120,
"last_seen": 1753817725.5525525,
"services": {
"dashboard": 42907
},
"metrics": {
"task_counts": {},
"bandwidth": {
"total": 100000000,
"workers": {},
"types": {}
},
"digests_total_since_heartbeat": {},
"managed_bytes": 0,
"spilled_bytes": {
"memory": 0,
"disk": 0
},
"transfer": {
"incoming_bytes": 0,
"incoming_count": 0,
"incoming_count_total": 0,
"outgoing_bytes": 0,
"outgoing_count": 0,
"outgoing_count_total": 0
},
"event_loop_interval": 0.02,
"cpu": 0.0,
"memory": 494931968,
"time": 1753817725.3971972,
"host_net_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"host_disk_io": {
"read_bps": 0.0,
"write_bps": 0.0
},
"num_fds": 25,
"gpu-memory-total": 42949672960,
"gpu_utilization": 0,
"gpu_memory_used": 668008448,
"rmm": {
"rmm-used": 0,
"rmm-total": 0
}
},
"status": "running",
"nanny": "tcp://127.0.0.1:42639"
}
}
=== GPU ASSIGNMENTS ===
{
"tcp://127.0.0.1:35293": 0,
"tcp://127.0.0.1:35497": 0
}
Original data (blobs):
[[8.699719 3.1151562 1.265141 ]
[8.702976 3.1162481 1.257746 ]
[8.705302 3.1097922 1.268311 ]
[8.690783 3.119139 1.2682059]
[8.692694 3.1102924 1.267426 ]
[8.709211 3.090703 1.2538605]]
Diagnosis:
- The program does not crash; it just hangs indefinitely at fit_transform.
- Both workers report device_id: 0 instead of 0 and 1.
- Workers have different GPU UUIDs in their names but the same CUDA context ID.
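A quick way to cross-check which physical GPU each worker is bound to is to inspect the per-worker CUDA_VISIBLE_DEVICES variable (a minimal sketch; it assumes dask-cuda's usual behavior of giving each worker its own rotated CUDA_VISIBLE_DEVICES, with the worker's assigned GPU listed first):

def visible_devices():
    # dask-cuda normally sets a distinct CUDA_VISIBLE_DEVICES per worker;
    # the first entry should be the GPU that worker is meant to use.
    import os
    return os.environ.get("CUDA_VISIBLE_DEVICES")

# Run on every worker of the client created in the script above
print(client.run(visible_devices))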
Attempted Solutions:
We tried the following, but none of it resolved the problem (see the sketch after this list):
- Using an external Dask scheduler with manual worker spawning
- Trying different network configurations
- Using GPU IDs for CUDA_VISIBLE_DEVICES instead of integer indexes
- Specifying the number of workers explicitly as 2
- Setting CUDA_DEVICE_ORDER=PCI_BUS_ID before the commands
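One of the configurations we tried looked roughly like this (a sketch of the explicit-worker-count attempt; the exact values are illustrative):

import os
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# CUDA_DEVICE_ORDER was exported in the shell before launching;
# it is set inline here only to keep the sketch self-contained.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

cluster = LocalCUDACluster(
    n_workers=2,                 # worker count given explicitly
    threads_per_worker=1,
    CUDA_VISIBLE_DEVICES="0,1",  # also tried GPU UUIDs here instead of indexes
)
client = Client(cluster)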
We can successfully execute the script using a single worker on a single GPU.
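For reference, the single-worker / single-GPU variant that completes successfully differs only in the cluster setup, roughly:

cluster = LocalCUDACluster(
    n_workers=1,
    threads_per_worker=1,
    CUDA_VISIBLE_DEVICES="0",  # a single visible GPU
)
client = Client(cluster)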
Environment:
Dask version: 2025.5.0
Python version: Python 3.10.13
Operating System: Red Hat Enterprise Linux 9.4 (Plow)
dask-cuda version: 25.06.00
cuML version: 25.06.00
CUDA version: 12.4.131
Install method: uv (pip-compatible)
Hardware: Dell PowerEdge R7545 with dual NVIDIA A100-PCIE-40GB GPUs
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-6bb39071-3763-ac4d-cae5-a5353c65567e)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-6f66494d-ba59-fab0-e923-fc7ed318d6e9)
Network: HDR100 InfiniBand (interface: ibp65s0)
Thanks in advance for your time!