Performance of dizoo/atari/example/atari_dqn_dist_rdma.py

- memory: "32Gi"
- cpu: 16
- gpu: A100

| Test case   | Avg. time (s) |
| ----------- | ------------- |
| TCP-nng     | 127.64        |
| torchrpc-CP | 29.3906       |
| torchrpc-IB | 28.7763       |

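To put the averages in perspective, the relative speedup of each torchrpc variant over the TCP-nng baseline can be derived directly from the table. The sketch below is only arithmetic on the reported numbers; the dictionary and function names are illustrative and not part of the benchmark script.

```python
# Average run times in seconds, copied from the table above.
avg_seconds = {
    "TCP-nng": 127.64,
    "torchrpc-CP": 29.3906,
    "torchrpc-IB": 28.7763,
}


def speedup_over_baseline(times, baseline="TCP-nng"):
    """Return baseline_time / time for every non-baseline test case."""
    base = times[baseline]
    return {name: base / t for name, t in times.items() if name != baseline}


speedups = speedup_over_baseline(avg_seconds)
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster than TCP-nng")
```

Under these numbers, both torchrpc variants come out roughly 4.3x to 4.4x faster than TCP-nng, with torchrpc-IB only slightly ahead of torchrpc-CP.
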
## Problems you may encounter
The torchrpc message queue uses [tensorpipe](https://github.com/pytorch/tensorpipe), a high-performance, modular tensor point-to-point communication library, as its communication backend. However, our tests uncovered several tensorpipe defects that may make it difficult to use.
@@ -10,4 +47,8 @@ Tensorpipe is not container aware. Processes can find themselves on the same phy
### 2. RDMA and fork subprocess
Tensorpipe does not handle calls to [fork(2)](https://man7.org/linux/man-pages/man2/fork.2.html) while RDMA is in use. If fork support is not initialized before RDMA is used, calling fork can cause serious problems; see [ibv_fork_init()](https://www.rdmamojo.com/2012/05/24/ibv_fork_init/) for details. Therefore, if you start ditask in an IB/RoCE network environment, set the environment variables `IBV_FORK_SAFE=1` and `RDMAV_FORK_SAFE=1` so that ibverbs initializes fork support automatically.
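
The two variables can also be set from Python rather than from the shell. The following is a minimal sketch, assuming it runs before libibverbs is initialized in the process (in practice, before torchrpc starts communicating and before any subprocess is forked). `enable_ibverbs_fork_safety` is a hypothetical helper name, not part of ditask or torchrpc.

```python
import os


def enable_ibverbs_fork_safety(env=None):
    """Set IBV_FORK_SAFE and RDMAV_FORK_SAFE to "1" unless already set.

    `env` defaults to the current process environment; passing a plain
    dict is useful when building the environment for a child process.
    Hypothetical helper, shown for illustration only.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value the user has already exported.
    env.setdefault("IBV_FORK_SAFE", "1")
    env.setdefault("RDMAV_FORK_SAFE", "1")
    return env


# Call this at the very top of the entry script, before RDMA is touched.
enable_ibverbs_fork_safety()
```

Exporting the variables in the shell or in the container spec before launching the process has the same effect and avoids any dependence on import order.
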
### 3. GPU direct RDMA
If you use torchrpc in an environment that supports GPU direct RDMA and the tensor transmitted over RPC is very small (less than 32 B), a segmentation fault may occur. See this [issue](https://github.com/pytorch/pytorch/issues/57136). We are tracking this bug and hope it will be resolved eventually.