Optimize low latency combine recv kernel (about 3.0x speedup) #248

Open · wants to merge 120 commits into main

Conversation


@fzyzcjy commented Jun 23, 2025

EDIT: The code is ready and only needs cleanup. Since @LyricZhao has been quite busy recently, I will do the code cleanup when he has time to merge PRs.


For reviewers: please do not merge this PR directly, since it contains too many unrelated changes. Just ping me and I will split it into the proper pieces.


Note that the code is pretty ugly, messy, and hacky. I will clean up the code later if you think the PR looks acceptable. The real (not yet cleaned) code is only in internode_ll.cu; this PR also contains a lot of unrelated code from other PRs.

WARN: I may have gotten something wrong since I am pretty tired right now. I will recheck everything later when I have some time, and also do further optimizations.

I personally care about the combine-recv kernel separately from the combine-send kernel, because I try to overlap the latter with the GEMM, while the former (for simplicity) may be executed directly.
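
To make the overlap intent concrete, here is a minimal self-contained host-side sketch (not DeepEP's actual API; the three stub kernels, stream setup, and sizes are purely illustrative): combine-send is issued on a communication stream so the GEMM can run concurrently on the compute stream, and combine-recv is launched by itself once the send has been ordered.

```cuda
#include <cuda_runtime.h>

// Trivial stand-in kernels; the real send/recv kernels live in internode_ll.cu.
__global__ void combine_send_stub(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;   // pretend: push tokens to remote ranks
}
__global__ void gemm_stub(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   // pretend: the GEMM being overlapped
}
__global__ void combine_recv_stub(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] -= 1.0f;   // pretend: reduce received tokens
}

int main() {
    const int n = 1 << 20;
    float *send_buf, *gemm_buf;
    cudaMalloc(&send_buf, n * sizeof(float));
    cudaMalloc(&gemm_buf, n * sizeof(float));

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);

    // combine-send and the GEMM overlap on different streams ...
    combine_send_stub<<<(n + 255) / 256, 256, 0, comm_stream>>>(send_buf, n);
    gemm_stub<<<(n + 255) / 256, 256, 0, compute_stream>>>(gemm_buf, n);

    // ... and combine-recv is executed directly once the send has been ordered.
    cudaEvent_t send_done;
    cudaEventCreate(&send_done);
    cudaEventRecord(send_done, comm_stream);
    cudaStreamWaitEvent(compute_stream, send_done, 0);
    combine_recv_stub<<<(n + 255) / 256, 256, 0, compute_stream>>>(send_buf, n);

    cudaDeviceSynchronize();
    cudaFree(send_buf);
    cudaFree(gemm_buf);
    cudaStreamDestroy(comm_stream);
    cudaStreamDestroy(compute_stream);
    return 0;
}
```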

before: 47-60 us (combine-recv time)

[rank 1] Dispatch + combine bandwidth: 509.39 GB/s, avg_t=261.84 us, min_t=257.63 us, max_t=267.97 us
[rank 3] Dispatch + combine bandwidth: 508.68 GB/s, avg_t=262.20 us, min_t=257.98 us, max_t=266.98 us
[rank 2] Dispatch + combine bandwidth: 510.61 GB/s, avg_t=261.21 us, min_t=256.29 us, max_t=267.87 us
[rank 0] Dispatch + combine bandwidth: 510.74 GB/s, avg_t=261.15 us, min_t=256.77 us, max_t=267.07 us
[rank 3] Dispatch bandwidth: 498.46 GB/s, avg_t=91.16 us | Combine bandwidth: 545.72 GB/s, avg_t=161.14 us
[rank 1] Dispatch bandwidth: 515.51 GB/s, avg_t=88.15 us | Combine bandwidth: 534.73 GB/s, avg_t=164.45 us
[rank 2] Dispatch bandwidth: 505.06 GB/s, avg_t=89.97 us | Combine bandwidth: 539.39 GB/s, avg_t=163.03 us
[rank 0] Dispatch bandwidth: 510.75 GB/s, avg_t=88.97 us | Combine bandwidth: 536.92 GB/s, avg_t=163.78 us
[rank 0] Dispatch send/recv time: 87.58 = 67.97 + 19.61 us | Combine send/recv time: 163.52 = 103.50 + 60.02 us
[rank 1] Dispatch send/recv time: 86.07 = 67.26 + 18.81 us | Combine send/recv time: 153.79 = 100.97 + 52.82 us
[rank 3] Dispatch send/recv time: 89.06 = 69.99 + 19.07 us | Combine send/recv time: 147.44 = 99.63 + 47.81 us
[rank 2] Dispatch send/recv time: 87.23 = 69.46 + 17.77 us | Combine send/recv time: 160.71 = 101.15 + 59.56 us

(old) after: 25-30 us (combine-recv time)

[rank 2] Dispatch + combine bandwidth: 565.85 GB/s, avg_t=235.71 us, min_t=231.68 us, max_t=238.85 us
[rank 0] Dispatch + combine bandwidth: 566.71 GB/s, avg_t=235.35 us, min_t=231.04 us, max_t=242.14 us
[rank 1] Dispatch + combine bandwidth: 567.44 GB/s, avg_t=235.05 us, min_t=228.22 us, max_t=243.97 us
[rank 3] Dispatch + combine bandwidth: 565.72 GB/s, avg_t=235.77 us, min_t=231.33 us, max_t=238.72 us
[rank 0] Dispatch bandwidth: 510.00 GB/s, avg_t=89.10 us | Combine bandwidth: 633.10 GB/s, avg_t=138.90 us
[rank 2] Dispatch bandwidth: 508.28 GB/s, avg_t=89.40 us | Combine bandwidth: 634.00 GB/s, avg_t=138.70 us
[rank 3] Dispatch bandwidth: 503.92 GB/s, avg_t=90.17 us | Combine bandwidth: 639.68 GB/s, avg_t=137.47 us
[rank 1] Dispatch bandwidth: 508.36 GB/s, avg_t=89.39 us | Combine bandwidth: 637.05 GB/s, avg_t=138.04 us
[rank 1] Dispatch send/recv time: 88.20 = 69.39 + 18.81 us | Combine send/recv time: 126.42 = 101.29 + 25.13 us
[rank 2] Dispatch send/recv time: 87.61 = 69.31 + 18.29 us | Combine send/recv time: 132.13 = 101.60 + 30.52 us
[rank 3] Dispatch send/recv time: 88.29 = 69.07 + 19.23 us | Combine send/recv time: 125.68 = 99.55 + 26.13 us
[rank 0] Dispatch send/recv time: 86.82 = 67.39 + 19.43 us | Combine send/recv time: 133.65 = 102.91 + 30.73 us

(old) after: 21-26 us (combine-recv time)

[rank 0] Dispatch + combine bandwidth: 560.83 GB/s, avg_t=237.82 us, min_t=233.95 us, max_t=242.56 us
[rank 2] Dispatch + combine bandwidth: 560.74 GB/s, avg_t=237.86 us, min_t=232.93 us, max_t=241.82 us
[rank 1] Dispatch + combine bandwidth: 559.85 GB/s, avg_t=238.24 us, min_t=232.26 us, max_t=242.37 us
[rank 3] Dispatch + combine bandwidth: 562.06 GB/s, avg_t=237.30 us, min_t=232.45 us, max_t=242.88 us
[rank 3] Dispatch bandwidth: 501.37 GB/s, avg_t=90.63 us | Combine bandwidth: 674.98 GB/s, avg_t=130.28 us
[rank 2] Dispatch bandwidth: 506.90 GB/s, avg_t=89.64 us | Combine bandwidth: 663.74 GB/s, avg_t=132.49 us
[rank 0] Dispatch bandwidth: 515.21 GB/s, avg_t=88.20 us | Combine bandwidth: 659.64 GB/s, avg_t=133.31 us
[rank 1] Dispatch bandwidth: 511.92 GB/s, avg_t=88.77 us | Combine bandwidth: 667.82 GB/s, avg_t=131.68 us
[rank 1] Dispatch send/recv time: 87.61 = 68.74 + 18.86 us | Combine send/recv time: 124.43 = 100.87 + 23.56 us
[rank 2] Dispatch send/recv time: 87.67 = 18.31 + 69.80 us | Combine send/recv time: 126.61 = 101.33 + 25.28 us
[rank 0] Dispatch send/recv time: 87.60 = 67.58 + 20.02 us | Combine send/recv time: 129.24 = 103.40 + 25.83 us
[rank 3] Dispatch send/recv time: 88.53 = 69.08 + 19.45 us | Combine send/recv time: 119.88 = 98.93 + 20.95 us

after: 18.9-19.4 us (combine-recv time)

[rank 2] Dispatch + combine bandwidth: 579.30 GB/s, avg_t=230.24 us, min_t=224.48 us, max_t=235.04 us
[rank 1] Dispatch + combine bandwidth: 577.42 GB/s, avg_t=230.99 us, min_t=225.92 us, max_t=238.27 us
[rank 0] Dispatch + combine bandwidth: 577.39 GB/s, avg_t=231.00 us, min_t=226.11 us, max_t=234.88 us
[rank 3] Dispatch + combine bandwidth: 579.84 GB/s, avg_t=230.02 us, min_t=224.70 us, max_t=237.82 us
[rank 0] Dispatch bandwidth: 516.30 GB/s, avg_t=88.01 us | Combine bandwidth: 697.62 GB/s, avg_t=126.05 us
[rank 2] Dispatch bandwidth: 508.33 GB/s, avg_t=89.39 us | Combine bandwidth: 703.44 GB/s, avg_t=125.01 us
[rank 1] Dispatch bandwidth: 512.04 GB/s, avg_t=88.74 us | Combine bandwidth: 703.73 GB/s, avg_t=124.96 us
[rank 3] Dispatch bandwidth: 505.60 GB/s, avg_t=89.88 us | Combine bandwidth: 702.96 GB/s, avg_t=125.10 us
[rank 0] Dispatch send/recv time: 85.99 = 66.52 + 19.47 us | Combine send/recv time: 121.71 = 102.72 + 19.00 us
[rank 1] Dispatch send/recv time: 87.27 = 68.54 + 18.73 us | Combine send/recv time: 120.28 = 101.04 + 19.24 us
[rank 2] Dispatch send/recv time: 87.15 = 68.95 + 18.20 us | Combine send/recv time: 119.64 = 100.71 + 18.94 us
[rank 3] Dispatch send/recv time: 87.94 = 68.78 + 19.17 us | Combine send/recv time: 119.07 = 99.69 + 19.37 us

after: 17.5-17.7 us (combine-recv time), 3.04x speedup

[rank 1] Dispatch + combine bandwidth: 591.47 GB/s, avg_t=225.50 us, min_t=218.24 us, max_t=230.11 us
[rank 0] Dispatch + combine bandwidth: 591.01 GB/s, avg_t=225.68 us, min_t=219.97 us, max_t=230.59 us
[rank 2] Dispatch + combine bandwidth: 589.63 GB/s, avg_t=226.20 us, min_t=219.04 us, max_t=233.57 us
[rank 3] Dispatch + combine bandwidth: 588.74 GB/s, avg_t=226.55 us, min_t=221.34 us, max_t=232.29 us
[rank 3] Dispatch bandwidth: 501.61 GB/s, avg_t=90.59 us | Combine bandwidth: 739.42 GB/s, avg_t=118.93 us
[rank 0] Dispatch bandwidth: 502.56 GB/s, avg_t=90.42 us | Combine bandwidth: 741.52 GB/s, avg_t=118.59 us
[rank 2] Dispatch bandwidth: 510.22 GB/s, avg_t=89.06 us | Combine bandwidth: 727.57 GB/s, avg_t=120.86 us
[rank 1] Dispatch bandwidth: 516.93 GB/s, avg_t=87.91 us | Combine bandwidth: 723.02 GB/s, avg_t=121.62 us
[rank 1] Dispatch send/recv time: 86.35 = 67.66 + 18.69 us | Combine send/recv time: 118.72 = 101.16 + 17.56 us
[rank 0] Dispatch send/recv time: 88.51 = 68.58 + 19.92 us | Combine send/recv time: 118.35 = 100.85 + 17.50 us
[rank 2] Dispatch send/recv time: 87.44 = 68.79 + 18.66 us | Combine send/recv time: 118.51 = 100.82 + 17.68 us
[rank 3] Dispatch send/recv time: 88.14 = 69.25 + 18.89 us | Combine send/recv time: 117.99 = 100.31 + 17.68 us


@fzyzcjy commented Jul 5, 2025

Brainstorm: maybe we can carry information during dispatch, such that when doing combine we have a buffer of shape (num_tokens, num_topk, hidden) and send data directly into it. Then we would not need to read topk_idx, which saves time and also a bit of memory.
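
A rough CUDA sketch of the proposed layout (hypothetical kernel and buffer names, not this PR's code; top-k weights are omitted for brevity): if dispatch already carried the (token, k) coordinate, combine could write each contribution straight into recv_buf[token][k][:], and the recv side would just sum contiguously over the top-k slots without ever reading topk_idx.

```cuda
// Illustration only: one block per token, contiguous reduction over top-k slots.
__global__ void combine_recv_direct(const float* __restrict__ recv_buf, // [num_tokens, num_topk, hidden]
                                    float* __restrict__ out,            // [num_tokens, hidden]
                                    int num_topk, int hidden) {
    const int token = blockIdx.x;
    for (int h = threadIdx.x; h < hidden; h += blockDim.x) {
        float acc = 0.0f;
        // Senders already placed their data at slot (token, k), so this is a
        // plain strided sum instead of a gather through topk_idx.
        for (int k = 0; k < num_topk; ++k)
            acc += recv_buf[(token * num_topk + k) * hidden + h];
        out[token * hidden + h] = acc;
    }
}
```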

I am sharing this first instead of implementing it because I want to know whether it conflicts with any ongoing changes.

EDIT: By the way, another minor update is to fully use 1024 threads so that the number of waves is reduced by one.
