-
Notifications
You must be signed in to change notification settings - Fork 20
Add support for sliding window via ring kv cache #81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@guangy10 I think I am gonna have to upgrade executorch version to make CustomRingKVCache available |
ET nightly build was broken since 5/27 last time I checked. But I think it was fixed last night. I can check. |
2b8b948
to
f4f4bda
Compare
ok gated ci on et version. Please review |
13fa4bc
to
9d85344
Compare
Pretty much all gemma3 tests are failing export. Does this change require bumping |
Let me try to repro the failure |
fe38f14
to
411885d
Compare
_custom_sdpa_for_ring_kv_cache = get_custom_sdpa_for_ring_kv_cache(exportable_module) | ||
AttentionInterface.register("custom_sdpa_ring_kv_cache", _custom_sdpa_for_ring_kv_cache) | ||
AttentionMaskInterface.register("custom_sdpa_ring_kv_cache", sdpa_mask_without_vmap) | ||
exportable_module.model.model.config._attn_implementation = "custom_sdpa_ring_kv_cache" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be the reason why it works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kimishpatel Maybe we should abort #86 into this PR so this PR will be testable in the CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite follow why would this line "fixes" the quality issue in other models like smollm that are not using ring_kv_cache. Do you mind elaborate a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Cyril's vision the static cache will be a special case of hybrid cache w/o sliding window cache then all models can technically work with hybrid cache then. Is it something you are referring here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kimishpatel Can you absorb #86 into this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite follow why would this line "fixes" the quality issue in other models like smollm that are not using ring_kv_cache. Do you mind elaborate a bit?
Mainly because it is not gemma3/local attention specific. custom_sdpa_ring_kv_cache handles both global and sliding window attention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Cyril's vision the static cache will be a special case of hybrid cache w/o sliding window cache then all models can technically work with hybrid cache then. Is it something you are referring here?
yeah kind of
Summary: This diffs adds - Ring kv cache for local attention where ring buffer updates cache with new entries in a ring fashion rather than sliding cache - Correspondingly we have to update sdpa to accept mask that correspond to the state of ring buffer because slot in ring buffer may correspond to different token positions. Thus we cannot just use causal mask. Now custom_kv_cache option means we support both global attention and local/global like gemma3 Test Plan: Test added Reviewers: Subscribers: Tasks: Tags:
c3731b2
to
b4bf11a
Compare
All look good. @kimishpatel We can get it merged if no more changes needed |
b4bf11a
to
7379c23
Compare
there was one change that i had to push to make it work when attn_implementation is used for custom_sdpa |
Oh looks like you forget update the pinned commit for transformers. Without it, some gemma-3 tests won't run in the CI |
7379c23
to
a79ab33
Compare
Summary: Latest changse in 4.53 requires that custom attentin functions have corresponding mask generation Test Plan: CI wont pass without changes on top of 4.53 Reviewers: Subscribers: Tasks: Tags:
a79ab33
to
918b247
Compare
Summary:
This diffs adds
Now custom_kv_cache option means we support both global attention and local/global like gemma3
Test Plan:
Test added
Reviewers:
Subscribers:
Tasks:
Tags: