Skip to content

Add support for sliding window via ring kv cache #81

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 25, 2025

Conversation

kimishpatel
Copy link
Collaborator

Summary:
This diffs adds

  • Ring kv cache for local attention where ring buffer updates cache with new entries in a ring fashion rather than sliding cache
  • Correspondingly we have to update sdpa to accept mask that correspond to the state of ring buffer because slot in ring buffer may correspond to different token positions. Thus we cannot just use causal mask.

Now custom_kv_cache option means we support both global attention and local/global like gemma3

Test Plan:
Test added

Reviewers:

Subscribers:

Tasks:

Tags:

@guangy10 guangy10 self-requested a review June 10, 2025 20:13
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kimishpatel
Copy link
Collaborator Author

@guangy10 I think I am gonna have to upgrade executorch version to make CustomRingKVCache available

@guangy10
Copy link
Collaborator

@guangy10 I think I am gonna have to upgrade executorch version to make CustomRingKVCache available

ET nightly build was broken since 5/27 last time I checked. But I think it was fixed last night. I can check.

@kimishpatel
Copy link
Collaborator Author

@guangy10 I think I am gonna have to upgrade executorch version to make CustomRingKVCache available

ET nightly build was broken since 5/27 last time I checked. But I think it was fixed last night. I can check.

ok gated ci on et version. Please review

@guangy10
Copy link
Collaborator

Pretty much all gemma3 tests are failing export. Does this change require bumping transformers? I suspect if that's the reason of the failures in CI

@kimishpatel
Copy link
Collaborator Author

Pretty much all gemma3 tests are failing export. Does this change require bumping transformers? I suspect if that's the reason of the failures in CI

Let me try to repro the failure

@kimishpatel kimishpatel force-pushed the add_ring_kv_cache branch 4 times, most recently from fe38f14 to 411885d Compare June 18, 2025 20:50
_custom_sdpa_for_ring_kv_cache = get_custom_sdpa_for_ring_kv_cache(exportable_module)
AttentionInterface.register("custom_sdpa_ring_kv_cache", _custom_sdpa_for_ring_kv_cache)
AttentionMaskInterface.register("custom_sdpa_ring_kv_cache", sdpa_mask_without_vmap)
exportable_module.model.model.config._attn_implementation = "custom_sdpa_ring_kv_cache"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the reason why it works

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kimishpatel Maybe we should abort #86 into this PR so this PR will be testable in the CI?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow why would this line "fixes" the quality issue in other models like smollm that are not using ring_kv_cache. Do you mind elaborate a bit?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that in Cyril's vision the static cache will be a special case of hybrid cache w/o sliding window cache then all models can technically work with hybrid cache then. Is it something you are referring here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kimishpatel Can you absorb #86 into this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow why would this line "fixes" the quality issue in other models like smollm that are not using ring_kv_cache. Do you mind elaborate a bit?

Mainly because it is not gemma3/local attention specific. custom_sdpa_ring_kv_cache handles both global and sliding window attention

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that in Cyril's vision the static cache will be a special case of hybrid cache w/o sliding window cache then all models can technically work with hybrid cache then. Is it something you are referring here?

yeah kind of

Summary:
This diffs adds
- Ring kv cache for local attention where ring buffer updates cache
  with new entries in a ring fashion rather than sliding cache
- Correspondingly we have to update sdpa to accept mask that correspond
  to the state of ring buffer because slot in ring buffer may correspond
  to different token positions. Thus we cannot just use causal mask.

Now custom_kv_cache option means we support both global attention and
local/global like gemma3

Test Plan:
Test added

Reviewers:

Subscribers:

Tasks:

Tags:
@kimishpatel kimishpatel force-pushed the add_ring_kv_cache branch 3 times, most recently from c3731b2 to b4bf11a Compare June 25, 2025 01:00
@guangy10
Copy link
Collaborator

All look good. @kimishpatel We can get it merged if no more changes needed

@kimishpatel
Copy link
Collaborator Author

there was one change that i had to push to make it work when attn_implementation is used for custom_sdpa

@guangy10
Copy link
Collaborator

guangy10 commented Jun 25, 2025

Oh looks like you forget update the pinned commit for transformers. Without it, some gemma-3 tests won't run in the CI

Summary:
Latest changse in 4.53 requires that custom attentin functions have
corresponding mask generation

Test Plan:
CI wont pass without changes on top of 4.53

Reviewers:

Subscribers:

Tasks:

Tags:
@guangy10 guangy10 merged commit c2b20c9 into huggingface:main Jun 25, 2025
107 of 111 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants