
Cover custom sdpa + kv cache + quant in CI #75


Merged 1 commit into huggingface:main on Jun 6, 2025

Conversation

@guangy10 (Collaborator) commented Jun 6, 2025

As titled: this shows that all supported optimizations are composable and can be combined to achieve better perf.

@guangy10 guangy10 marked this pull request as ready for review June 6, 2025 01:28
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@guangy10 guangy10 requested a review from echarlaix June 6, 2025 01:28
@guangy10 (Collaborator, Author) commented Jun 6, 2025

@kimishpatel for review

Comment on lines 127 to 136
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ExecuTorchModelForCausalLM.from_pretrained(
model_id,
recipe="xnnpack",
attn_implementation="custom_sdpa",
use_custom_kv_cache=True,
    **{"qlinear": True, "qembedding": True},
)
self.assertIsInstance(model, ExecuTorchModelForCausalLM)
self.assertIsInstance(model.model, ExecuTorchModule)
generated_text = model.text_generation(
tokenizer=tokenizer,
prompt="Simply put, the theory of relativity states that",
max_seq_len=32,
)
logging.info(f"\nGenerated text:\n\t{generated_text}")
generated_tokens = tokenizer(generated_text, return_tensors="pt").input_ids

# Free memory before loading eager for quality check
del model
del tokenizer
gc.collect()
Collaborator:

you can factor a bunch of this code out into a util so only one place needs updating
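The factoring suggested above could look like this sketch. The helper name and signature are hypothetical; the loaders are injected as callables so the helper itself stays agnostic to the optimum-executorch APIs:

```python
from typing import Any, Callable


def generate_and_check(
    model_id: str,
    load_model: Callable[..., Any],
    load_tokenizer: Callable[[str], Any],
    prompt: str = "Simply put, the theory of relativity states that",
    max_seq_len: int = 32,
    **load_kwargs: Any,
) -> str:
    """Load a tokenizer and a model with the given export options,
    run text generation, and return the generated text so the caller
    can run its quality checks in one place."""
    tokenizer = load_tokenizer(model_id)
    model = load_model(model_id, **load_kwargs)
    return model.text_generation(
        tokenizer=tokenizer, prompt=prompt, max_seq_len=max_seq_len
    )
```

Each test would then reduce to one call with its particular combination of `recipe`, `attn_implementation`, `use_custom_kv_cache`, and quantization flags.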

Collaborator (Author):

Yeah, I was thinking of parameterizing these tests when I have time, but it's not a priority; only the coverage in CI matters at the moment, to ensure all the pieces work together. I can file a GitHub task to refactor these tests.
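The parameterization mentioned above could be sketched with `unittest`'s `subTest`, so every combination of optimizations is exercised from one test body. The test class is illustrative; only the option names (`recipe`, `attn_implementation`, `use_custom_kv_cache`, `qlinear`, `qembedding`) come from this PR:

```python
import itertools
import unittest


class OptimizationMatrixTest(unittest.TestCase):
    def test_combinations(self):
        # Sweep custom-KV-cache on/off against quantization on/off;
        # custom SDPA and the xnnpack recipe are held fixed.
        for custom_kv, quantize in itertools.product([False, True], repeat=2):
            with self.subTest(custom_kv=custom_kv, quantize=quantize):
                kwargs = {
                    "recipe": "xnnpack",
                    "attn_implementation": "custom_sdpa",
                    "use_custom_kv_cache": custom_kv,
                }
                if quantize:
                    kwargs.update(qlinear=True, qembedding=True)
                # ... load the model with `kwargs` and run generation ...
                self.assertEqual(kwargs["recipe"], "xnnpack")
```

A failing combination is then reported with its `subTest` parameters instead of failing the whole matrix silently.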

@kimishpatel kimishpatel left a comment


looks good but some refactor would be nice

@guangy10 (Collaborator, Author) commented Jun 6, 2025

Will check the CI failures tomorrow

@guangy10 (Collaborator, Author) commented Jun 6, 2025

@kimishpatel It looks like the custom KV cache only works with a static cache. For Gemma3, the cache must be hybrid, so additional work may be needed before the custom KV cache can be leveraged there. https://github.com/huggingface/optimum-executorch/actions/runs/15480739347/job/43586007691?pr=75#step:5:1855

@guangy10 guangy10 force-pushed the improve_tests branch 2 times, most recently from 11a411d to 2df1009 Compare June 6, 2025 17:28
@guangy10 guangy10 merged commit 1c653dc into huggingface:main Jun 6, 2025
106 of 107 checks passed
@guangy10 guangy10 deleted the improve_tests branch June 6, 2025 18:32