Cover custom sdpa + kv cache + quant in CI #75
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@kimishpatel for review
tests/models/test_modeling_llama.py
Outdated
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id,
    recipe="xnnpack",
    attn_implementation="custom_sdpa",
    use_custom_kv_cache=True,
    **{"qlinear": True, "qembedding": True},
)
self.assertIsInstance(model, ExecuTorchModelForCausalLM)
self.assertIsInstance(model.model, ExecuTorchModule)
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Simply put, the theory of relativity states that",
    max_seq_len=32,
)
logging.info(f"\nGenerated text:\n\t{generated_text}")
generated_tokens = tokenizer(generated_text, return_tensors="pt").input_ids

# Free memory before loading eager for quality check
del model
del tokenizer
gc.collect()
You can factor out a bunch of this code into a util so only one place needs updating (see the sketch below).
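
A minimal sketch of that refactor, assuming a hypothetical tests/models/utils.py; the helper name and signature are illustrative, not something this PR defines:

import gc
import logging

from transformers import AutoTokenizer

from optimum.executorch import ExecuTorchModelForCausalLM


def export_and_generate(model_id, prompt, max_seq_len=32, **export_kwargs):
    """Export model_id to ExecuTorch with the given kwargs and return the generated text."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = ExecuTorchModelForCausalLM.from_pretrained(model_id, **export_kwargs)
    generated_text = model.text_generation(
        tokenizer=tokenizer,
        prompt=prompt,
        max_seq_len=max_seq_len,
    )
    logging.info(f"\nGenerated text:\n\t{generated_text}")
    # Free memory before the caller loads the eager model for a quality check.
    del model
    del tokenizer
    gc.collect()
    return generated_text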
Yeah, I was thinking of parameterizing these tests when I have time, but it's not a priority; only the coverage in CI matters at the moment, to ensure all the pieces work together. I can file a GitHub task to refactor these tests. One possible shape is sketched below.
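
One possible shape for that parameterization, building on the hypothetical export_and_generate helper sketched above; the class name, placeholder model id, and test cases are illustrative only:

import unittest

from parameterized import parameterized


class LlamaExportTest(unittest.TestCase):  # hypothetical test class
    model_id = "model-id-under-test"  # placeholder; substitute the real model id

    @parameterized.expand(
        [
            ("custom_sdpa_only", {"attn_implementation": "custom_sdpa"}),
            (
                "custom_sdpa_kv_cache_quant",
                {
                    "attn_implementation": "custom_sdpa",
                    "use_custom_kv_cache": True,
                    "qlinear": True,
                    "qembedding": True,
                },
            ),
        ]
    )
    def test_text_generation(self, _name, extra_kwargs):
        generated_text = export_and_generate(
            self.model_id,
            "Simply put, the theory of relativity states that",
            max_seq_len=32,
            recipe="xnnpack",
            **extra_kwargs,
        )
        self.assertTrue(generated_text)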
Looks good, but some refactoring would be nice.
Will check the CI failures tomorrow.
@kimishpatel It looks like the custom KV cache only works with a static cache. For gemma3 the cache must be hybrid, so additional work may be needed to leverage the custom KV cache there. A possible interim guard is sketched below. https://github.com/huggingface/optimum-executorch/actions/runs/15480739347/job/43586007691?pr=75#step:5:1855
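
A hedged sketch of an interim guard that skips the custom KV cache for hybrid-cache models such as gemma3; reading cache_implementation off the HF config is an assumption here, not something this PR verifies:

from transformers import AutoConfig

from optimum.executorch import ExecuTorchModelForCausalLM

config = AutoConfig.from_pretrained(model_id)  # model_id assumed in scope
# Hybrid-cache configs (e.g. gemma-style models) are assumed to expose this field.
is_hybrid_cache = getattr(config, "cache_implementation", None) == "hybrid"
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id,
    recipe="xnnpack",
    attn_implementation="custom_sdpa",
    # Only static-cache models can use the custom KV cache for now.
    use_custom_kv_cache=not is_hybrid_cache,
)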
11a411d to 2df1009 (compare)
As titled: this shows that all the supported optimizations are composable and can be combined to achieve better performance.