Enable prefill for running CausalLM using ET runtime #73

Open · guangy10 wants to merge 10 commits into main from prefill

Conversation

@guangy10 (Collaborator) commented on Jun 4, 2025:

Summary

Add support for using the ExecuTorch runtime to prefill prompt tokens. This PR enables prefill via the Hugging Face Python API.
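For illustration only, here is a minimal sketch (not this PR's actual implementation) of the difference between prefill and per-token prompt processing. `et_forward` is a hypothetical callable standing in for the ExecuTorch "forward" method of the loaded PTE; greedy sampling is used to keep the sketch short.

```python
import torch

def generate_with_prefill(et_forward, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: feed the whole prompt in one call. The KV cache is filled for
    # positions [0, prompt_len) in a single forward pass instead of prompt_len passes.
    prompt_len = prompt_ids.shape[-1]
    cache_position = torch.arange(prompt_len, dtype=torch.long)
    logits = et_forward(prompt_ids, cache_position)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: one token per call, advancing the cache position each step.
    for step in range(1, max_new_tokens):
        cache_position = torch.tensor([prompt_len + step - 1], dtype=torch.long)
        logits = et_forward(next_token, cache_position)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=-1)
```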

Size comparison:

[  96]  qwen3_no_prefill
└── [914M]  model.pte
[  96]  qwen3_prefill
└── [914M]  model.pte

There is no significant impact on PTE file size, given the seq_len upper bound of 128 used in this test.

Perf comparison:

time_to_first_token is over 20x faster (15.547s with sequential prompt processing vs. 0.658s with prefill). See details in the tests below:

1. Generation with prefill enabled
The PTE is exported with a dynamic seq_len dimension for both input_ids and cache_position.
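As a rough sketch of how such a dynamic seq_len dimension can be declared at export time (the 128 upper bound mirrors this test; TinyWrapper and its shapes are placeholders, not this PR's actual export code):

```python
import torch
from torch.export import Dim, export

class TinyWrapper(torch.nn.Module):
    """Placeholder for the CausalLM-with-static-cache wrapper that gets exported."""
    def __init__(self, vocab_size: int = 1024, hidden: int = 16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, input_ids: torch.Tensor, cache_position: torch.Tensor) -> torch.Tensor:
        # A real wrapper would index its static KV cache with cache_position;
        # here it only adds a trivial positional offset so the input is used.
        hidden = self.embed(input_ids) + cache_position.to(torch.float32).unsqueeze(-1)
        return self.head(hidden)

seq_len = Dim("seq_len", min=1, max=128)  # matches the upper bound used in this test
example_args = (torch.zeros((1, 8), dtype=torch.long), torch.arange(8, dtype=torch.long))

exported = export(
    TinyWrapper(),
    example_args,
    dynamic_shapes={
        "input_ids": {1: seq_len},       # [batch, seq_len]
        "cache_position": {0: seq_len},  # [seq_len]
    },
)
```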

Time to first generated token: 0.658000s

Model loaded from qwen3_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462921337, "token_encode_end_ms": 1750462921338, "model_execution_start_ms": 1750462921836, "model_execution_end_ms": 1750462921995, "inference_end_ms": 1750462921995, "prompt_eval_end_ms": 1750462921834, "first_token_ms": 1750462921995, "aggregate_sampling_time_ms": 655, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		0.658000 (seconds)		 Rate: 	3.039514 (tokens/second)
		Prompt evaluation:	0.497000 (seconds)		 Rate: 	197.183099 (tokens/second)
		Generated 2 tokens:	0.161000 (seconds)		 Rate: 	12.422360 (tokens/second)
	Time to first generated token:	0.658000 (seconds)
	Sampling time over 100 tokens:	0.655000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. The.

2. Generation WITHOUT prefill
This run uses a PTE exported with static shapes, so the prompt tokens are processed sequentially, one forward call per token.

Time to first generated token: 15.547000s

Model loaded from qwen3_no_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462984000, "token_encode_end_ms": 1750462984001, "model_execution_start_ms": 1750462999392, "model_execution_end_ms": 1750462999547, "inference_end_ms": 1750462999548, "prompt_eval_end_ms": 1750462999392, "first_token_ms": 1750462999547, "aggregate_sampling_time_ms": 15546, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		15.548000 (seconds)		 Rate: 	0.128634 (tokens/second)
		Prompt evaluation:	15.392000 (seconds)		 Rate: 	6.366944 (tokens/second)
		Generated 2 tokens:	0.156000 (seconds)		 Rate: 	12.820513 (tokens/second)
	Time to first generated token:	15.547000 (seconds)
	Sampling time over 100 tokens:	15.546000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. review review

3. Generation WITHOUT prefill (backward compatibility)
It's important to ensure backward compatibility: a PTE exported with dynamic shapes can still be loaded and run with the old code; it simply degrades to sequential prompt processing, as in case 2.
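For illustration, a sketch of that legacy path (using the same hypothetical `et_forward` callable as in the summary above): because the dynamic-shape PTE still accepts seq_len == 1, old callers that feed one token per call keep working, just without the prefill speedup.

```python
import torch

def process_prompt_sequentially(et_forward, prompt_ids: torch.Tensor) -> torch.Tensor:
    # Old code path: one forward call per prompt token, so time-to-first-token
    # grows linearly with prompt length (~15.5s for the 98-token prompt above).
    logits = None
    for pos in range(prompt_ids.shape[-1]):
        token = prompt_ids[:, pos : pos + 1]                   # shape [1, 1]
        cache_position = torch.tensor([pos], dtype=torch.long)
        logits = et_forward(token, cache_position)
    return logits  # logits for the last prompt token
```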

Time to first generated token: 15.818000s

Model loaded from qwen3_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462939243, "token_encode_end_ms": 1750462939244, "model_execution_start_ms": 1750462954901, "model_execution_end_ms": 1750462955061, "inference_end_ms": 1750462955062, "prompt_eval_end_ms": 1750462954901, "first_token_ms": 1750462955061, "aggregate_sampling_time_ms": 15817, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		15.819000 (seconds)		 Rate: 	0.126430 (tokens/second)
		Prompt evaluation:	15.658000 (seconds)		 Rate: 	6.258781 (tokens/second)
		Generated 2 tokens:	0.161000 (seconds)		 Rate: 	12.422360 (tokens/second)
	Time to first generated token:	15.818000 (seconds)
	Sampling time over 100 tokens:	15.817000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. review review

@guangy10 (Collaborator, Author) commented on Jun 4, 2025:

cc: @kimishpatel

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

guangy10 force-pushed the prefill branch 2 times, most recently from 5f98c87 to f735b5f on June 9, 2025 at 22:21
guangy10 marked this pull request as ready for review on June 13, 2025 at 18:48
guangy10 force-pushed the prefill branch 2 times, most recently from 2cdae1e to c308c15 on June 18, 2025 at 17:57
guangy10 requested a review from kimishpatel on June 20, 2025 at 23:56
@guangy10 (Collaborator, Author) commented:

The kernel fix from ExecuTorch has been integrated in #89. Rebased the PR and updated the summary.

@guangy10 (Collaborator, Author) commented:

Some of the failures are due to quality degradation from the transformers pin bump. We will probably need #81 to fix the quality regression when using custom_sdpa_kv_cache.
