Enable prefill for running CausalLM using ET runtime #73

Open · guangy10 wants to merge 10 commits into main from prefill

Conversation

@guangy10 (Collaborator) commented on Jun 4, 2025:

Summary

Add support for using the ExecuTorch runtime to prefill prompt tokens. This PR enables prefill via the Hugging Face Python API.
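For illustration only, here is a minimal sketch (not this PR's actual implementation) of the difference between prefill and per-token prompt processing. `et_forward` is a hypothetical callable standing in for the ExecuTorch "forward" method of the loaded PTE; greedy sampling is used to keep the sketch short.

```python
import torch

def generate_with_prefill(et_forward, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: feed the whole prompt in one call. The KV cache is filled for
    # positions [0, prompt_len) in a single forward pass instead of prompt_len passes.
    prompt_len = prompt_ids.shape[-1]
    cache_position = torch.arange(prompt_len, dtype=torch.long)
    logits = et_forward(prompt_ids, cache_position)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: one token per call, advancing the cache position each step.
    for step in range(1, max_new_tokens):
        cache_position = torch.tensor([prompt_len + step - 1], dtype=torch.long)
        logits = et_forward(next_token, cache_position)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=-1)
```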

Size comparison:

[  96]  qwen3_no_prefill
└── [914M]  model.pte
[  96]  qwen3_prefill
└── [914M]  model.pte

There is no significant impact on PTE file size, given the seq_len upper bound of 128 used in this test.

Perf comparison:

time_to_first_token is over 20x faster (15.547s with sequential prompt processing vs. 0.658s with prefill). See details in the tests below:

1. Generation with prefill enabled
The PTE is exported with a dynamic seq_len dimension for both input_ids and cache_position.
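As a rough sketch of how such a dynamic seq_len dimension can be declared at export time (the 128 upper bound mirrors this test; TinyWrapper and its shapes are placeholders, not this PR's actual export code):

```python
import torch
from torch.export import Dim, export

class TinyWrapper(torch.nn.Module):
    """Placeholder for the CausalLM-with-static-cache wrapper that gets exported."""
    def __init__(self, vocab_size: int = 1024, hidden: int = 16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, input_ids: torch.Tensor, cache_position: torch.Tensor) -> torch.Tensor:
        # A real wrapper would index its static KV cache with cache_position;
        # here it only adds a trivial positional offset so the input is used.
        hidden = self.embed(input_ids) + cache_position.to(torch.float32).unsqueeze(-1)
        return self.head(hidden)

seq_len = Dim("seq_len", min=1, max=128)  # matches the upper bound used in this test
example_args = (torch.zeros((1, 8), dtype=torch.long), torch.arange(8, dtype=torch.long))

exported = export(
    TinyWrapper(),
    example_args,
    dynamic_shapes={
        "input_ids": {1: seq_len},       # [batch, seq_len]
        "cache_position": {0: seq_len},  # [seq_len]
    },
)
```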

Time to first generated token: 0.658000s

Model loaded from qwen3_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462921337, "token_encode_end_ms": 1750462921338, "model_execution_start_ms": 1750462921836, "model_execution_end_ms": 1750462921995, "inference_end_ms": 1750462921995, "prompt_eval_end_ms": 1750462921834, "first_token_ms": 1750462921995, "aggregate_sampling_time_ms": 655, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		0.658000 (seconds)		 Rate: 	3.039514 (tokens/second)
		Prompt evaluation:	0.497000 (seconds)		 Rate: 	197.183099 (tokens/second)
		Generated 2 tokens:	0.161000 (seconds)		 Rate: 	12.422360 (tokens/second)
	Time to first generated token:	0.658000 (seconds)
	Sampling time over 100 tokens:	0.655000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. The.

2. Generation WITHOUT prefill
This run uses a PTE exported with static shapes, so the prompt tokens are processed sequentially, one forward call per token.

Time to first generated token: 15.547000s

Model loaded from qwen3_no_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462984000, "token_encode_end_ms": 1750462984001, "model_execution_start_ms": 1750462999392, "model_execution_end_ms": 1750462999547, "inference_end_ms": 1750462999548, "prompt_eval_end_ms": 1750462999392, "first_token_ms": 1750462999547, "aggregate_sampling_time_ms": 15546, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		15.548000 (seconds)		 Rate: 	0.128634 (tokens/second)
		Prompt evaluation:	15.392000 (seconds)		 Rate: 	6.366944 (tokens/second)
		Generated 2 tokens:	0.156000 (seconds)		 Rate: 	12.820513 (tokens/second)
	Time to first generated token:	15.547000 (seconds)
	Sampling time over 100 tokens:	15.546000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. review review

3. Generation WITHOUT prefill (backward compatibility)
It's important to ensure backward compatibility: a PTE exported with dynamic shapes can still be loaded and run with the old code; it simply degrades to sequential prompt processing, as in case 2.
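For illustration, a sketch of that legacy path (using the same hypothetical `et_forward` callable as in the summary above): because the dynamic-shape PTE still accepts seq_len == 1, old callers that feed one token per call keep working, just without the prefill speedup.

```python
import torch

def process_prompt_sequentially(et_forward, prompt_ids: torch.Tensor) -> torch.Tensor:
    # Old code path: one forward call per prompt token, so time-to-first-token
    # grows linearly with prompt length (~15.5s for the 98-token prompt above).
    logits = None
    for pos in range(prompt_ids.shape[-1]):
        token = prompt_ids[:, pos : pos + 1]                   # shape [1, 1]
        cache_position = torch.tensor([pos], dtype=torch.long)
        logits = et_forward(token, cache_position)
    return logits  # logits for the last prompt token
```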

Time to first generated token: 15.818000s

Model loaded from qwen3_prefill/model.pte

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 98, "generated_tokens": 2, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1750462939243, "token_encode_end_ms": 1750462939244, "model_execution_start_ms": 1750462954901, "model_execution_end_ms": 1750462955061, "inference_end_ms": 1750462955062, "prompt_eval_end_ms": 1750462954901, "first_token_ms": 1750462955061, "aggregate_sampling_time_ms": 15817, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 98 Generated Tokens: 2
	Model Load Time:		0.000000 (seconds)
	Total inference time:		15.819000 (seconds)		 Rate: 	0.126430 (tokens/second)
		Prompt evaluation:	15.658000 (seconds)		 Rate: 	6.258781 (tokens/second)
		Generated 2 tokens:	0.161000 (seconds)		 Rate: 	12.422360 (tokens/second)
	Time to first generated token:	15.818000 (seconds)
	Sampling time over 100 tokens:	15.817000 (seconds)
The Supreme Court is being asked to decide whether the PTAB can invalidate expired patents through inter partes review, in a case that could limit the reach of the landmark Oil States decision that found IPRs constitutionally permissive. In Apple Inc. v. Gesture Technology Partners, the Federal Circuit ruled that PTAB retains jurisdiction over expired patents, but the patentee argues in its petition that once patents expire, they become purely private property rights that require traditional court adjudication rather than administrative review. review review

@guangy10 (Collaborator, Author) commented on Jun 4, 2025:

cc: @kimishpatel

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

guangy10 force-pushed the prefill branch 2 times, most recently from 5f98c87 to f735b5f on June 9, 2025 at 22:21
guangy10 marked this pull request as ready for review on June 13, 2025 at 18:48
guangy10 force-pushed the prefill branch 2 times, most recently from 2cdae1e to c308c15 on June 18, 2025 at 17:57
guangy10 requested a review from kimishpatel on June 20, 2025 at 23:56
@guangy10 (Collaborator, Author) commented:

The kernel fix from ExecuTorch has been integrated in #89. Rebased the PR and updated the summary.

@guangy10 (Collaborator, Author) commented:

Some of the failures are due to quality degradation from the transformers pin bump. We will probably need #81 to fix the quality regression when using custom_sdpa_kv_cache.
