Bump vllm from 0.9.2 to 0.10.0 #284

dependabot · 2025-07-25T22:14:57Z

Bumps vllm from 0.9.2 to 0.10.0.

Release notes

v0.10.0

Highlights

v0.10.0 release includes 308 commits, 168 contributors (62 new!).

NOTE: This release begins the cleanup of V0 engine codebase. We have removed V0 CPU/XPU/TPU/HPU backends (#20412), long context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far and plan to continued to delete code that is no longer used.

Model Support

New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).

Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).

Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).

VLM improvements: VLM support with transformers backend (#20543), PrithviMAE on V1 engine (#20577).

Engine Core

Experimental async scheduling --async-scheduling flag to overlap engine core scheduling with GPU runner (#19970).

V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).

Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).

RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).

Enhanced caching: multi-modal caching for transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511).

Startup time reduction via CUDA graph capture speedup via frozen GC (#21146).

Elastic expert parallel for dynamic GPU scaling while preserving state (#20775).

Hardwares & Performance

NVIDIA Blackwell/SM100 optimizations: CUTLASS block scaled group GEMM for smaller batches (#20640), FP8 groupGEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), CUDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).

Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).

Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

New quantization support: MXFP4 for MoE models (#17888), BNB support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061).

Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).

Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).

API & Frontend

OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629).

New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).

Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).

CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).

Dependencies

Updated PyTorch to 2.7.1 for CUDA (#21011)

FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed

[Docs] Note that alternative structured output backends are supported by @russellb in vllm-project/vllm#19426

[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in vllm-project/vllm#19440

[Model] use AutoWeightsLoader for commandr by @py-andy-c in vllm-project/vllm#19399

Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in vllm-project/vllm#19401

[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in vllm-project/vllm#19390

[New Model]: Support Qwen3 Embedding & Reranker by @noooop in vllm-project/vllm#19260

[BugFix] Fix docker build cpu-dev image error by @2niuhe in vllm-project/vllm#19394

Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in vllm-project/vllm#19451

... (truncated)

Commits

6d8d0a2 Add think chunk (#21333)
11ef7a6 [BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses (#21211)
dc2f159 Dump input metadata on crash for async scheduling (#21258)
d5b981f [DP] Internal Load Balancing Per Node [one-pod-per-node] (#21238)
eec6942 [BugFix] Fix KVConnector TP worker aggregation (#21473)
fd48d99 [BugFix]: Batch generation from prompt_embeds fails for long prompts (#21390)
f8c15c4 [Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process (#21437)
aa08a95 [Bugfix] Fix casing warning (#21468)
13e4ee1 [XPU][UT] increase intel xpu CI test scope (#21492)
772ce5a [Misc] Add dummy maverick test to CI (#21324)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [vllm](https://github.com/vllm-project/vllm) from 0.9.2 to 0.10.0. - [Release notes](https://github.com/vllm-project/vllm/releases) - [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md) - [Commits](vllm-project/vllm@v0.9.2...v0.10.0) --- updated-dependencies: - dependency-name: vllm dependency-version: 0.10.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]>

dependabot bot added dependencies Pull requests that update a dependency file python Pull requests that update Python code labels Jul 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bump vllm from 0.9.2 to 0.10.0 #284

Bump vllm from 0.9.2 to 0.10.0 #284

Uh oh!

dependabot bot commented on behalf of github Jul 25, 2025

Uh oh!

Uh oh!

Bump vllm from 0.9.2 to 0.10.0 #284

Are you sure you want to change the base?

Bump vllm from 0.9.2 to 0.10.0 #284

Uh oh!

Conversation

dependabot bot commented on behalf of github Jul 25, 2025

v0.10.0

Highlights

Model Support

Engine Core

Hardwares & Performance

Quantization

API & Frontend

Dependencies

What's Changed

Uh oh!

Uh oh!