[Bug] Qwen3-8B GRPO 郭宣伯 vLLM fails to connect #439

@zjie19941113

Description

Affected model

Qwen3-8B

Affected tutorial

Qwen3-8B GRPO fine-tuning and visualization with SwanLab

Tutorial maintainer

郭宣伯

Bug description

GPU:NVIDIA H20
Python 3.10.12

Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0

vLLM retries the connection indefinitely:

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[2025-07-15 10:02:46]config.py: PyTorch version 2.7.0 available.
[2025-07-15 10:02:48,786] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-15 10:02:50 [__init__.py:244] Automatically detected platform cuda.
==((====))==  Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
   \\   /|    NVIDIA H20. Num GPUs = 8. Max memory: 95.005 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /data/team/zhoujie195/models/FastApply-8B-v3.3 with actual GPU utilization = 69.7%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 95.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8096. Num Sequences = 368.
Unsloth: vLLM's KV Cache can use up to 50.8 GB. Also swap space = 6 GB.
INFO 07-15 10:03:06 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 07-15 10:03:06 [config.py:3271] Casting torch.float16 to torch.bfloat16.
INFO 07-15 10:03:06 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8096.
INFO 07-15 10:03:07 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data/team/zhoujie195/models/FastApply-8B-v3.3', speculative_config=None, tokenizer='/data/team/zhoujie195/models/FastApply-8B-v3.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/team/zhoujie195/models/FastApply-8B-v3.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"debug":false,"dce":true,"coordinate_descent_tuning":true,"trace.enabled":false,"trace.graph_diagram":false,"triton.cudagraphs":true,"compile_threads":48,"max_autotune":false,"disable_progress":false,"verbose_progress":true,"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-15 10:03:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f97c81e1210>
INFO 07-15 10:03:07 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-15 10:03:07 [topk_topp_sampler.py:36] FlashInfer version >= 0.2.3 required. Falling back to default sampling implementation.
INFO 07-15 10:03:07 [gpu_model_runner.py:1595] Starting to load model /data/team/zhoujie195/models/FastApply-8B-v3.3...
INFO 07-15 10:03:08 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-15 10:03:08 [cuda.py:227] Using FlashInfer backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:07,  2.46s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.46s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:07<00:02,  2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  1.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00,  2.02s/it]

INFO 07-15 10:03:16 [default_loader.py:272] Loading weights took 8.14 seconds
INFO 07-15 10:03:16 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 07-15 10:03:16 [gpu_model_runner.py:1624] Model loading took 15.4860 GiB and 8.487722 seconds
INFO 07-15 10:03:29 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/19ef34c7b7/rank_0_0 for vLLM's torch.compile
INFO 07-15 10:03:29 [backends.py:472] Dynamo bytecode transform time: 11.88 s
INFO 07-15 10:03:38 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 8.258 s
INFO 07-15 10:03:41 [monitor.py:34] torch.compile takes 11.88 s in total
INFO 07-15 10:03:42 [gpu_worker.py:227] Available KV cache memory: 48.58 GiB
INFO 07-15 10:03:42 [kv_cache_utils.py:715] GPU KV cache size: 353,760 tokens
INFO 07-15 10:03:42 [kv_cache_utils.py:719] Maximum concurrency for 8,096 tokens per request: 43.70x
INFO 07-15 10:04:22 [gpu_model_runner.py:2048] Graph capturing finished in 40 secs, took 1.00 GiB
INFO 07-15 10:04:23 [core.py:171] init engine (profile, create kv cache, warmup model) took 66.09 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth 2025.6.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
[2025-07-15 10:04:31]grpo-demo.py: <|im_start|>system
You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated.<|im_end|>
<|im_start|>user
Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.

<code>ALTER TABLE "ModelVersion" ADD COLUMN "uploadType" "ModelUploadType" NOT NULL DEFAULT 'Created';

UPDATE "ModelVersion" mv
SET "uploadType" = m."uploadType"
FROM "Model" m
WHERE m.id = mv."modelId";

-- rerun the above after push
</code>

<update>ALTER TABLE "ModelVersion" ADD COLUMN "isPublic" BOOLEAN NOT NULL DEFAULT false;

UPDATE "ModelVersion" mv
SET "isPublic" = m."isPublic"
FROM "Model" m
WHERE m.id = mv."modelId";

-- Add index on uploadType column
CREATE INDEX idx_modelversion_uploadtype ON "ModelVersion" ("uploadType");</update>

Provide the complete updated code.<|im_end|>
<|im_start|>assistant
<think>

</think>


[2025-07-15 10:04:31]grpo-demo.py: Max Length = 4235
[2025-07-15 10:04:32]other.py: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2025-07-15 10:04:32]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:34]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:36]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:38]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...

For details, see:
unslothai/unsloth#2962
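
The retry loop comes from TRL's vllm_client.py, which polls the vLLM server's HTTP health endpoint until it answers. A quick way to see whether anything is listening at all is a standalone probe; this is a minimal sketch, assuming TRL's defaults of host 0.0.0.0 and port 8000 (the exact health route can differ between TRL versions):

import requests

# Hypothetical probe, separate from the repro script; host, port and route are assumptions.
url = "http://0.0.0.0:8000/health/"
try:
    resp = requests.get(url, timeout=5)
    print("server reachable, status:", resp.status_code)
except requests.exceptions.ConnectionError:
    print("nothing listening at", url, "- the client will keep retrying")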

Steps to reproduce

import logging
import os
import torch
import time

from util.log import init_log
from reward.merge_code import format_reward, conclusion_reward, matching_reward, hallucination_penalty

OUTPUT_DIR = "output/qwen3-8b-grpo-0714"
DATASET_PATH = 'train.jsonl'
MODEL_PATH = '/llm/Qwen3-8B'
MAX_SEQ_LENGTH = 8096
LORA_RANK = 32

# Initialize logging on the main process only
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    init_log('logs/lr_grpo_{}'.format(time.strftime("%Y-%m-%d", time.localtime())))

from unsloth import FastLanguageModel

# fast_inference=True runs generation through an in-process vLLM engine
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=LORA_RANK,
    gpu_memory_utilization=0.7,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK * 2,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

from datasets import load_dataset

dataset = load_dataset('json', data_files=DATASET_PATH, split='train[:200]')

import re


def extract_answer(text):
    pattern = r'<updated-code>(.*?)</updated-code>'
    match_format = re.compile(
        pattern,
        flags=re.DOTALL
    )
    guess = match_format.search(text)
    if guess is not None:
        return guess.group(1)
    return None


dataset = dataset.map(
    lambda x: {
        "prompt": x['messages'][:2],
        "answer": extract_answer(x['messages'][2]["content"]),
    },
    remove_columns="messages",
)


tokenized = dataset.map(
    lambda x: {"tokens": tokenizer.apply_chat_template(
        x["prompt"], enable_thinking=False, add_generation_prompt=True, tokenize=True)},
    batched=True,
)
logging.info(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L": len(x["tokens"])})

import numpy as np

# Keep only prompts at or below the 90th-percentile token length
maximum_length = int(np.quantile(tokenized["L"], 0.9))
logging.info("Max Length = %s" % maximum_length)

dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized

max_prompt_length = maximum_length + 1  # one token of slack over the measured maximum
max_completion_length = MAX_SEQ_LENGTH - max_prompt_length

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    num_generations=4,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    num_train_epochs=1,
    save_steps=50,
    save_total_limit=3,
    gradient_checkpointing=True,
    report_to="none",
    output_dir=OUTPUT_DIR,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        conclusion_reward,
        matching_reward,
        hallucination_penalty,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

save_path = OUTPUT_DIR + "/lora_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

python grpo.py

Expected behavior

Tried changing the vLLM version (vllm==0.8.5.post1 and vllm==0.9.1); the same error occurs with both.

Also tried https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb#scrollTo=KKMyhvM-v0NE, which hangs during training; after stopping it, the same error appears: logger.info(f"Server is not up yet. Retrying in {retry_interval} seconds...")
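
One possible direction: if GRPOTrainer resolved to TRL's "server" vLLM mode, it will poll forever for an external trl vllm-serve process, whereas Unsloth's fast_inference=True is meant to generate with the in-process engine. Below is a minimal sketch of pinning the colocated path; vllm_mode is an assumption that only holds on recent TRL versions, so verify the parameter names against the installed TRL:

from trl import GRPOConfig

# Assumption: this TRL version exposes vllm_mode; "colocate" runs generation
# inside the training process instead of polling an external server.
training_args = GRPOConfig(
    use_vllm=True,
    vllm_mode="colocate",
    output_dir="output/qwen3-8b-grpo-0714",
)

Alternatively, if server mode is intended, the external server has to be launched first (e.g. with TRL's vllm-serve CLI) so the health check can succeed.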

Environment

Ubuntu
GPU:NVIDIA H20
Python 3.10.12

Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0

Other information

See above.

Verification

  • This issue hasn't been reported before
