Model with the bug
Qwen3-8B
Tutorial with the bug
Qwen3-8B GRPO fine-tuning with swanlab visualization
Tutorial maintainer
郭宣伯
Bug description
GPU:NVIDIA H20
Python 3.10.12
Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
vLLM client retries indefinitely ("Server is not up yet")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[2025-07-15 10:02:46]config.py: PyTorch version 2.7.0 available.
[2025-07-15 10:02:48,786] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-15 10:02:50 [__init__.py:244] Automatically detected platform cuda.
==((====))== Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
\\ /| NVIDIA H20. Num GPUs = 8. Max memory: 95.005 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /data/team/zhoujie195/models/FastApply-8B-v3.3 with actual GPU utilization = 69.7%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 95.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8096. Num Sequences = 368.
Unsloth: vLLM's KV Cache can use up to 50.8 GB. Also swap space = 6 GB.
INFO 07-15 10:03:06 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 07-15 10:03:06 [config.py:3271] Casting torch.float16 to torch.bfloat16.
INFO 07-15 10:03:06 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8096.
INFO 07-15 10:03:07 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data/team/zhoujie195/models/FastApply-8B-v3.3', speculative_config=None, tokenizer='/data/team/zhoujie195/models/FastApply-8B-v3.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/team/zhoujie195/models/FastApply-8B-v3.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"debug":false,"dce":true,"coordinate_descent_tuning":true,"trace.enabled":false,"trace.graph_diagram":false,"triton.cudagraphs":true,"compile_threads":48,"max_autotune":false,"disable_progress":false,"verbose_progress":true,"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-15 10:03:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f97c81e1210>
INFO 07-15 10:03:07 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-15 10:03:07 [topk_topp_sampler.py:36] FlashInfer version >= 0.2.3 required. Falling back to default sampling implementation.
INFO 07-15 10:03:07 [gpu_model_runner.py:1595] Starting to load model /data/team/zhoujie195/models/FastApply-8B-v3.3...
INFO 07-15 10:03:08 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-15 10:03:08 [cuda.py:227] Using FlashInfer backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:02<00:07, 2.46s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.46s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 1.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 2.02s/it]
INFO 07-15 10:03:16 [default_loader.py:272] Loading weights took 8.14 seconds
INFO 07-15 10:03:16 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 07-15 10:03:16 [gpu_model_runner.py:1624] Model loading took 15.4860 GiB and 8.487722 seconds
INFO 07-15 10:03:29 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/19ef34c7b7/rank_0_0 for vLLM's torch.compile
INFO 07-15 10:03:29 [backends.py:472] Dynamo bytecode transform time: 11.88 s
INFO 07-15 10:03:38 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 8.258 s
INFO 07-15 10:03:41 [monitor.py:34] torch.compile takes 11.88 s in total
INFO 07-15 10:03:42 [gpu_worker.py:227] Available KV cache memory: 48.58 GiB
INFO 07-15 10:03:42 [kv_cache_utils.py:715] GPU KV cache size: 353,760 tokens
INFO 07-15 10:03:42 [kv_cache_utils.py:719] Maximum concurrency for 8,096 tokens per request: 43.70x
INFO 07-15 10:04:22 [gpu_model_runner.py:2048] Graph capturing finished in 40 secs, took 1.00 GiB
INFO 07-15 10:04:23 [core.py:171] init engine (profile, create kv cache, warmup model) took 66.09 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth 2025.6.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
[2025-07-15 10:04:31]grpo-demo.py: <|im_start|>system
You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated.<|im_end|>
<|im_start|>user
Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.
<code>ALTER TABLE "ModelVersion" ADD COLUMN "uploadType" "ModelUploadType" NOT NULL DEFAULT 'Created';
UPDATE "ModelVersion" mv
SET "uploadType" = m."uploadType"
FROM "Model" m
WHERE m.id = mv."modelId";
-- rerun the above after push
</code>
<update>ALTER TABLE "ModelVersion" ADD COLUMN "isPublic" BOOLEAN NOT NULL DEFAULT false;
UPDATE "ModelVersion" mv
SET "isPublic" = m."isPublic"
FROM "Model" m
WHERE m.id = mv."modelId";
-- Add index on uploadType column
CREATE INDEX idx_modelversion_uploadtype ON "ModelVersion" ("uploadType");</update>
Provide the complete updated code.<|im_end|>
<|im_start|>assistant
<think>
</think>
[2025-07-15 10:04:31]grpo-demo.py: Max Length = 4235
[2025-07-15 10:04:32]other.py: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2025-07-15 10:04:32]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:34]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:36]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:38]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
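For context, these retry messages come from TRL's vllm_client.py, which polls a standalone vLLM server until its health check responds; with no server ever coming up, the loop never ends. A minimal sketch of that kind of polling loop is shown below (the host, port, and /health/ path are assumptions for illustration, not the exact TRL implementation):
import time
import requests

def wait_for_vllm_server(host="0.0.0.0", port=8000, retry_interval=2.0, total_timeout=240.0):
    # Hypothetical health-check loop matching the log messages above.
    url = f"http://{host}:{port}/health/"  # assumed endpoint
    start = time.time()
    while True:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return  # server is up
        except requests.exceptions.RequestException:
            pass  # server not reachable yet
        if time.time() - start > total_timeout:
            raise ConnectionError(f"No vLLM server responded at {url}")
        print(f"Server is not up yet. Retrying in {retry_interval} seconds...")
        time.sleep(retry_interval)
In this run the loop has no timeout effect visible in the logs: the client keeps retrying every 2 seconds and training never starts.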
Steps to reproduce
import logging
import os
import torch
import time
from util.log import init_log
from reward.merge_code import format_reward, conclusion_reward, matching_reward, hallucination_penalty

OUTPUT_DIR = "output/qwen3-8b-grpo-0714"
DATASET_PATH = 'train.jsonl'
MODEL_PATH = '/llm/Qwen3-8B'
MAX_SEQ_LENGTH = 8096
LORA_RANK = 32

# Only the main process writes the log file.
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    init_log('logs/lr_grpo_{}'.format(time.strftime("%Y-%m-%d", time.localtime())))

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=LORA_RANK,
    gpu_memory_utilization=0.7,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK * 2,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

from datasets import load_dataset

# Train on the first 200 examples only.
dataset = load_dataset('json', data_files=DATASET_PATH, split='train[:200]')

import re

def extract_answer(text):
    # Pull the reference answer out of the <updated-code>...</updated-code> tags.
    pattern = r'<updated-code>(.*?)</updated-code>'
    match_format = re.compile(pattern, flags=re.DOTALL)
    guess = match_format.search(text)
    if guess is not None:
        return guess.group(1)
    return None

dataset = dataset.map(
    lambda x: {
        "prompt": x['messages'][:2],
        "answer": extract_answer(x['messages'][2]["content"]),
    },
    remove_columns="messages",
)

tokenized = dataset.map(
    lambda x: {"tokens": tokenizer.apply_chat_template(x["prompt"], enable_thinking=False,
                                                       add_generation_prompt=True, tokenize=True)},
    batched=True,
)
logging.info(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L": len(x["tokens"])})

import numpy as np

# Keep prompts at or below the 90th-percentile token length.
maximum_length = int(np.quantile(tokenized["L"], 0.9))
logging.info("Max Length = %s" % maximum_length)
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized

max_prompt_length = maximum_length + 1
max_completion_length = MAX_SEQ_LENGTH - max_prompt_length

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    num_generations=4,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    num_train_epochs=1,
    save_steps=50,
    save_total_limit=3,
    gradient_checkpointing=True,
    report_to="none",
    output_dir=OUTPUT_DIR,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        conclusion_reward,
        matching_reward,
        hallucination_penalty,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

save_path = OUTPUT_DIR + "/lora_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
python grpo.py
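One thing I have not been able to confirm from the logs: recent TRL releases can run GRPO's vLLM generation either against a standalone server or colocated in the training process, and if the trainer ends up in server mode while Unsloth's in-process fast_inference engine is expected, the client will poll forever. A hedged sketch of the two things that could be checked (parameter and command names assume a recent TRL; verify against the installed version, this is not a confirmed fix):
from trl import GRPOConfig

# Option A (shell): start the standalone server the client is polling for, e.g.
#   trl vllm-serve --model /llm/Qwen3-8B
# (the `trl vllm-serve` command exists in recent TRL releases; check `trl vllm-serve --help`).

# Option B: keep generation inside the training process instead of server mode.
# `vllm_mode="colocate"` is an assumption about newer TRL versions; older
# versions may only support server mode.
training_args = GRPOConfig(
    output_dir="output/qwen3-8b-grpo-0714",
    use_vllm=True,
    vllm_mode="colocate",
)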
Expected behavior
I tried changing the vLLM version (vllm==0.8.5.post1 and vllm==0.9.1), but the same error still occurs.
I also tried https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb#scrollTo=KKMyhvM-v0NE; training hangs, and after stopping it I see the same message: logger.info(f"Server is not up yet. Retrying in {retry_interval} seconds...")
Environment
Ubuntu
GPU:NVIDIA H20
Python 3.10.12
Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
Additional information
See above.
Verification
- This issue hasn't been reported before