Model with the bug
Qwen3-8B
Tutorial with the bug
Qwen3-8B GRPO fine-tuning with swanlab visualization
Tutorial maintainer
郭宣伯
Bug description
GPU:NVIDIA H20
Python 3.10.12
Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
vLLM client retries indefinitely ("Server is not up yet")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
[2025-07-15 10:02:46]config.py: PyTorch version 2.7.0 available.
[2025-07-15 10:02:48,786] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-15 10:02:50 [__init__.py:244] Automatically detected platform cuda.
==((====))== Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
\\ /| NVIDIA H20. Num GPUs = 8. Max memory: 95.005 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
\ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /data/team/zhoujie195/models/FastApply-8B-v3.3 with actual GPU utilization = 69.7%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 95.0 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8096. Num Sequences = 368.
Unsloth: vLLM's KV Cache can use up to 50.8 GB. Also swap space = 6 GB.
INFO 07-15 10:03:06 [config.py:823] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 07-15 10:03:06 [config.py:3271] Casting torch.float16 to torch.bfloat16.
INFO 07-15 10:03:06 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=8096.
INFO 07-15 10:03:07 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/data/team/zhoujie195/models/FastApply-8B-v3.3', speculative_config=None, tokenizer='/data/team/zhoujie195/models/FastApply-8B-v3.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/team/zhoujie195/models/FastApply-8B-v3.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"debug":false,"dce":true,"coordinate_descent_tuning":true,"trace.enabled":false,"trace.graph_diagram":false,"triton.cudagraphs":true,"compile_threads":48,"max_autotune":false,"disable_progress":false,"verbose_progress":true,"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-15 10:03:07 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f97c81e1210>
INFO 07-15 10:03:07 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-15 10:03:07 [topk_topp_sampler.py:36] FlashInfer version >= 0.2.3 required. Falling back to default sampling implementation.
INFO 07-15 10:03:07 [gpu_model_runner.py:1595] Starting to load model /data/team/zhoujie195/models/FastApply-8B-v3.3...
INFO 07-15 10:03:08 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-15 10:03:08 [cuda.py:227] Using FlashInfer backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:02<00:07, 2.46s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.46s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 1.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:08<00:00, 2.02s/it]
INFO 07-15 10:03:16 [default_loader.py:272] Loading weights took 8.14 seconds
INFO 07-15 10:03:16 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 07-15 10:03:16 [gpu_model_runner.py:1624] Model loading took 15.4860 GiB and 8.487722 seconds
INFO 07-15 10:03:29 [backends.py:462] Using cache directory: /root/.cache/vllm/torch_compile_cache/19ef34c7b7/rank_0_0 for vLLM's torch.compile
INFO 07-15 10:03:29 [backends.py:472] Dynamo bytecode transform time: 11.88 s
INFO 07-15 10:03:38 [backends.py:135] Directly load the compiled graph(s) for shape None from the cache, took 8.258 s
INFO 07-15 10:03:41 [monitor.py:34] torch.compile takes 11.88 s in total
INFO 07-15 10:03:42 [gpu_worker.py:227] Available KV cache memory: 48.58 GiB
INFO 07-15 10:03:42 [kv_cache_utils.py:715] GPU KV cache size: 353,760 tokens
INFO 07-15 10:03:42 [kv_cache_utils.py:719] Maximum concurrency for 8,096 tokens per request: 43.70x
INFO 07-15 10:04:22 [gpu_model_runner.py:2048] Graph capturing finished in 40 secs, took 1.00 GiB
INFO 07-15 10:04:23 [core.py:171] init engine (profile, create kv cache, warmup model) took 66.09 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth 2025.6.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
[2025-07-15 10:04:31]grpo-demo.py: <|im_start|>system
You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated.<|im_end|>
<|im_start|>user
Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.
<code>ALTER TABLE "ModelVersion" ADD COLUMN "uploadType" "ModelUploadType" NOT NULL DEFAULT 'Created';
UPDATE "ModelVersion" mv
SET "uploadType" = m."uploadType"
FROM "Model" m
WHERE m.id = mv."modelId";
-- rerun the above after push
</code>
<update>ALTER TABLE "ModelVersion" ADD COLUMN "isPublic" BOOLEAN NOT NULL DEFAULT false;
UPDATE "ModelVersion" mv
SET "isPublic" = m."isPublic"
FROM "Model" m
WHERE m.id = mv."modelId";
-- Add index on uploadType column
CREATE INDEX idx_modelversion_uploadtype ON "ModelVersion" ("uploadType");</update>
Provide the complete updated code.<|im_end|>
<|im_start|>assistant
<think>
</think>
[2025-07-15 10:04:31]grpo-demo.py: Max Length = 4235
[2025-07-15 10:04:32]other.py: Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2025-07-15 10:04:32]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:34]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:36]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
[2025-07-15 10:04:38]vllm_client.py: Server is not up yet. Retrying in 2.0 seconds...
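For context, these retry messages come from TRL's vllm_client.py, which polls a standalone vLLM server until its health check responds; with no server ever coming up, the loop never ends. A minimal sketch of that kind of polling loop is shown below (the host, port, and /health/ path are assumptions for illustration, not the exact TRL implementation):
import time
import requests

def wait_for_vllm_server(host="0.0.0.0", port=8000, retry_interval=2.0, total_timeout=240.0):
    # Hypothetical health-check loop matching the log messages above.
    url = f"http://{host}:{port}/health/"  # assumed endpoint
    start = time.time()
    while True:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return  # server is up
        except requests.exceptions.RequestException:
            pass  # server not reachable yet
        if time.time() - start > total_timeout:
            raise ConnectionError(f"No vLLM server responded at {url}")
        print(f"Server is not up yet. Retrying in {retry_interval} seconds...")
        time.sleep(retry_interval)
In this run the loop has no timeout effect visible in the logs: the client keeps retrying every 2 seconds and training never starts.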
Steps to reproduce
import logging
import os
import torch
import time
from util.log import init_log
from reward.merge_code import format_reward, conclusion_reward, matching_reward, hallucination_penalty

OUTPUT_DIR = "output/qwen3-8b-grpo-0714"
DATASET_PATH = 'train.jsonl'
MODEL_PATH = '/llm/Qwen3-8B'
MAX_SEQ_LENGTH = 8096
LORA_RANK = 32

# Only the main process writes the log file.
if int(os.environ.get("LOCAL_RANK", 0)) == 0:
    init_log('logs/lr_grpo_{}'.format(time.strftime("%Y-%m-%d", time.localtime())))

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=False,
    fast_inference=True,
    max_lora_rank=LORA_RANK,
    gpu_memory_utilization=0.7,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=LORA_RANK * 2,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

from datasets import load_dataset

# Train on the first 200 examples only.
dataset = load_dataset('json', data_files=DATASET_PATH, split='train[:200]')

import re

def extract_answer(text):
    # Pull the reference answer out of the <updated-code>...</updated-code> tags.
    pattern = r'<updated-code>(.*?)</updated-code>'
    match_format = re.compile(pattern, flags=re.DOTALL)
    guess = match_format.search(text)
    if guess is not None:
        return guess.group(1)
    return None

dataset = dataset.map(
    lambda x: {
        "prompt": x['messages'][:2],
        "answer": extract_answer(x['messages'][2]["content"]),
    },
    remove_columns="messages",
)

tokenized = dataset.map(
    lambda x: {"tokens": tokenizer.apply_chat_template(x["prompt"], enable_thinking=False,
                                                       add_generation_prompt=True, tokenize=True)},
    batched=True,
)
logging.info(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L": len(x["tokens"])})

import numpy as np

# Keep prompts at or below the 90th-percentile token length.
maximum_length = int(np.quantile(tokenized["L"], 0.9))
logging.info("Max Length = %s" % maximum_length)
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized

max_prompt_length = maximum_length + 1
max_completion_length = MAX_SEQ_LENGTH - max_prompt_length

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    num_generations=4,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    num_train_epochs=1,
    save_steps=50,
    save_total_limit=3,
    gradient_checkpointing=True,
    report_to="none",
    output_dir=OUTPUT_DIR,
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        conclusion_reward,
        matching_reward,
        hallucination_penalty,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

save_path = OUTPUT_DIR + "/lora_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
python grpo.py
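One thing I have not been able to confirm from the logs: recent TRL releases can run GRPO's vLLM generation either against a standalone server or colocated in the training process, and if the trainer ends up in server mode while Unsloth's in-process fast_inference engine is expected, the client will poll forever. A hedged sketch of the two things that could be checked (parameter and command names assume a recent TRL; verify against the installed version, this is not a confirmed fix):
from trl import GRPOConfig

# Option A (shell): start the standalone server the client is polling for, e.g.
#   trl vllm-serve --model /llm/Qwen3-8B
# (the `trl vllm-serve` command exists in recent TRL releases; check `trl vllm-serve --help`).

# Option B: keep generation inside the training process instead of server mode.
# `vllm_mode="colocate"` is an assumption about newer TRL versions; older
# versions may only support server mode.
training_args = GRPOConfig(
    output_dir="output/qwen3-8b-grpo-0714",
    use_vllm=True,
    vllm_mode="colocate",
)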
Expected behavior
I tried changing the vLLM version (vllm==0.8.5.post1 and vllm==0.9.1), but the same error still occurs.
I also tried https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb#scrollTo=KKMyhvM-v0NE; training hangs, and after stopping it I see the same message: logger.info(f"Server is not up yet. Retrying in {retry_interval} seconds...")
Environment
Ubuntu
GPU:NVIDIA H20
Python 3.10.12
Unsloth 2025.6.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.9.1.
Torch: 2.7.0+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.0
Additional information
See above.
Verification
- This issue hasn't been reported before