Description
Describe the bug
Following https://kaito-project.github.io/kaito/docs/multi-node-inference/#basic-multi-node-setup, increasing the Workspace resource count from the default of 1 to 2 does not work. I added `count: 2` to the resource section (see below):
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-gpt-oss-vllm-nc-a100
  namespace: openai
resource:
  count: 2
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      app: gpt-oss-120b-vllm
```
Once `count: 2` was added, another issue appeared: the admission webhook rejects the Workspace with:

> admission webhook "validation.workspace.kaito.sh" denied the request: validation failed: missing fields: max-model-len is required in the vllm section of the inference_config.yaml when using multi-GPU instances with <20GB of memory per GPU or distributed inference

Adding `--max-model-len "4096"` to the container args did not help; the same error occurs.
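The webhook message suggests the validator reads `max-model-len` from an `inference_config.yaml` rather than from container args. A minimal sketch of a ConfigMap carrying that file, assuming only what the error message names (the ConfigMap name `gpt-oss-inference-params` is hypothetical):

```yaml
# Hypothetical ConfigMap sketch: the name gpt-oss-inference-params is an
# assumption; the inference_config.yaml key and the vllm section are taken
# from the webhook error message.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpt-oss-inference-params
  namespace: openai
data:
  inference_config.yaml: |
    vllm:
      max-model-len: 4096
```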
```yaml
inference:
  template:
    spec:
      containers:
        - name: vllm-openai
          image:
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-120b
            - --swap-space
            - "4"
            - --gpu-memory-utilization
            - "0.95"
            - --port
            - "5000"
            - --max-model-len
            - "4096"
          ports:
            - name: http
              containerPort: 5000
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: "24"
              memory: "220Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "12"
              memory: "110Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 600
            periodSeconds: 1800
          env: # special configs for A10 GPU
            - name: VLLM_ATTENTION_BACKEND
              value: "TRITON_ATTN_VLLM_V1"
            - name: VLLM_DISABLE_SINKS
              value: "1"
```
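If the webhook validates parameters from an `inference_config.yaml`, the Workspace would likely need to reference a ConfigMap carrying that file instead of (or in addition to) passing `--max-model-len` as a container arg. A sketch, assuming the Workspace spec accepts an `inference.config` field naming a ConfigMap in the same namespace (both the field usage and the ConfigMap name are assumptions):

```yaml
inference:
  # config is assumed to name a ConfigMap (here the hypothetical
  # gpt-oss-inference-params) whose inference_config.yaml data key
  # contains a vllm section with max-model-len set.
  config: gpt-oss-inference-params
```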
Steps To Reproduce
1. Add `count: 2` to the resource section of the Workspace.
2. Add `--max-model-len "4096"` to the vllm container args.
Expected behavior
The Workspace should scale to 2 nodes.
Logs
Environment
AKS
- Kubernetes version (use kubectl version): 1.33.6
- OS (e.g. cat /etc/os-release):
- Install tools:
- Others:
Additional context