-
Notifications
You must be signed in to change notification settings - Fork 1
feat: support llmaz model #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4cdd9fa
to
5b1bd49
Compare
Please fix the CI first. |
Better to draft a KEP as discussed first which help people to understand the PR better. |
I will do it tomorrow. |
7744f4e
to
e55f1bf
Compare
Pull Request Test Coverage Report for Build 15581344530Details
💛 - Coveralls |
for _, flavorNameVal := range strings.Split(serviceFlavorRawStr, ",") { | ||
flavor, ok := modelFlavorMap[llmazcoreapi.FlavorName(flavorNameVal)] | ||
if !ok { | ||
return nil, fmt.Errorf("unknown service inference flavor %q", flavorNameVal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just ignore the error here, it won't be destructive to our system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It aligns with https://github.com/InftyAI/scheduler-plugins/blob/main/pkg/plugins/resource_fungibility/resource_fungibility.go#L168. Do you want to change it as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if all flavors of the pod were removed from the target model but the model has other flavors? The pod fails to be scheduled or the pod is scheduled to unexpected flavor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's be more strict here, we can refactor this part later if needed.
} | ||
|
||
podCopy := pod.DeepCopy() | ||
pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will populate it with the copied pod and model flavors.
return flavor.Name, flavor | ||
}, | ||
) | ||
unknownFlavors := lo.Reject(strings.Split(serviceFlavorRawStr, ","), func(flavor string, _ int) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I think we can tolerate the error here. Generally this should be validated in creation in llmaz system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, users can remove or add an inference flavor from a model after a playground is created. Do you mean we should not allow this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same reason as #2 (comment). Once the scheduler supports it, I will change this PR.
node := ExpectScheduled(ctx, env.Client, pod) | ||
Expect(node.Labels).To(HaveKeyWithValue(corev1.LabelInstanceTypeStable, "gpu-vendor-b-instance-type")) | ||
}) | ||
It("shouldn't schedule to the in-flight node claim even if the node claim is compatible with second inference flavor when the first inference flavor is supported by node pools", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more here, I didn't quite get what you mean here. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say we have 2 pending pods and 1 model with 2 flavors. And the existing node pool can support both flavors. The first pod is associated with the second flavor via the inference service flavors annotation. The provisioning controller will try to schedule 2 pending pods in the same scheduling loop. A node claim (in-flight) will be created for the first pod. After placing the first pod, the remaining resources are enough to schedule the second pod. However, at the same time, the existing node pool can also schedule the second pod with the first flavor. The second node claim will be created. It won't assign the pod to the previous node claim unless the existing node pools do not meet the requirements of the first flavor.
kindly ping |
ca2fdf1
to
ead4b79
Compare
) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
0e2239d
to
72417b3
Compare
@kerthcet add ci support, please review it again. |
.github/workflows/aws.yaml
Outdated
- name: Show pushed image | ||
run: | | ||
echo "✅ Image pushed to:" | ||
echo "docker.io/carlory/aws-karpenter/controller:${{ steps.tag.outputs.image_tag }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to change this to inftyai/karpenter-provider-aws
, see https://hub.docker.com/repository/docker/inftyai/karpenter-provider-aws/general
- name: Login to DockerHub | ||
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 #v3 | ||
with: | ||
username: ${{ secrets.DOCKERHUB_USERNAME }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The secrets are added.
.github/workflows/aws.yaml
Outdated
runs-on: ubuntu-latest | ||
|
||
env: | ||
KO_DOCKER_REPO: docker.io/inftyai/aws-karpenter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change to karpenter-aws, incase we'll have karpenter-provider-alibaba
Co-authored-by: Kante Yin <[email protected]> Signed-off-by: carlory <[email protected]>
Signed-off-by: carlory <[email protected]>
@kerthcet rebased with upstream changes to fix ci error. |
/hold cancel |
/approve |
Let's have a test. |
Part of InftyAI/llmaz#106
Description
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.