Skip to content

feat: support llmaz model #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 13, 2025
Merged

feat: support llmaz model #2

merged 12 commits into from
Jun 13, 2025

Conversation

carlory
Copy link
Member

@carlory carlory commented Jun 3, 2025

Part of InftyAI/llmaz#106

Description

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@InftyAI-Agent InftyAI-Agent added needs-triage Indicates an issue or PR lacks a label and requires one. needs-priority Indicates a PR lacks a label and requires one. do-not-merge/needs-kind Indicates a PR lacks a label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 3, 2025
@InftyAI-Agent InftyAI-Agent requested a review from kerthcet June 3, 2025 10:24
@carlory carlory force-pushed the llmaz branch 5 times, most recently from 4cdd9fa to 5b1bd49 Compare June 3, 2025 10:52
@kerthcet
Copy link
Member

kerthcet commented Jun 3, 2025

Please fix the CI first.

@kerthcet
Copy link
Member

kerthcet commented Jun 3, 2025

Better to draft a KEP as discussed first which help people to understand the PR better.

@carlory
Copy link
Member Author

carlory commented Jun 3, 2025

Better to draft a KEP as discussed first which help people to understand the PR better.

I will do it tomorrow.

@carlory carlory changed the title feat: support llmaz model [WIP] feat: support llmaz model Jun 3, 2025
@carlory carlory force-pushed the llmaz branch 4 times, most recently from 7744f4e to e55f1bf Compare June 4, 2025 05:53
@coveralls
Copy link

coveralls commented Jun 4, 2025

Pull Request Test Coverage Report for Build 15581344530

Details

  • 132 of 146 (90.41%) changed or added relevant lines in 4 files are covered.
  • 9 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.09%) to 82.091%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/operator/operator.go 0 1 0.0%
pkg/test/openmodel.go 12 13 92.31%
pkg/controllers/provisioning/provisioner.go 12 18 66.67%
pkg/controllers/provisioning/scheduling/modelinference.go 108 114 94.74%
Files with Coverage Reduction New Missed Lines %
pkg/test/expectations/expectations.go 2 94.67%
pkg/controllers/provisioning/scheduling/nodeclaim.go 3 89.66%
pkg/controllers/provisioning/scheduling/topologydomaingroup.go 4 86.21%
Totals Coverage Status
Change from base Build 15407776082: 0.09%
Covered Lines: 10387
Relevant Lines: 12653

💛 - Coveralls

@carlory carlory changed the title [WIP] feat: support llmaz model feat: support llmaz model Jun 6, 2025
for _, flavorNameVal := range strings.Split(serviceFlavorRawStr, ",") {
flavor, ok := modelFlavorMap[llmazcoreapi.FlavorName(flavorNameVal)]
if !ok {
return nil, fmt.Errorf("unknown service inference flavor %q", flavorNameVal)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just ignore the error here, it won't be destructive to our system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if all flavors of the pod were removed from the target model but the model has other flavors? The pod fails to be scheduled or the pod is scheduled to unexpected flavor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's be more strict here, we can refactor this part later if needed.

}

podCopy := pod.DeepCopy()
pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will populate it with the copied pod and model flavors.

return flavor.Name, flavor
},
)
unknownFlavors := lo.Reject(strings.Split(serviceFlavorRawStr, ","), func(flavor string, _ int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I think we can tolerate the error here. Generally this should be validated in creation in llmaz system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, users can remove or add an inference flavor from a model after a playground is created. Do you mean we should not allow this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same reason as #2 (comment). Once the scheduler supports it, I will change this PR.

node := ExpectScheduled(ctx, env.Client, pod)
Expect(node.Labels).To(HaveKeyWithValue(corev1.LabelInstanceTypeStable, "gpu-vendor-b-instance-type"))
})
It("shouldn't schedule to the in-flight node claim even if the node claim is compatible with second inference flavor when the first inference flavor is supported by node pools", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more here, I didn't quite get what you mean here. Thanks.

Copy link
Member Author

@carlory carlory Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we have 2 pending pods and 1 model with 2 flavors. And the existing node pool can support both flavors. The first pod is associated with the second flavor via the inference service flavors annotation. The provisioning controller will try to schedule 2 pending pods in the same scheduling loop. A node claim (in-flight) will be created for the first pod. After placing the first pod, the remaining resources are enough to schedule the second pod. However, at the same time, the existing node pool can also schedule the second pod with the first flavor. The second node claim will be created. It won't assign the pod to the previous node claim unless the existing node pools do not meet the requirements of the first flavor.

@kerthcet
Copy link
Member

kindly ping

@carlory carlory force-pushed the llmaz branch 2 times, most recently from ca2fdf1 to ead4b79 Compare June 10, 2025 03:28
@InftyAI-Agent InftyAI-Agent added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Looks good to me, indicates that a PR is ready to be merged. labels Jun 10, 2025
)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@InftyAI-Agent InftyAI-Agent removed the lgtm Looks good to me, indicates that a PR is ready to be merged. label Jun 11, 2025
@carlory carlory force-pushed the llmaz branch 2 times, most recently from 0e2239d to 72417b3 Compare June 11, 2025 09:34
@carlory
Copy link
Member Author

carlory commented Jun 11, 2025

@kerthcet add ci support, please review it again.

- name: Show pushed image
run: |
echo "✅ Image pushed to:"
echo "docker.io/carlory/aws-karpenter/controller:${{ steps.tag.outputs.image_tag }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to change this to inftyai/karpenter-provider-aws, see https://hub.docker.com/repository/docker/inftyai/karpenter-provider-aws/general

- name: Login to DockerHub
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 #v3
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The secrets are added.

runs-on: ubuntu-latest

env:
KO_DOCKER_REPO: docker.io/inftyai/aws-karpenter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change to karpenter-aws, incase we'll have karpenter-provider-alibaba

carlory and others added 2 commits June 13, 2025 17:40
Co-authored-by: Kante Yin <[email protected]>
Signed-off-by: carlory <[email protected]>
Signed-off-by: carlory <[email protected]>
@carlory
Copy link
Member Author

carlory commented Jun 13, 2025

@kerthcet rebased with upstream changes to fix ci error.

@carlory
Copy link
Member Author

carlory commented Jun 13, 2025

/hold cancel

@InftyAI-Agent InftyAI-Agent removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2025
@kerthcet
Copy link
Member

/approve
/lgtm
/kind feature

@kerthcet
Copy link
Member

Let's have a test.

@InftyAI-Agent InftyAI-Agent added lgtm Looks good to me, indicates that a PR is ready to be merged. feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/needs-kind Indicates a PR lacks a label and requires one. labels Jun 13, 2025
@InftyAI-Agent InftyAI-Agent merged commit 2d90e01 into InftyAI:main Jun 13, 2025
36 of 37 checks passed
@carlory carlory deleted the llmaz branch June 13, 2025 10:43
@carlory carlory mentioned this pull request Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. feature Categorizes issue or PR as related to a new feature. lgtm Looks good to me, indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a label and requires one. needs-triage Indicates an issue or PR lacks a label and requires one.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants