
Rename resources to be model server generic instead of referencing vLLM #676


Open
wants to merge 1 commit into main

Conversation

BenjaminBraunDev
Contributor

Renames the InferencePool, Services, and Deployments to not have vLLM in the name in preparation for adding Triton support.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Apr 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BenjaminBraunDev
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Apr 10, 2025
@k8s-ci-robot
Contributor

Hi @BenjaminBraunDev. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot requested review from Jeffwan and kfswain on April 10, 2025 17:15
@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Apr 10, 2025

netlify bot commented Apr 10, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit f2a9b34
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67f7fca5d1845600082eda09
😎 Deploy Preview https://deploy-preview-676--gateway-api-inference-extension.netlify.app

@nirrozenbaum
Contributor

/ok-to-test
@BenjaminBraunDev you probably need to change this in the tests as well.

@k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Apr 10, 2025
@ahg-g
Contributor

ahg-g commented Apr 10, 2025

@nirrozenbaum @kfswain The intent behind this change is to introduce Triton as another option for deploying the model server. This is kinda the same naming challenge we are having with the cpu-based deployment.

Any thoughts on how we want to do this? Do we want to keep the InferencePool model- and model-server-specific? In that case we will need to maintain a yaml for each flavor, and that will also require duplication in the vendor-specific manifests.

@nirrozenbaum
Contributor

@nirrozenbaum @kfswain The intent behind this change is to introduce Triton as another option for deploying the model server. This is kinda the same naming challenge we are having with the cpu-based deployment.

Any thoughts on how we want to do this? Do we want to keep the InferencePool model- and model-server-specific? In that case we will need to maintain a yaml for each flavor, and that will also require duplication in the vendor-specific manifests.

@ahg-g we can take the discussion to an issue, outside the PR, so we don't lose track. Writing a brain dump of a few thoughts I had about this topic:

  • we're working with a single InferencePool. Maybe the label selector for selecting pods can be a general label; similar to how Prometheus has "scrape: true", we could have a label on pods like "inference-routing: true" (or something along these lines, sketched below).
  • the service and deployment names can also become general, e.g., inference-routing-svc or something like that.
  • if one wants to deploy multiple pools, we can do that in multiple namespaces, but as of now this is not relevant.
  • having a general label like "inference-routing: true" allows mixing model servers from different vendors in the same pool, which I don't know if needed but sounds pretty nice.
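A rough sketch of the generic-label idea, assuming the current v1alpha2 InferencePool fields (selector, targetPortNumber, extensionRef); the label and resource names here are illustrative, not proposals for the final manifests:

```yaml
# Hypothetical generic labeling: any model server (vLLM, Triton, ...) opts into
# routing by carrying the same label, and the pool selects on that label alone.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: inference-pool
spec:
  selector:
    inference-routing: "true"      # generic label instead of a vLLM-specific app label
  targetPortNumber: 8000
  extensionRef:
    name: inference-routing-epp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment            # could just as well be a Triton deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      inference-routing: "true"
  template:
    metadata:
      labels:
        inference-routing: "true"  # the only label the pool cares about
    spec:
      containers:
      - name: model-server
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
```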

@kfswain
Collaborator

kfswain commented Apr 10, 2025

Any thoughts on how we want to do this?

We could make this configurable in the helm chart? And for the non-configurable guide, I think it's okay to leave vllm for now? It's less about a specific vendor, and more about helping portray the idea of what an InferencePool should look like.

Or, in theory, we could have our GPU deployment actually just be multiple model servers with the same labels; that could be useful to show the heterogeneity of a pool. Potentially similar to what Nir is thinking, but without locking down a specific label name (we should allow a user to have multiple pools in the same namespace if desired; personally I see no reason to impose that restriction).

@nirrozenbaum
Contributor

Yes, I agree there is no need to restrict to one InferencePool per namespace. It was just a brain dump, trying to think out loud.
Another point: maybe it would be easier if the InferencePool selects the InferenceModel (by labels) and not the other way around, where the InferenceModel specifies the pool ref. The current approach is also not so intuitive, and it uses the pool name, which forces selection by name instead of by labels.
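For comparison, roughly what the two directions would look like; the first half reflects the current API as I understand it, while the modelSelector field in the second half is entirely made up:

```yaml
# Today: the InferenceModel points at the pool by name via poolRef.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary
  poolRef:
    name: my-inference-pool        # name-based reference, model -> pool
---
# Hypothetical alternative: the pool selects models by label, pool -> model.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  selector:                        # existing pod selector
    app: my-model-server
  modelSelector:                   # made-up field, not part of the current API
    matchLabels:
      inference-pool: my-inference-pool
```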

@BenjaminBraunDev
Contributor Author

BenjaminBraunDev commented Apr 10, 2025

My thinking was that, for the time being, we could have the model server branches meet at the InferencePool level defined in inferencepool-resources.yaml, and switch out that app label to target whichever model server deployment the customer is using (vLLM, Triton, etc.).

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

Other options include having separate InferencePools or EPPs for the different model servers.
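A sketch of the two switch-points being described, not the actual manifest; the resource names, EPP image, and exact flag names are assumptions, and the metric values shown are the vLLM defaults:

```yaml
# Sketch of the two knobs in inferencepool-resources.yaml that change per model server.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  selector:
    app: triton-llama3-deployment          # 1. swap this label per model server
  targetPortNumber: 8000
  extensionRef:
    name: my-inference-pool-epp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-inference-pool-epp
spec:
  selector:
    matchLabels:
      app: my-inference-pool-epp
  template:
    metadata:
      labels:
        app: my-inference-pool-epp
    spec:
      containers:
      - name: epp
        image: registry.k8s.io/gateway-api-inference-extension/epp:main
        args:
        - --poolName=my-inference-pool
        # 2. metric flags: these vLLM metric names would be replaced with the
        #    equivalent Triton/TensorRT-LLM metric names for a Triton pool.
        - --totalQueuedRequestsMetric=vllm:num_requests_waiting
        - --kvCacheUsagePercentageMetric=vllm:gpu_cache_usage_perc
```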

@kfswain
Collaborator

kfswain commented Apr 10, 2025

Yes, I agree there is no need to restrict to one InferencePool per namespace. It was just a brain dump, trying to think out loud.
Another point: maybe it would be easier if the InferencePool selects the InferenceModel (by labels) and not the other way around, where the InferenceModel specifies the pool ref. The current approach is also not so intuitive, and it uses the pool name, which forces selection by name instead of by labels.

Yeah, NP. That's an interesting idea, and would definitely work as long as we keep the InferencePool as the backend ref. The InferencePool would then have 2 selectors. But I could see that being an option.

@kfswain
Collaborator

kfswain commented Apr 10, 2025

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

These?

I was hoping that with Triton we would have found a way to merge the two. That was kind of the idea behind the Model Server Protocol. Is that an intractable problem due to Prometheus naming? Ideally we should get to a state where the EPP does not care about the underlying serving mechanism and can serve heterogeneous model servers.

@liu-cong
Contributor

Having a heterogeneous pool is cool, but we are not there yet. Currently there are configurations, such as the scheduler config, that are likely specific to the model server deployment. I would argue for not mixing different model servers in the user guides. I prefer to have different label selectors for vLLM and Triton.

BTW, https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/537/files adds the helm template for the Triton deployment. We can modify the charts to update the InferencePool label selector if Triton is specified (sketch below).

This PR is reasonable to me, if we also update the helm charts.
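One way that could look in the chart, as a hypothetical template snippet; the value keys and label names are made up and not taken from #537:

```yaml
# Hypothetical snippet of an InferencePool helm template: the label selector
# is driven by a modelServer.type value rather than hard-coded to vLLM.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: {{ .Release.Name }}
spec:
  selector:
    {{- if eq .Values.modelServer.type "triton" }}
    app: {{ .Release.Name }}-triton
    {{- else }}
    app: {{ .Release.Name }}-vllm
    {{- end }}
  targetPortNumber: {{ .Values.modelServer.port }}
  extensionRef:
    name: {{ .Release.Name }}-epp
```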

@ahg-g
Contributor

ahg-g commented Apr 10, 2025

Thanks all for the feedback; this is more about how we want to organize the examples in this repo.

We have three examples:

  • vLLM on GPU running Llama3
  • vLLM on CPU running Qwen
  • Triton on GPU running Llama3

IMO we should have all three examples in the repo. If we agree to that, then the question is how we want to organize the manifests to accommodate all three.

Option 1: make the inference-pool name generic (my-inference-pool) so that the provider-specific gateway manifests are the same across all examples. We also remove the inferencepool-resources.yaml file, and in the guide we rely on the helm chart to create the InferencePool and customize it to select the right model server deployment.

Option 2: continue to be specific in the InferencePool name, which means we need to create a separate set of provider-specific gateway manifests for each example.
Option 2.1: create a separate set of manifests explicitly.
Option 2.2: create a helm chart to deploy the gateway; this helm chart will have a provider flag and an inferencePool name flag (sketched below).

I am in favor of option 2.2
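To make option 2.2 concrete, the chart interface might be as small as this; a sketch only, where the value keys and provider names are assumptions rather than an existing chart:

```yaml
# Hypothetical values.yaml for a shared gateway chart under option 2.2.
# The two flags mentioned above map to two top-level values.
provider:
  name: istio                      # or gke, kgateway, ... picks provider-specific manifests
inferencePool:
  name: vllm-llama3-8b-instruct    # each example keeps its own explicit pool name
```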

@nirrozenbaum
Contributor

IMO this discussion hides an even more important question: what is the mental model of the GIE InferencePool?

If the mental model, and what we want people to interpret when they see the code/yamls, is that an InferencePool is per base model, then it makes sense to use the base model name in the resources.

On the other hand, if the pool can be used for multiple base models, it doesn't really make sense to use the base model name in the resource name, and it could be very confusing.

When I first read the GIE documentation I got the impression of the latter, but the manifests look like the former.

@ahg-g
Contributor

ahg-g commented Apr 11, 2025

As of now, the EPP implementation assumes an InferencePool serves a single base model; multi-tenancy comes from adapters. The API doesn't enforce that though. I think we should clarify that in the docs and align the examples with what is currently supported.
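Concretely, that assumption looks roughly like this with the v1alpha2 API as I understand it; the names are illustrative:

```yaml
# Sketch of the current assumption: one base model behind the pool, with tenants
# expressed as InferenceModels whose target models are LoRA adapters.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  modelName: tweet-summarizer
  criticality: Standard
  poolRef:
    name: my-inference-pool        # pool serving a single base model (e.g. llama3)
  targetModels:
  - name: tweet-summary-lora       # LoRA adapter loaded on that base model
    weight: 100
```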

@nirrozenbaum
Contributor

nirrozenbaum commented Apr 11, 2025

Yes. If we assume a single base model per pool, that should be documented clearly.
Option 1 from the ones you suggested is not aligned with that mental model.
I'm in favor of option 2.1: create a different set of manifests for each. It's a very common pattern to see an "examples/samples" dir with different manifests for each sample.
It should also be easier to e2e-test a new manifest this way, as the tests can take the path to the manifest file.

@ahg-g
Contributor

ahg-g commented Apr 11, 2025

My concern is keeping them in sync; this will lead to a fair amount of duplication.

@liu-cong
Contributor

I don't have a strong opinion, but I am leaning towards 2.2: maintain a simple, more static example in plain manifests, while moving more dynamic configurations to helm. If we agree on this simple principle, it can save a lot of discussions like this in the future.

@kfswain
Collaborator

kfswain commented Apr 11, 2025

I am in favor of option 2.2

++

If we can't make heterogeneous pools, I think it's actually more important to name the pools to reflect their mono-model-server nature, and doing so in a helm chart seems the most maintainable.

I think @nirrozenbaum brings up reasonable points but I was gonna say:

My concern is keeping them in sync; this will lead to a fair amount of duplication.

I worry this would set a toil-heavy pattern.

@BenjaminBraunDev
Contributor Author

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

These?

I was hoping that with Triton we would have found a way to merge the two. That was kind of the idea behind the Model Server Protocol. Is that an intractable problem due to Prometheus naming? Ideally we should get to a state where the EPP does not care about the underlying serving mechanism and can serve heterogeneous model servers.

This was something we discussed in depth in terms of what to expect from other model servers, but at the end of the day servers like Triton have hundreds of Prometheus metrics that all follow a consistent naming convention separate from vLLM's (the names Gateway defaults to). What we did do is add new metrics to the TensorRT-LLM backend that weren't there before, to avoid having to combine multiple metrics into single metrics on Gateway's side.

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Apr 17, 2025
@danehans
Contributor

This PR updates the naming for the GKE gateway. For consistency, all presently supported gateways should be updated to match.

What we did do is add new metrics to the TensorRT-LLM backend that weren't there before, to avoid having to combine multiple metrics into single metrics on Gateway's side.

I see this PR added support for the required metrics defined by the model server protocol. However, I can't tell if Triton supports the LoRA adapter section of the model server protocol. I wouldn't think so due to:

Note the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic LoRA implementation.

If this is the case, how can Triton be included if it does not support the requirements stated in the model server protocol?

@danehans
Contributor

xref #710

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants