Rename resources to be model server generic instead of referencing vLLM #676
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: BenjaminBraunDev. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @BenjaminBraunDev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
/ok-to-test
@nirrozenbaum @kfswain The intent behind this change is to introduce Triton as another option for deploying the model server. This is kind of the same naming challenge we are having with the CPU-based deployment. Any thoughts on how we want to do this? Do we want the InferencePool to continue to be model- and model-server-specific? In that case we will need to maintain a yaml for each flavor, which will also require duplication in the vendor-specific manifests.
@ahg-g we can take the discussion to an issue outside the PR so we don't lose track. Writing a brain dump of a few thoughts I had about this topic:
We could make this configurable in the helm chart? And for the non-configurable guide, I think it's okay to leave vllm for now? It's less about a specific vendor, and more about helping portray what shape an InferencePool should take. Or, in theory, we could have our GPU deployment actually be multiple model servers with the same labels; that could be useful to show the heterogeneity of a pool (sketched below). Potentially similar to what Nir is thinking, but without locking down a specific label name. (We should allow a user to have multiple pools in the same namespace if desired; personally I see no reason to impose that restriction.)
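A minimal sketch of that heterogeneous-pool idea, assuming the v1alpha2 InferencePool API: one pool selects pods from a vLLM Deployment and a Triton Deployment that share a model-server-agnostic label. The resource names, labels, and images below are illustrative assumptions, not the repo's actual manifests.

```yaml
# Sketch only: one InferencePool selecting pods from two Deployments (vLLM and
# Triton) that share a model-server-agnostic label. All names, labels, and
# images here are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool              # generic name, no model server in it
spec:
  targetPortNumber: 8000
  selector:
    app: llama-server           # both Deployments below carry this label
  extensionRef:
    name: llama-epp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest        # illustrative
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:latest   # illustrative
```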
Yes, I agree there is no need to restrict one InferencePool per namespace. It was just a brain dump, trying to think out loud.
My thinking was that, for the time being, we could have the model server branches meet at the InferencePool level. The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM have different metric names, so both of those have to change as well (see the sketch below). Other options include having separate InferencePools or EPPs for the different model servers.
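To make the "switch out the metric flags" point concrete, here is a sketch of the EPP Deployment args that would differ per model server. The flag names, metric names, and image tag are assumptions for illustration (verify against the EPP's actual flags and the model server's /metrics output); only the idea matters: the same EPP image, different args.

```yaml
# Sketch of the per-model-server flag swap in the EPP Deployment.
containers:
  - name: epp
    image: registry.k8s.io/gateway-api-inference-extension/epp:main   # illustrative tag
    args:
      - -poolName
      - "llama-pool"
      # vLLM-flavored metric names (vLLM's actual Prometheus metrics):
      - -totalQueuedRequestsMetric
      - "vllm:num_requests_waiting"
      - -kvCacheUsagePercentageMetric
      - "vllm:gpu_cache_usage_perc"
      # For a Triton/TensorRT-LLM deployment, these two values would be replaced
      # with the equivalent Triton-exported metric names (hypothetical example):
      # - -totalQueuedRequestsMetric
      # - "nv_trt_llm_request_metrics{request_type=\"waiting\"}"
```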
Yeah, np. That's an interesting idea, and it would definitely work as long as we keep the InferencePool as the backend ref. The InferencePool would then have 2 selectors. But I could see that being an option.
These?
I was hoping that with Triton we would have found a way to merge the two. That was kind of the idea behind the Model Server Protocol. Is that an intractable problem due to Prometheus naming? Ideally we should get to a state where the EPP does not care about the underlying serving mechanism and can serve heterogeneous model servers.
Having a heterogeneous pool is cool, but we are not there yet. Currently there are configurations, such as the scheduler config, that are likely specific to the model server deployment. I would argue for not mixing different model servers in the user guides. I prefer to have different label selectors for vLLM and Triton. BTW, https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/537/files adds the helm template for the Triton deployment. We can modify the charts to update the InferencePool label selector if Triton is specified (see the sketch below). This PR is reasonable to me, if we also update the helm charts.
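A hypothetical shape for that chart change: a values key selects the model server, and the chart derives the InferencePool label selector from it. The values keys and template path below are invented for illustration and are not the chart's actual schema.

```yaml
# values.yaml (hypothetical keys)
inferencePool:
  name: llama-pool
  modelServerType: triton        # or: vllm
  targetPortNumber: 8000

# templates/inferencepool.yaml (fragment, hypothetical)
spec:
  targetPortNumber: {{ .Values.inferencePool.targetPortNumber }}
  selector:
    app: {{ printf "%s-%s" .Values.inferencePool.name .Values.inferencePool.modelServerType }}
```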
Thanks all for the feedback; this is more about how we want to organize the examples in this repo. We have three examples:
IMO we should have all three examples in the repo. If we agree to that, then the question is how we want to organize the manifests to accommodate all three. Option 1: make the inference-pool name generic. Option 2: continue to be specific in the InferencePool name, which means we need to create a separate set of provider-specific gateway manifests for each example. I am in favor of option 2.2.
IMO this discussion hides an even more important question: what is the mental model of the GIE InferencePool? If the mental model, and what we want people to interpret when they see the code/yamls, is that an InferencePool is per base model, then it makes sense to use the base model name in the resources. On the other hand, if the pool can be used for multiple base models, it doesn't really make sense to use the base model name in the resource name, and it could be very confusing. When I first read the GIE documentation I saw option 2, but the manifests look like option 1.
As of now, the EPP implementation assumes an InferencePool serves a single base model; multi-tenancy comes from adapters (illustrated below). The API doesn't enforce that, though. I think we should clarify that in the docs and align the examples with what is currently supported.
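A sketch of that current assumption: one InferencePool per base model, with multi-tenancy coming from LoRA adapters expressed as InferenceModel resources that all reference the same pool. Resource names and field values are illustrative; check the v1alpha2 API in the repo.

```yaml
# Two LoRA adapters (InferenceModels) sharing one single-base-model pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: adapter-a
spec:
  modelName: adapter-a        # LoRA adapter served on the base model
  criticality: Critical
  poolRef:
    name: llama-pool          # the single-base-model pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: adapter-b
spec:
  modelName: adapter-b
  criticality: Standard
  poolRef:
    name: llama-pool
```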
Yes. If we assume a single base model per pool, that should be documented clearly.
My concern is keeping them in sync; this will lead to a fair amount of duplication.
I don't have a strong opinion, but I am leaning towards 2.2: maintain a simple, more static example in plain manifests, while moving more dynamic configurations to helm. If we agree on this simple principle, it can save a lot of discussions like this in the future.
++ If we can't make heterogeneous pools, I think it's actually more important to name the pools to reflect their mono-model-server nature, and doing so in a helm chart seems the most maintainable. I think @nirrozenbaum brings up reasonable points, but I was gonna say:
I worry this would set a toil-heavy pattern.
This was something we discussed in depth in terms of what to expect from other model servers, but at the end of the day servers like Triton have hundreds of Prometheus metrics that all follow a consistent naming convention separate from vLLM (the names the Gateway defaults to). What we did do is add new metrics to the TensorRT-LLM backend that weren't there before, to avoid having to combine multiple metrics into single metrics on the Gateway's side.
PR needs rebase.
This PR updates the naming for the GKE gateway. For consistency, all presently supported gateways should be updated to match.
I see this PR added support for the required metrics defined by the Model Server Protocol. However, I can't tell if Triton supports the LoRA adapter section of the Model Server Protocol. I wouldn't think so due to:
If this is the case, how can Triton be included if it does not support the requirements stated in the Model Server Protocol?
xref #710
Renames the InferencePool, Services, and Deployments to not have vLLM in the name in preparation for adding Triton support.
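To illustrate the flavor of the rename, here is a hypothetical before/after; the actual resource names in the manifests may differ, and the point is simply dropping the model-server prefix from resource names.

```yaml
# Hypothetical before/after naming:
#   InferencePool:  vllm-llama2-7b-pool  ->  llama2-7b-pool
#   Service:        vllm-llama2-7b-epp   ->  llama2-7b-epp
#   Deployment:     vllm-llama2-7b       ->  llama2-7b
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama2-7b-pool     # was: vllm-llama2-7b-pool (illustrative old name)
```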