
Rename resources to be model server generic instead of referencing vLLM #676


Open
wants to merge 1 commit into main

Conversation

BenjaminBraunDev
Contributor

Renames the InferencePool, Services, and Deployments to not have vLLM in the name in preparation for adding Triton support.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Apr 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BenjaminBraunDev
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Apr 10, 2025
@k8s-ci-robot
Contributor

Hi @BenjaminBraunDev. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot requested review from Jeffwan and kfswain on April 10, 2025 17:15
@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Apr 10, 2025

netlify bot commented Apr 10, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit f2a9b34
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67f7fca5d1845600082eda09
😎 Deploy Preview https://deploy-preview-676--gateway-api-inference-extension.netlify.app

@nirrozenbaum
Contributor

/ok-to-test
@BenjaminBraunDev you probably need to change this in the tests as well.

@k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Apr 10, 2025
@ahg-g
Contributor

ahg-g commented Apr 10, 2025

@nirrozenbaum @kfswain The intent behind this change is to introduce Triton as another option for deploying the model server. This is kinda the same naming challenge we are having with the cpu-based deployment.

Any thoughts on how we want to do this? Do we want to keep the InferencePool model- and model-server-specific? In that case we will need to maintain a yaml for each flavor, and that will also require duplication in the vendor-specific manifests.

@nirrozenbaum
Contributor

@nirrozenbaum @kfswain The intent behind this change is to introduce Triton as another option for deploying the model server. This is kinda the same naming challenge we are having with the cpu-based deployment.

Any thoughts on how we want to do this? Do we want to keep the InferencePool model- and model-server-specific? In that case we will need to maintain a yaml for each flavor, and that will also require duplication in the vendor-specific manifests.

@ahg-g we can take the discussion to an issue, outside the PR, so we don't lose track. Writing a brain dump of a few thoughts I had about this topic:

  • we're working with a single InferencePool. Maybe the label selector for selecting pods can be a general label; similar to how Prometheus has "scrape: true", we could have a label on pods like "inference-routing: true" (or something along these lines, sketched below).
  • the service and deployment names can also become general, e.g., inference-routing-svc or something like that.
  • if one wants to deploy multiple pools, we can do that in multiple namespaces, but as of now this is not relevant.
  • having a general label like "inference-routing: true" allows mixing model servers from different vendors in the same pool, which I don't know if needed but sounds pretty nice.
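A rough sketch of the generic-label idea, assuming the current v1alpha2 InferencePool fields (selector, targetPortNumber, extensionRef); the label and resource names here are illustrative, not proposals for the final manifests:

```yaml
# Hypothetical generic labeling: any model server (vLLM, Triton, ...) opts into
# routing by carrying the same label, and the pool selects on that label alone.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: inference-pool
spec:
  selector:
    inference-routing: "true"      # generic label instead of a vLLM-specific app label
  targetPortNumber: 8000
  extensionRef:
    name: inference-routing-epp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment            # could just as well be a Triton deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      inference-routing: "true"
  template:
    metadata:
      labels:
        inference-routing: "true"  # the only label the pool cares about
    spec:
      containers:
      - name: model-server
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
```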

@kfswain
Collaborator

kfswain commented Apr 10, 2025

Any thoughts on how we want to do this?

We could make this configurable in the helm chart? And for the non-configurable guide, I think it's okay to leave vllm for now? It's less about a specific vendor, and more about helping portray the idea of what an InferencePool should look like.

Or, in theory, we could have our GPU deployment actually just be multiple model servers with the same labels; that could be useful to show the heterogeneity of a pool. Potentially similar to what Nir is thinking, but without locking down a specific label name (we should allow a user to have multiple pools in the same namespace if desired; personally I see no reason to impose that restriction).

@nirrozenbaum
Contributor

Yes, I agree there is no need to restrict to one InferencePool per namespace. It was just a brain dump, trying to think out loud.
Another point: maybe it would be easier if the InferencePool selects the InferenceModel (by labels) and not the other way around, where the InferenceModel specifies the pool ref. The current approach is also not so intuitive, and it uses the pool name, which forces selection by name instead of by labels.
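For comparison, roughly what the two directions would look like; the first half reflects the current API as I understand it, while the modelSelector field in the second half is entirely made up:

```yaml
# Today: the InferenceModel points at the pool by name via poolRef.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary
  poolRef:
    name: my-inference-pool        # name-based reference, model -> pool
---
# Hypothetical alternative: the pool selects models by label, pool -> model.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  selector:                        # existing pod selector
    app: my-model-server
  modelSelector:                   # made-up field, not part of the current API
    matchLabels:
      inference-pool: my-inference-pool
```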

@BenjaminBraunDev
Contributor Author

BenjaminBraunDev commented Apr 10, 2025

My thinking was that, for the time being, we could have the model server branches meet at the InferencePool level defined in inferencepool-resources.yaml, and switch out that app label to target whichever model server deployment the customer is using (vLLM, Triton, etc.).

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

Other options include having separate InferencePools or EPPs for the different model servers.
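A sketch of the two switch-points being described, not the actual manifest; the resource names, EPP image, and exact flag names are assumptions, and the metric values shown are the vLLM defaults:

```yaml
# Sketch of the two knobs in inferencepool-resources.yaml that change per model server.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  selector:
    app: triton-llama3-deployment          # 1. swap this label per model server
  targetPortNumber: 8000
  extensionRef:
    name: my-inference-pool-epp
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-inference-pool-epp
spec:
  selector:
    matchLabels:
      app: my-inference-pool-epp
  template:
    metadata:
      labels:
        app: my-inference-pool-epp
    spec:
      containers:
      - name: epp
        image: registry.k8s.io/gateway-api-inference-extension/epp:main
        args:
        - --poolName=my-inference-pool
        # 2. metric flags: these vLLM metric names would be replaced with the
        #    equivalent Triton/TensorRT-LLM metric names for a Triton pool.
        - --totalQueuedRequestsMetric=vllm:num_requests_waiting
        - --kvCacheUsagePercentageMetric=vllm:gpu_cache_usage_perc
```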

@kfswain
Collaborator

kfswain commented Apr 10, 2025

Yes, I agree there is no need to restrict to one InferencePool per namespace. It was just a brain dump, trying to think out loud.
Another point: maybe it would be easier if the InferencePool selects the InferenceModel (by labels) and not the other way around, where the InferenceModel specifies the pool ref. The current approach is also not so intuitive, and it uses the pool name, which forces selection by name instead of by labels.

Yeah, NP. That's an interesting idea, and would definitely work as long as we keep the InferencePool as the backend ref. The InferencePool would then have 2 selectors. But I could see that being an option.

@kfswain
Collaborator

kfswain commented Apr 10, 2025

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

These?

I was hoping that with Triton we would have found a way to merge the two. That was kind of the idea behind the Model Server Protocol. Is that an intractable problem due to Prometheus naming? Ideally we should get to a state where the EPP does not care about the underlying serving mechanism and can serve heterogeneous model servers.

@liu-cong
Contributor

Having a heterogeneous pool is cool, but we are not there yet. Currently there are configurations, such as the scheduler config, that are likely specific to the model server deployment. I would argue for not mixing different model servers in the user guides. I prefer to have different label selectors for vLLM and Triton.

BTW, https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/537/files adds the helm template for the Triton deployment. We can modify the charts to update the InferencePool label selector if Triton is specified (sketch below).

This PR is reasonable to me, if we also update the helm charts.
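One way that could look in the chart, as a hypothetical template snippet; the value keys and label names are made up and not taken from #537:

```yaml
# Hypothetical snippet of an InferencePool helm template: the label selector
# is driven by a modelServer.type value rather than hard-coded to vLLM.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: {{ .Release.Name }}
spec:
  selector:
    {{- if eq .Values.modelServer.type "triton" }}
    app: {{ .Release.Name }}-triton
    {{- else }}
    app: {{ .Release.Name }}-vllm
    {{- end }}
  targetPortNumber: {{ .Values.modelServer.port }}
  extensionRef:
    name: {{ .Release.Name }}-epp
```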

@ahg-g
Contributor

ahg-g commented Apr 10, 2025

Thanks all for the feedback; this is more about how we want to organize the examples in this repo.

We have three examples:

  • vLLM on GPU running Llama3
  • vLLM on CPU running Qwen
  • Triton on GPU running Llama3

IMO we should have all three examples in the repo. If we agree to that, then the question is how we want to organize the manifests to accommodate all three.

Option 1: make the inference-pool name generic (my-inference-pool) so that the provider-specific gateway manifests are the same across all examples. We also remove the inferencepool-resources.yaml file, and in the guide we rely on the helm chart to create the InferencePool and customize it to select the right model server deployment.

Option 2: continue to be specific in the InferencePool name, which means we need to create a separate set of provider-specific gateway manifests for each example.
Option 2.1: create a separate set of manifests explicitly.
Option 2.2: create a helm chart to deploy the gateway; this helm chart will have a provider flag and an inferencePool name flag (sketched below).

I am in favor of option 2.2
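To make option 2.2 concrete, the chart interface might be as small as this; a sketch only, where the value keys and provider names are assumptions rather than an existing chart:

```yaml
# Hypothetical values.yaml for a shared gateway chart under option 2.2.
# The two flags mentioned above map to two top-level values.
provider:
  name: istio                      # or gke, kgateway, ... picks provider-specific manifests
inferencePool:
  name: vllm-llama3-8b-instruct    # each example keeps its own explicit pool name
```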

@nirrozenbaum
Contributor

IMO this discussion hides an even more important question: what is the mental model of the GIE InferencePool?

If the mental model, and what we want people to interpret when they see the code/yamls, is that an InferencePool is per base model, then it makes sense to use the base model name in the resources.

On the other hand, if the pool can be used for multiple base models, it doesn't really make sense to use the base model name in the resource name, and it could be very confusing.

When I first read the GIE documentation I got the impression of the latter, but the manifests look like the former.

@ahg-g
Contributor

ahg-g commented Apr 11, 2025

As of now, the EPP implementation assumes an InferencePool serves a single base model; multi-tenancy comes from adapters. The API doesn't enforce that though. I think we should clarify that in the docs and align the examples with what is currently supported.
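Concretely, that assumption looks roughly like this with the v1alpha2 API as I understand it; the names are illustrative:

```yaml
# Sketch of the current assumption: one base model behind the pool, with tenants
# expressed as InferenceModels whose target models are LoRA adapters.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  modelName: tweet-summarizer
  criticality: Standard
  poolRef:
    name: my-inference-pool        # pool serving a single base model (e.g. llama3)
  targetModels:
  - name: tweet-summary-lora       # LoRA adapter loaded on that base model
    weight: 100
```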

@nirrozenbaum
Contributor

nirrozenbaum commented Apr 11, 2025

Yes. If we assume a single base model per pool, that should be documented clearly.
Option 1 from the ones you suggested is not aligned with that mental model.
I'm in favor of option 2.1: create a different set of manifests for each. It's a very common pattern to see an "examples/samples" dir with different manifests for each sample.
It should also be easier to e2e-test a new manifest this way, as the tests can take the path to the manifest file.

@ahg-g
Contributor

ahg-g commented Apr 11, 2025

My concern is keeping them in sync; this will lead to a fair amount of duplication.

@liu-cong
Contributor

I don't have a strong opinion, but I am leaning towards 2.2: maintain a simple, more static example in plain manifests, while moving more dynamic configurations to helm. If we agree on this simple principle, it can save a lot of discussions like this in the future.

@kfswain
Collaborator

kfswain commented Apr 11, 2025

I am in favor of option 2.2

++

If we can't make heterogeneous pools, I think it's actually more important to name the pools to reflect their mono-model-server nature, and doing so in a helm chart seems the most maintainable.

I think @nirrozenbaum brings up reasonable points but I was gonna say:

My concern is keeping them in sync; this will lead to a fair amount of duplication.

I worry this would set a toil-heavy pattern.

@BenjaminBraunDev
Contributor Author

The one other thing is that they will have to switch out the metric flags in the EPP deployment for different model servers, as Triton and vLLM use different metric names, so both of those have to change in inferencepool-resources.yaml.

These?

I was hoping that with Triton we would have found a way to merge the two. That was kind of the idea behind the Model Server Protocol. Is that an intractable problem due to Prometheus naming? Ideally we should get to a state where the EPP does not care about the underlying serving mechanism and can serve heterogeneous model servers.

This was something we discussed in depth in terms of what to expect from other model servers, but at the end of the day servers like Triton have hundreds of Prometheus metrics that all follow a consistent naming convention separate from vLLM's (the names Gateway defaults to). What we did do is add new metrics to the TensorRT-LLM backend that weren't there before, to avoid having to combine multiple metrics into single metrics on Gateway's side.

@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Apr 17, 2025
@danehans
Contributor

This PR updates the naming for the GKE gateway. For consistency, all presently supported gateways should be updated to match.

What we did do is add new metrics to the TensorRT-LLM backend that weren't there before, to avoid having to combine multiple metrics into single metrics on Gateway's side.

I see this PR added support for the required metrics defined by the model server protocol. However, I can't tell if Triton supports the LoRA adapter section of the model server protocol. I wouldn't think so due to:

Note the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic LoRA implementation.

If this is the case, how can Triton be included if it does not support the requirements stated in the model server protocol?

@danehans
Contributor

xref #710

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants