# Add prefix aware routing proposal #602
1. **Prefix affinity consistent hashing**

   This goes a step beyond session affinity by using a prefix-aware hash function to route requests with similar prefixes to the same or similar servers. A naive hash function can simply take the hash of the first N characters/tokens of the request, so that all requests sharing those first N characters/tokens are routed to the same server. The [vLLM production stack](https://github.com/vllm-project/production-stack/issues/59) is exploring this strategy using simhash, and preliminary experiments showed mixed results. KubeAI uses a simpler strategy of hashing only the request prefix, up to a configurable `prefixCharLength`. Its effectiveness is likely highly dependent on the input length distribution. (A sketch of the naive strategy follows below.)
> Is that a moving window of up to `prefixCharLength`, or does it always have exactly `prefixCharLength` characters?
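To make the naive strategy concrete, here is a minimal Go sketch of routing by a hash of the request prefix. The windowing behavior (hash up to `prefixCharLength` characters, with shorter prompts hashed in full), the FNV hash, and the modulo placement are all illustrative assumptions for this sketch, not KubeAI's or vLLM's actual implementation; indeed, the windowing behavior is exactly what the comment above asks about.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickServer routes a request by hashing only the leading characters of
// the prompt, so requests sharing that prefix land on the same server.
// Assumption (not confirmed by this proposal): we hash up to
// prefixCharLength characters, and shorter prompts are hashed in full.
func pickServer(prompt string, prefixCharLength, numServers int) int {
	n := prefixCharLength
	if len(prompt) < n {
		n = len(prompt)
	}
	h := fnv.New32a()
	h.Write([]byte(prompt[:n]))
	// Naive placement: modulo over the server count. A consistent-hash
	// ring would avoid remapping most prefixes when servers are added
	// or removed.
	return int(h.Sum32() % uint32(numServers))
}

func main() {
	// The first two prompts share a 24-character prefix, so they route
	// to the same server; the third prompt may land elsewhere.
	fmt.Println(pickServer("You are a helpful assistant. Summarize this text.", 24, 4))
	fmt.Println(pickServer("You are a helpful assistant. Translate this text.", 24, 4))
	fmt.Println(pickServer("Write a poem about consistent hashing.", 24, 4))
}
```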
   Pros:

   * Easy to explain (compared to hashing) and likely more effective than the hashing strategy.
> You mean "than consistent hashing strategy"?
1. Prefix affinity needs to be aware of server load, otherwise we will create hot spots. We can use queue length and KV cache utilization to gauge server load. This is similar to the [queue depth threshold](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/2a615e981228aa6ffc2a89219c986ac863dde776/pkg/epp/scheduling/scheduler.go#L40) used for LoRA affinity. (See the sketch after this item.)
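The following Go sketch, reusing `pickServer` from the earlier sketch, shows one way to express this load-aware fallback. The `queueDepths` slice and `queueDepthThreshold` parameter are hypothetical inputs invented for illustration, not the project's actual API; the real scheduler applies an analogous queue depth threshold for LoRA affinity (linked above), and a fuller version would also weigh KV cache utilization.

```go
// pickServerLoadAware honors prefix affinity only while the preferred
// server's queue is below a threshold, falling back to the least-loaded
// server otherwise so that popular prefixes do not create hot spots.
// queueDepths[i] is the pending request count for server i; both it and
// queueDepthThreshold are hypothetical inputs for this sketch.
func pickServerLoadAware(prompt string, prefixCharLength int, queueDepths []int, queueDepthThreshold int) int {
	preferred := pickServer(prompt, prefixCharLength, len(queueDepths))
	if queueDepths[preferred] <= queueDepthThreshold {
		return preferred // affinity wins while the preferred server is not overloaded
	}
	// Hot spot: ignore affinity and send the request to the least-loaded server.
	least := 0
	for i, depth := range queueDepths {
		if depth < queueDepths[least] {
			least = i
		}
	}
	return least
}
```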
## Proposal
> +1 to start with this approach since it seems relatively simple to implement, but in theory it should also be more resilient than the other two options.

> Cong's PoC is in main...liu-cong:llm-instance-gateway:prefix-poc (or at least, a version of it is) for those interested.
## Design Options
1. **Session affinity** |
> Consider switching the options to Header 3; it's easy for the options to blend together as is, and since they have sub-bullets, using `1.` resets the count and they all have the value of 1.
>
> Suggested change: replace `1. **Session affinity**` with `### **Session affinity**`.
This proposal was initially discussed in #498