Refactor scheduler to run plugins #677


Merged
merged 4 commits into kubernetes-sigs:main
Apr 22, 2025

Conversation

@liu-cong (Contributor) commented Apr 10, 2025:

Inspired by the kube scheduler framework, this PR adds the following scheduler "plugins" that run in the following order:

  • PreSchedule: A list of plugins that runs at the beginning of each scheduling request. This is a no-op in the current scheduler. In the prefix-caching follow-up PR, I will use this to pre-calculate data such as request block hashes for the prefix scorer to consume later on.
  • Filter: A list that filters down the list of available pods. This is the same as the current scheduler filter interface.
  • Score: A list of scorers to run for each pod; the final score is calculated as a weighted sum. In the follow-up PR, I will use this to score pods based on prefix matching, queue depth, and kv-cache.
  • Picker: A single plugin that picks the final pod. Currently this randomly picks a pod. We will have a "topK" picker that picks pods with top scores.
  • PostSchedule: A list that runs after a scheduling decision is made (a targetPod is picked). Currently a no-op. In the follow-up prefix-caching PR, I will use this to update the cache lookup table.

This is a pure refactor.

This should be an incremental step towards making the scheduler more pluggable and even dynamically configurable.
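
For orientation, here is a condensed sketch of the plugin surface described above. The Filter, Scorer, and Picker signatures match the ones visible in the diff below; the PreSchedule and PostSchedule signatures are only inferred from this description and may differ from the actual code.

// Plugin is the common interface; every plugin reports a name (used, e.g., for metrics).
type Plugin interface {
	Name() string
}

// PreSchedule runs at the beginning of each scheduling request (signature assumed).
type PreSchedule interface {
	Plugin
	PreSchedule(ctx *Context)
}

// Filter narrows down the list of candidate pods.
type Filter interface {
	Plugin
	Filter(ctx *Context, pods []Pod) ([]Pod, error)
}

// Scorer assigns a score to a single pod; the final score is a weighted sum across scorers.
type Scorer interface {
	Plugin
	Score(ctx *Context, pod Pod) (float64, error)
}

// Picker selects the final pod(s) from the filtered, scored candidates.
type Picker interface {
	Plugin
	Pick(ctx *Context, pods []Pod) (*Result, error)
}

// PostSchedule runs after a scheduling decision (a targetPod) is made (signature assumed).
type PostSchedule interface {
	Plugin
	PostSchedule(ctx *Context, res *Result)
}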

Benchmark

I ran a benchmark to make sure there are no regressions. In the benchmark run, the refactored EPP performed slightly better, but this could just be variance across benchmark runs. I don't expect performance changes, as this is a pure refactor.

Baseline:
[benchmark screenshots omitted]

Refactor:
[benchmark screenshots omitted]

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 10, 2025
netlify bot commented Apr 10, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 83589eb
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67fd56715fad4300087208b4
😎 Deploy Preview: https://deploy-preview-677--gateway-api-inference-extension.netlify.app

@liu-cong (Contributor Author):

This is moved from scheduler.go.

@@ -257,23 +254,46 @@ func loRASoftAffinityFilterFunc(ctx *types.Context, pods []*types.PodMetrics) ([
return filtered_available, nil
}

var HasCapacityFilter = &BasicFilter{
@liu-cong (Contributor Author):

These are moved from scheduler.go. It's cleaner to group all filters here.

}

func (rp *RandomPicker) Pick(ctx *types.Context, pods []types.Pod) (*types.Result, error) {
ctx.Logger.V(logutil.DEBUG).Info(fmt.Sprintf("Selecting a random pod from %d candidates: %+v", len(pods), pods))
@liu-cong (Contributor Author):

This was moved from the existing "random picking" behavior in scheduler.go.

Contributor:

As noted in the other comment, this seems to belong in the scheduler; I would put it back there.

Contributor:

RandomPicker is a scheduler plugin, so IMO it makes sense to have it here.

return "DefaultPlugin"
}

func (p *defaultPlugin) Filter(ctx *types.Context, pods []types.Pod) ([]types.Pod, error) {
@liu-cong (Contributor Author):

The defaultPlugin implements the existing scheduler filter and the random-picking behavior. The other plugin interfaces are all no-ops.

@kaushikmitr (Contributor) commented Apr 11, 2025:

Can we incorporate scoring semantics into the current filtering logic? That way we can integrate it with future scoring plugins if needed. To clarify, the kube documentation says: "The scheduler will call each scoring plugin for each node. There will be a well defined range of integers representing the minimum and maximum scores. After the NormalizeScore phase, the scheduler will combine node scores from all plugins according to the configured plugin weights." I believe the current default filter does not assign any scores?

@liu-cong (Contributor Author):

Can we incorporate scoring semantics into the current filtering logic?

I am not sure I get what you mean; if I get it wrong, please clarify! We have two different plugins: Filter, which removes "ineligible" pods, and Scorer, which gives each pod (analogous to the node in the kube scheduler) a score. Finally, we have a Picker, which can pick pods based on the scores.

They have different purposes. That said, you can use some "score" mechanism in a particular filter implementation. However, that is an implementation detail and not part of the Filter interface, if that is what you are asking.

The current filters are more of "soft filters". The idea is to move them to the Scorer interface where applicable. For example, the minKvCache filter can easily be converted to a scorer that simply ranks pods based on KV-cache utilization.
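
As a rough illustration of that conversion, here is a minimal sketch of a KV-cache scorer written against the Scorer interface. The GetMetrics() accessor and the KVCacheUsagePercent field are assumptions about the pod metrics API, not names confirmed by this PR.

// kvCacheScorer replaces a hard KV-cache filter with a ranking: pods with lower
// KV-cache utilization score higher. Assumes utilization is a fraction in [0, 1].
type kvCacheScorer struct{}

func (s *kvCacheScorer) Name() string { return "kv-cache-utilization" }

func (s *kvCacheScorer) Score(ctx *types.Context, pod types.Pod) (float64, error) {
	// Assumed accessor and metric field; the real code may expose these differently.
	return 1 - pod.GetMetrics().KVCacheUsagePercent, nil
}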

}
return pm
}

// Result captures the scheduler result.
type Result struct {
@liu-cong (Contributor Author):

Creating a Result struct so it will be easier to extend in the future (such as adding a fallback).

Contributor:

I get the idea of trying to build this for future changes, but I think at this point it is more confusing than helpful.
For example, when looking at the Picker interface, and more specifically at the func declaration:

Pick(ctx *Context, pods []Pod) (*Result, error)

I was expecting to see that a Pod is selected, not a *Result.
Basically, it's more or less the same comment as for pre/post schedule: as long as it's not used, it's not needed.

@liu-cong (Contributor Author):

Result gives us flexibility to add more, such as a "reason this pod was picked", or the "fallback pod" use case. I wanted to do this now because, as we add more plugin interfaces and implementations, changing the return type will become harder. Also, from a readability perspective, returning a Result is pretty readable IMO.

Contributor:

If we decide to keep the result, I would at least rename it to SchedulingResult or something like that, as Result is a very general term.

@ahg-g (Contributor) left a comment:

This is a good start. I think we need to define a configuration API for those plugins, but that can come as a follow-up.

@@ -96,8 +96,7 @@ func (s *StreamingServer) HandleRequestBody(
endpoint := targetPod.Address + ":" + strconv.Itoa(int(pool.Spec.TargetPortNumber))

logger.V(logutil.DEFAULT).Info("Request handled",
"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod, "endpoint metrics",
fmt.Sprintf("%+v", target))
"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod)
Contributor:

Logging the metrics of the picked endpoint is useful here.

@liu-cong (Contributor Author):

My concern is that the metric list may grow, making this log very long. I am not sure how useful it is, because the decision is made across multiple pods; you would need to compare this pod to the other pods to determine whether the decision was good or not (if that is the intention of having this log).

@@ -14,91 +14,88 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

package scheduling
package plugins

import (
Contributor:

I was expecting to see a different structure.
I expected to see a filter dir (and package) under plugins,
then a file filter.go with the interface and some general definitions and types,
and then a different file for each filter.
It may end up with multiple files, but in terms of readability and maintainability I think it is much easier to maintain and understand.

@liu-cong (Contributor Author):

then a file filter.go with the interface and some general definitions and types.

I prefer defining all the plugin interfaces in one place (currently in interfaces.go). Open to feedback but I feel this is the most discoverable approach.

then for each filter I expect to see a different file.

I like this, and I think this is what it will look like eventually. However, I'd like to defer this to minimize the changes.
Currently we have plugins/filter.go, which is a pretty small file, and we may deprecate some of these filters in favor of scorers.

Contributor:

However I'd like to defer this to minimize the changes.

+1, this PR is already large enough.

SchedulerPluginProcessingLatencies = compbasemetrics.NewHistogramVec(
&compbasemetrics.HistogramOpts{
Subsystem: EPPComponent,
Name: "scheduler_plugin_duration_seconds",
Contributor:

that's great!

@@ -107,12 +104,12 @@ func (f *decisionTreeFilter) Filter(ctx *types.Context, pods []*types.PodMetrics
}

// filterFunc filters a set of input pods to a subset.
type filterFunc func(ctx *types.Context, pods []*types.PodMetrics) ([]*types.PodMetrics, error)
type filterFunc func(ctx *types.Context, pods []types.Pod) ([]types.Pod, error)
Contributor:

I noticed that the error is not used anywhere other than the following two places:

  • DropRequestFilter.
  • The toFilterFunc function, and only in case no pods are left after the filter is applied.

I think this is a wrong usage of the filter terminology, and returning an error here may be very confusing for the reader.

A filter, by definition, gets a set of pods, applies some conditional check to each, and returns the subset of pods that passed the check. DropRequest is not a filter by this definition.
toFilterFunc can return an empty slice of pods, and the return value should be checked in the caller (instead of checking for err, we should check if len(pods) == 0).

With that said, the filter decision tree becomes redundant, because there are no errors when applying filters; therefore, there should be no "NextOnFailure". The decision tree can become just a chain of filters, without an error in the return value and with the addition of a check on the length of the returned slice.
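
A minimal sketch of such an error-free filter chain, assuming the Filter signature is changed to drop the error (the FilterChain type and its Run method are illustrative, not part of this PR):

// Filter narrows a candidate list; it never fails, it only narrows.
type Filter interface {
	Name() string
	Filter(ctx *types.Context, pods []types.Pod) []types.Pod
}

// FilterChain applies filters in order and stops early once no pods remain.
type FilterChain []Filter

func (c FilterChain) Run(ctx *types.Context, pods []types.Pod) []types.Pod {
	for _, f := range c {
		pods = f.Filter(ctx, pods)
		if len(pods) == 0 {
			// Not an error: the chain simply produced no candidates;
			// the caller checks the length of the returned slice.
			return nil
		}
	}
	return pods
}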

Contributor:

Applying a filter that ends up with 0 pods is not an error :)
The filter succeeded.
The result may not be exactly what the caller wanted, but there is no error in the filter.

@liu-cong (Contributor Author):

I get your point. Again, I would like to defer this to minimize the changes in this PR. Would you like to do a follow-up PR, or just open an issue for this?

Contributor:

Let's do both. :)
Please create an issue with a pointer to this comment so we won't forget, and I'm happy to do a follow-up PR.

Contributor:

I checked out your code and played with the idea of removing the errors. It was quite straightforward, with no issues. I can create a PR against your branch if you'd like.

// NoopPlugin provides a default, no-operation implementation of the Plugin interface.
// It can be embedded in other plugin implementations to avoid boilerplate code for
// unused methods.
type NoopPlugin struct{}
Contributor:

Is this used somewhere?
I didn't find any usage.

In general, this doesn't seem to belong here. It is probably good for testing purposes and belongs in one of the test files (internally in the test setup).

// Picker picks the final pod(s) to send the request to.
type Picker interface {
Plugin
Pick(ctx *Context, pods []Pod) (*Result, error)
Contributor:

Why does Pick return an error?
There is no error I can think of in picking an entry from a slice.

Contributor:

@liu-cong does your KV cache code have a use case for returning an error?

Comment on lines +169 to +175
for _, pod := range pods {
	score, err := runScorersForPod(ctx, s.scorers, pod)
	if err != nil {
		return err
	}
	pod.SetScore(score)
}
@nirrozenbaum (Contributor) commented Apr 15, 2025:

This logic is missing the score normalization phase and the scorer weight.
Normalization: not all scorers necessarily use the same score range. We need to normalize the scores of all scorers to the same range and use some formula for calculating a weighted score, e.g., weightedScore = w1*s1 + w2*s2 + ..., where w1 is the weight of scorer1 and s1 is the NORMALIZED score of scorer1.
Weight: not all scorers necessarily have the same weight. We might want to define different weights for different scorers.
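
A minimal sketch of that calculation, assuming min-max normalization to [0, 1] and a per-scorer weight (the scorerWithWeight type and the assumed raw score range are illustrative only):

// scorerWithWeight pairs a Scorer with the weight its normalized score carries.
type scorerWithWeight struct {
	scorer types.Scorer
	weight float64
}

// weightedScore normalizes each scorer's raw score to [0, 1] and sums the
// weighted results: weightedScore = w1*s1 + w2*s2 + ...
func weightedScore(ctx *types.Context, scorers []scorerWithWeight, pod types.Pod) (float64, error) {
	total := 0.0
	for _, sw := range scorers {
		raw, err := sw.scorer.Score(ctx, pod)
		if err != nil {
			return 0, err
		}
		total += sw.weight * normalize(raw, 0, 100) // assumed raw range of 0–100 for illustration
	}
	return total, nil
}

// normalize maps raw from [min, max] to [0, 1].
func normalize(raw, min, max float64) float64 {
	if max == min {
		return 0
	}
	return (raw - min) / (max - min)
}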

postSchedulePlugins []types.PostSchedule
filters []types.Filter
scorers []types.Scorer
picker types.Picker
Contributor:

I'm reading this again, and the Picker interface seems redundant to me.
The scheduler itself IS the picker.
The scheduler runs filters to remove pods, scorers to score the filtered pods, and then picks a pod from the resulting list based on each pod's calculated score.
Maybe it would make sense to introduce something like a SchedulerStrategy. I can imagine two possible strategies (see the sketch below):

  • pick the highest score, fall back to the second, etc.
  • pick randomly, using the scores as probabilities.

In the current code the only option is random, so I wouldn't do the above in this PR, but only when (and if) needed.
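
A minimal sketch of the second strategy (random selection weighted by score). It assumes non-negative scores and a Score() accessor on the Pod interface; only SetScore is visible in this PR, so these names are illustrative. rand comes from math/rand.

// pickWeightedRandom selects a pod at random, with probability proportional to its score.
func pickWeightedRandom(pods []types.Pod) types.Pod {
	total := 0.0
	for _, p := range pods {
		total += p.Score() // assumed accessor for the accumulated score
	}
	if total == 0 {
		return pods[rand.Intn(len(pods))] // no score signal; fall back to uniform random
	}
	r := rand.Float64() * total
	for _, p := range pods {
		r -= p.Score()
		if r <= 0 {
			return p
		}
	}
	return pods[len(pods)-1] // guard against floating-point rounding
}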

Comment on lines +153 to +155
before := time.Now()
filteredPods, err := filter.Filter(ctx, pods)
metrics.RecordSchedulerPluginProcessingLatency(types.FilterPluginType, filter.Name(), time.Since(before))
Contributor:

This is not a measurement of each filter. Since the filter is defined as a decision tree, these lines measure the decision tree's time, i.e., all filters together.
This is another reason for transitioning to a filter chain rather than a decision tree (once the error is removed from the filter, as noted in the other comment): we should be able to see how much time each filter took.

}

// Iterate through each scorer in the chain and accumulate the scores.
func runScorersForPod(ctx *types.Context, scorers []types.Scorer, pod types.Pod) (float64, error) {
Contributor:

This should be one of the Scheduler's methods, e.g., func (s *Scheduler) runScorersForPod, and then we wouldn't need to pass the scorers as an argument.

// Scorer defines the interface for scoring pods based on context.
type Scorer interface {
Plugin
Score(ctx *Context, pod Pod) (float64, error)
Contributor:

Same as the comments on the filter error and the picker error: I don't think a scorer should ever return an error.
If there is no real reason for using an error in the returned values, we should remove it.
Adding error return values to all interfaces makes the code harder to follow.

Contributor:

I don't think scorer should ever return an error. If there is not real reason for using error in returned values we should remove those.

@liu-cong does your KV cache code have a use case for returning an error for the Score() method?

*backendmetrics.Pod
*backendmetrics.Metrics
}

func NewContext(ctx context.Context, req *LLMRequest, pods []*PodMetrics) *Context {
func NewContext(ctx context.Context, req *LLMRequest, pods []Pod) *Context {
@nirrozenbaum (Contributor) commented Apr 15, 2025:

Consider renaming this to SchedulerContext or something like that.
When first seeing Context it was confusing (I was expecting the Go context.Context).


func (rp *RandomPicker) Pick(ctx *types.Context, pods []types.Pod) (*types.Result, error) {
ctx.Logger.V(logutil.DEBUG).Info(fmt.Sprintf("Selecting a random pod from %d candidates: %+v", len(pods), pods))
i := rand.Intn(len(pods))
Contributor:

This is generic random logic without any use of the scores; it misses the whole point of scoring pods.

Contributor:

@liu-cong correct me if I'm wrong, but this PR's intent is to refactor for the scheduler plugin architecture. Can you confirm that a ScorerPicker will be introduced to pick a Pod based on the score set by runScorerPlugins() in a follow-on PR?

@nirrozenbaum (Contributor):

/assign

@danehans (Contributor) left a comment:

Other than a few nits that can be resolved in a follow-on PR, LGTM.


func ToSchedulerPodMetrics(pods []backendmetrics.PodMetrics) []*PodMetrics {
pm := make([]*PodMetrics, 0, len(pods))
func ToSchedulerPodMetrics(pods []backendmetrics.PodMetrics) []Pod {
pm := make([]Pod, 0, len(pods))
Contributor:

Since the return type is changed, consider renaming the var from pm to p or something similar to the return type.


@danehans (Contributor):

@liu-cong when you have a moment, can you resolve or respond to @nirrozenbaum's feedback?

@kfswain (Collaborator) commented Apr 21, 2025:

This PR seems to be in a workable state for further iterations.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 21, 2025
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2025
@ahg-g (Contributor) commented Apr 21, 2025:

/hold

I would like to ensure we don't have a performance regression; holding to have a quick discussion on this.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2025
@ahg-g (Contributor) commented Apr 22, 2025:

I ran the benchmark again; I think the regression is not related to the refactor PR, it is from an earlier change. I will try to find exactly which PR caused the regression, but I think we can move forward with either Nir's slimmed-down PR or this PR.

@kfswain (Collaborator) commented Apr 22, 2025:

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2025
@k8s-ci-robot k8s-ci-robot merged commit 45209f6 into kubernetes-sigs:main Apr 22, 2025
8 checks passed