kubernetes-sigs · k8s-ci-robot · Apr 23, 2025 · Apr 9, 2025 · Apr 9, 2025 · Apr 9, 2025
diff --git a/api/v1alpha2/inferencemodel_types.go b/api/v1alpha2/inferencemodel_types.go
@@ -126,7 +126,7 @@ type PoolObjectReference struct {
 }
 
 // Criticality defines how important it is to serve the model compared to other models.
-// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default.
+// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer), and set no default.
 // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 // +kubebuilder:validation:Enum=Critical;Standard;Sheddable
 type Criticality string

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -63,6 +63,7 @@ nav:
       - Getting started: guides/index.md
       - Adapter Rollout: guides/adapter-rollout.md
       - Metrics: guides/metrics.md
+      - Replacing an Inference Pool: guides/replacing-inference-pool.md
     - Implementer's Guide: guides/implementers.md
   - Performance:
     - Benchmark: performance/benchmark/index.md

diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md
@@ -7,28 +7,56 @@
 
 ## Background
 
-The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin. 
+The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins.
 
-It is expected for the InferencePool to:
+An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) extension. This extension tracks key metrics on each model server (e.g. KV-cache utilization, queue length of pending requests, active LoRA adapters) and routes incoming inference requests to the optimal model server replica based on these metrics. An EPP can only be associated with a single InferencePool, which is specified by the [poolName](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L54) and [poolNamespace](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L56) flags. An HTTPRoute can have multiple backendRefs that reference the same InferencePool, in which case they all route to the same EPP. An HTTPRoute can also have multiple backendRefs that reference different InferencePools, in which case they route to different EPPs.
 
- - Enforce fair consumption of resources across competing workloads
- - Efficiently route requests across shared compute (as displayed by the PoC)
-
-It is _not_ expected for the InferencePool to:
+Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests.
 
- - Enforce any common set of adapters or base models are available on the Pods
- - Manage Deployments of Pods within the Pool
- - Manage Pod lifecycle of pods within the pool 
+## How to Configure an InferencePool
 
-Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
+The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
 
-`InferencePool` has some small overlap with `Service`, displayed here:
+In summary, the InferencePoolSpec consists of 3 major parts:
+
+- The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. 
+- The `targetPortNumber` field defines the port number that the Inference Gateway should route to on model server Pods that belong to this pool. 
+- The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions.
+
+### Example Configuration
+
+Here is an example InferencePool configuration:
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferencePool
+metadata:
+  name: vllm-llama3-8b-instruct
+spec:
+  targetPortNumber: 8000
+  selector:
+    app: vllm-llama3-8b-instruct
+  extensionRef:
+    name: vllm-llama3-8b-instruct-epp
+    port: 9002
+    failureMode: FailClose
+```
+
+In this example: 
+
+- An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace.
+- It will select Pods that have the label `app: vllm-llama3-8b-instruct`.
+- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped due to "FailClose" being configured as the `failureMode`.
+- Traffic routed to this InferencePool will be forwarded to port `8000` on the selected Pods.
+
+## Overlap with Service
+
+**InferencePool** has some small overlap with **Service**, displayed here:
 
 <!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
 <img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />
 
-The InferencePool is _not_ intended to be a mask of the Service object, simply exposing the absolute bare minimum required to allow the Platform Admin to focus less on networking, and more on Pool management. 
-
-## Spec
+The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details.
 
-The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
+## Replacing an InferencePool
+Please refer to the [Replacing an InferencePool](/guides/replacing-inference-pool) guide for details on use cases and how to replace an InferencePool.
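
Note on `site-src/api-types/inferencepool.md`: the `selector` and `targetPortNumber` fields documented in this file only take effect if the model server Pods carry the matching label and listen on that port. A minimal sketch of a Deployment that would pair with the example InferencePool might look like the following. The Deployment name, replica count, container name, and image are hypothetical placeholders, not values from this repository; only the Pod label and the port come from the example.

```yaml
# Hypothetical model server Deployment; only the Pod label and the port
# are taken from the InferencePool example in inferencepool.md.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
spec:
  replicas: 3  # assumption: any replica count works
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct  # must match the InferencePool `selector`
    spec:
      containers:
      - name: model-server  # hypothetical container name
        image: example.com/model-server:latest  # placeholder image
        ports:
        - containerPort: 8000  # server is assumed to listen on `targetPortNumber`
```

If the Pod label differs from the InferencePool `selector` in any way, the Pods would simply not be selected into the pool.
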
diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md
@@ -0,0 +1,59 @@
+# Replacing an InferencePool
+
+## Background
+
+Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary. 
+
+## Use Cases
+Use Cases for Replacing an InferencePool:
+
+- Upgrading or replacing your model server framework
+- Upgrading or replacing your base model
+- Transitioning to new hardware
+
+## How to replace an InferencePool
+
+To replace an InferencePool:
+
+1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose.
+1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
+1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
+1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.
+
+### Example
+
+You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and new `llm-pool-v2`.
+
+1. Save the following sample manifest as `httproute.yaml`:
+
+    ```yaml
+    apiVersion: gateway.networking.k8s.io/v1
+    kind: HTTPRoute
+    metadata:
+      name: llm-route
+    spec:
+      parentRefs:
+      - group: gateway.networking.k8s.io
+        kind: Gateway
+        name: inference-gateway
+      rules:
+      - backendRefs:
+        - group: inference.networking.x-k8s.io
+          kind: InferencePool
+          name: llm-pool-v1
+          weight: 90
+        - group: inference.networking.x-k8s.io
+          kind: InferencePool
+          name: llm-pool-v2
+          weight: 10
+    ```
+
+1. Apply the sample manifest to your cluster:
+
+    ```
+    kubectl apply -f httproute.yaml
+    ```
+
+    With these weights, the original `llm-pool-v1` InferencePool receives about 90% of the traffic, while the `llm-pool-v2` InferencePool receives the remaining 10%.
+
+1. Gradually increase the traffic weight for the `llm-pool-v2` InferencePool, and decrease it for `llm-pool-v1`, until the rollout to the new InferencePool is complete.
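
Note on the last step of `site-src/guides/replacing-inference-pool.md`: a minimal sketch of what a late-stage `httproute.yaml` could look like, assuming the rollout ends with all traffic on `llm-pool-v2`. In Gateway API, each backend's share is its weight divided by the sum of the weights in the rule, and a weight of 0 sends no traffic to that backend. The intermediate weights you pass through (for example 50/50) are your choice and are not prescribed by the guide.

```yaml
# Sketch of the final rollout stage; resource names are reused from the
# guide's example, the 0/100 split is an assumption about the end state.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v1
      weight: 0    # no traffic; kept so a rollback is a one-line change
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v2
      weight: 100  # all traffic now goes to the new pool
```

Keeping `llm-pool-v1` listed at weight 0, rather than deleting the entry, is one way to preserve the rollback path the guide recommends: restoring the old weights and re-applying the manifest shifts traffic back.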