Add benchmark automation tool #563

12 changes: 0 additions & 12 deletions config/manifests/benchmark/model-server-service.yaml

This file was deleted.

32 changes: 17 additions & 15 deletions site-src/performance/benchmark/index.md
@@ -5,30 +5,26 @@ inference extension, and a Kubernetes service as the load balancing strategy. Th
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.

## Prerequisites
## Run benchmarks manually

### Deploy the inference extension and sample model server
### Prerequisite: have an endpoint ready to serve inference traffic

Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application, and the inference extension.
To serve via a Gateway using the inference extension, follow this [user guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/)
to deploy the sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there are a sufficient number of replicas to make optimal routing decisions.
You are more likely to see the benefits of the inference extension when there are a sufficient number of replicas to make optimal routing decisions, so consider scaling the sample application to more replicas:

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```

### Expose the model server via a k8s service

As the baseline, let's also expose the vLLM deployment as a k8s service:
To serve via a Kubernetes LoadBalancer service as a baseline comparison, you can expose the sample application:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
```

## Run benchmark
### Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.

@@ -60,18 +56,24 @@ to specify what this benchmark is for. For instance, `inference-extension` or `k
the script below will watch for that log line and then start downloading results.

```bash
benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
```

1. After the script finishes, you should see benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. A quick way to inspect them is shown below.
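For a quick sanity check of the downloaded results, here is a minimal sketch (assuming `jq` is installed; the exact file names inside the folder depend on your run):

```bash
# List the downloaded result files (path from the step above).
ls ./tools/benchmark/output/default-run/my-benchmark/results/json

# Pretty-print the JSON result files; the wildcard is illustrative, since the
# actual file names depend on the benchmark run.
jq . ./tools/benchmark/output/default-run/my-benchmark/results/json/*.json | head -50
```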

### Tips
#### Tips

* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
* You can set the `run_id="runX"` environment variable when running the `download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get more statistically meaningful results and to group the results accordingly. See the example after this list.
* Update the `request_rates` to best suit your benchmark environment.
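As an example of the `run_id` tip above, a repeated run might be downloaded and grouped like this (both values are placeholders):

```bash
# `run_id` groups repeated runs of the same benchmark; `benchmark_id` labels this run.
run_id='run2' benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
```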

### Advanced Benchmark Configurations
## Run benchmarks automatically

The [benchmark automation tool](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark) enables defining benchmarks via a config file and running them
automatically. It is currently experimental. To try it, refer to its [user guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark).
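As a minimal sketch of the automated flow (assuming you have cloned the repository and satisfied the prerequisites listed in the tool's README):

```bash
# From the repository root, run every benchmark defined in the tool's catalog.
cd tools/benchmark
./scripts/run_all_benchmarks.bash
```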


## Advanced Benchmark Configurations

Please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.

1 change: 1 addition & 0 deletions tools/benchmark/.gitignore
@@ -0,0 +1 @@
output/
195 changes: 194 additions & 1 deletion tools/benchmark/README.md
@@ -1 +1,194 @@
This folder contains resources to run performance benchmarks. Please follow the benchmark guide at https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark.

## Features

1. **Config-driven benchmarks**. Use the `./proto/benchmark.proto` API to write benchmark configurations, without the need to craft complex YAML manifests.
2. **Reproducibility**. The tool will snapshot all the manifests needed for the benchmark run and mark them immutable (unless the user explicitly overrides it).
3. **Benchmark inheritance**. Extend an existing benchmark configuration by overriding a subset of parameters, instead of re-writing everything from scratch.
4. **Benchmark orchestration**. The tool automatically deploys the benchmark environment into a cluster, waits to collect results, and then tears down the environment. It deploys the benchmark resources in new namespaces so each benchmark runs independently.
5. **Auto-generated request rates**. The tool can automatically generate request rates for known models and accelerators to cover a wide range of model server load, from low latency to fully saturated throughput.
6. **Visualization tools**. The results can be analyzed with a Jupyter notebook.
7. **Model server metrics**. The tool uses the Latency Profile Generator benchmark tool to scrape metrics from Google Cloud Monitoring. It also provides a link to a Google Cloud Monitoring dashboard for detailed analysis.

### Future Improvements

1. The benchmark config and results are stored in protobuf format. The results can be persisted in a database such as Google Cloud Spanner to allow complex query and dashboarding use cases.
2. Support running benchmarks in parallel with user configured parallelism.

## Prerequisites

1. [Install helm](https://helm.sh/docs/intro/quickstart/#install-helm)
2. Install InferenceModel and InferencePool [CRDs](https://gateway-api-inference-extension.sigs.k8s.io/guides/#install-the-inference-extension-crds)
3. [Enable Envoy patch policy](https://gateway-api-inference-extension.sigs.k8s.io/guides/#update-envoy-gateway-config-to-enable-patch-policy).
4. Install [RBACs](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/12bcc9a85dad828b146758ad34a69053dca44fa9/config/manifests/inferencepool.yaml#L78) for EPP to read pods.
5. Create a secret in the default namespace containing the HuggingFace token.

```bash
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
```

6. [Optional, GCP only] Create a `gmp-test-sa` service account with the `roles/monitoring.viewer` role to read additional model server metrics from Cloud Monitoring.

```bash
gcloud iam service-accounts create gmp-test-sa && \
gcloud projects add-iam-policy-binding ${BENCHMARK_PROJECT} \
  --member=serviceAccount:gmp-test-sa@${BENCHMARK_PROJECT}.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer
```

## Get started

Run all existing benchmarks:

```bash
# Run all benchmarks in the ./catalog/benchmark folder
./scripts/run_all_benchmarks.bash
```

View the benchmark results:

* To view raw results, watch for a new results folder to be created at `./output/{run_id}/`.
* To visualize the results, use the Jupyter notebook (one way to open it is sketched after this list).
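A minimal sketch for opening the notebook locally (assuming Python and pip are available; the notebook file is not named here, so open whichever notebook this folder provides):

```bash
# Install Jupyter and start it from the tool folder, then open the provided
# analysis notebook from the browser UI.
pip install notebook
jupyter notebook
```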

## Common usage

### Run all benchmarks in a particular benchmark config file and upload results to GCS

```bash
gcs_bucket='my-bucket' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

### Generate benchmark manifests only

```bash
# All available environment variables.
benchmarks=benchmarks ./scripts/generate_manifests.bash
```

### Run particular benchmarks in a benchmark config file, by matching a benchmark name regex

```bash
# Run all benchmarks with Nvidia H100
gcs_bucket='my-bucket' benchmarks=benchmarks benchmark_name_regex='.*h100.*' ./scripts/run_benchmarks_file.bash
```

### Resume a benchmark run from an existing run_id

You may resume benchmarks from previously generated manifests. The tool will skip benchmarks that already have a `results` folder and continue those without results.

```bash
run_id='existing-run-id' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```
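Before resuming, you can check which benchmarks already have results (the run_id value is a placeholder; paths follow the layout described under "How does it work?"):

```bash
# Benchmarks that already contain a results folder will be skipped on resume.
find ./output/existing-run-id -type d -name results
```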

### Keep the benchmark environment after a benchmark completes (for debugging)

```bash
# All available environment variables.
skip_tear_down='true' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

## Command references

```bash
# All available environment variables
regex='my-benchmark-file-name-regex' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_all_benchmarks.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmarks=benchmarks benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_benchmarks_file.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' benchmarks=benchmarks ./scripts/generate_manifests.bash
```

## How does it work?

The tool automates the following steps:

1. Reads the benchmark config file in `./catalog/{benchmarks_config_file}`. The file contains a list of benchmarks. The config API is defined in `./proto/benchmark.proto`.
2. Generates a new run_id and namespace `{benchmark_name}-{run_id}` to run the benchmarks. If the `run_id` environment variable is provided, the tool reuses it instead of creating a new one. This is useful when resuming a previous benchmark run, or when running multiple sets of benchmarks in parallel (e.g., running benchmarks on different accelerator types in parallel using the same run_id).
3. Based on the config, generates manifests in `./output/{run_id}/{benchmark_name}-{run_id}/manifests`.
4. Applies the manifests to the cluster and waits for resources to be ready.
5. Once the benchmark finishes, downloads benchmark results to `./output/{run_id}/{benchmark}-{run_id}/results` (see the directory sketch after this list).
6. [Optional] If a GCS bucket is specified, uploads the output folder to a GCS bucket.
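Putting the paths from steps 3 and 5 together, here is a sketch of the on-disk layout for a single benchmark (the run_id and benchmark name below are placeholders):

```bash
# Placeholder values for illustration only.
run_id='demo-run'
benchmark='base-benchmark'

# Step 3: generated manifests land here.
ls ./output/${run_id}/${benchmark}-${run_id}/manifests

# Step 5: downloaded results land here once the benchmark completes.
ls ./output/${run_id}/${benchmark}-${run_id}/results
```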

## Create a new benchmark

You can either add new benchmarks to an existing benchmark config file, or create new benchmark config files. Each benchmark config file contains a list of benchmarks.

An example benchmark with all available parameters is as follows:

```
benchmarks {
  name: "base-benchmark"
  config {
    model_server {
      image: "vllm/vllm-openai@sha256:8672d9356d4f4474695fd69ef56531d9e482517da3b31feb9c975689332a4fb0"
      accelerator: "nvidia-h100-80gb"
      replicas: 1
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
      }
    }
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.1.0"
          }
        }
      }
    }
    benchmark_tool {
      image: "us-docker.pkg.dev/gke-inference-gateway-dev/benchmark/benchmark-tool@sha256:1fe4991ec1e9379b261a62631e1321b8ea15772a6d9a74357932771cea7b0500"
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "60"
        output_length: "1024"
      }
    }
  }
}
```
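To run a newly created config file, save it under `./catalog/` and pass its name via the `benchmarks` variable. The file name below is hypothetical, and whether the `benchmarks` value includes the file extension is an assumption; mirror how the existing catalog files are referenced.

```bash
# Hypothetical config file ./catalog/my-benchmarks.pbtxt; the exact naming
# convention expected by the scripts is an assumption.
benchmarks=my-benchmarks ./scripts/run_benchmarks_file.bash
```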

### Create a benchmark from a base benchmark

It's recommended to create a benchmark from an existing benchmark by overriding a few parameters. This inheritance feature makes it convenient to create a large number of benchmarks. Below is an example that overrides the replica count of a base benchmark:

```
benchmarks {
  name: "new-benchmark"
  base_benchmark_name: "base-benchmark"
  config {
    model_server {
      replicas: 2
    }
  }
}
```

## Environment configurations

The tool has default configurations (such as the cluster name) in `./scripts/env.sh`. You can tweak those for your own needs.

## The benchmark.proto

The `./proto/benchmark.proto` file is the core of this tool; it drives the generation of the benchmark manifests, as well as the query and dashboarding of the results.

Why do we need it?

* An API to clearly capture the intent, instead of making various assumptions.
* It lets the user focus only on the core parameters of the benchmark itself, rather than on the toil of configuring the environment and crafting the manifests.
* It is the single source of truth that drives the entire lifecycle of the benchmark, including post analysis.

## Contribute

Refer to the [dev guide](./dev.md).
66 changes: 66 additions & 0 deletions tools/benchmark/catalog/base-model.pbtxt
@@ -0,0 +1,66 @@

# proto file: proto/benchmark.proto
# proto message: Benchmarks

benchmarks {
  name: "r8-svc-vllmv1"
  config {
    model_server {
      image: "vllm/vllm-openai:v0.8.1"
      accelerator: "nvidia-h100-80gb"
      replicas: 8
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
        v1: "1"
      }
    }
    load_balancer {
      k8s_service {}
    }
    benchmark_tool {
      # The following image was built from this source https://github.com/AI-Hypercomputer/inference-benchmark/tree/07628c9fe01b748f5a4cc9e5c2ee4234aaf47699
      image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        tokenizer: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "100"
        output_length: "2048"
      }
    }
  }
}

benchmarks {
  name: "r8-epp-vllmv1"
  base_benchmark_name: "r8-svc-vllmv1"
  config {
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main"
            refresh_metrics_interval: "50ms"
          }
        }
      }
      full_duplex_streaming_enabled: true
    }
  }
}

benchmarks {
  name: "r8-epp-no-streaming-vllmv1"
  base_benchmark_name: "r8-epp-vllmv1"
  config {
    load_balancer {
      gateway {
        full_duplex_streaming_enabled: false
      }
    }
  }
}
23 changes: 23 additions & 0 deletions tools/benchmark/catalog/charts/BenchmarkTool/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions tools/benchmark/catalog/charts/BenchmarkTool/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: BenchmarkTool
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"