Add benchmark automation tool #563

12 changes: 0 additions & 12 deletions config/manifests/benchmark/model-server-service.yaml

This file was deleted.

32 changes: 17 additions & 15 deletions site-src/performance/benchmark/index.md
@@ -5,30 +5,26 @@ inference extension, and a Kubernetes service as the load balancing strategy. Th
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.

## Prerequisites
## Run benchmarks manually

### Deploy the inference extension and sample model server
### Prerequisite: have an endpoint ready to serve inference traffic

Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application, and the inference extension.
To serve via a Gateway using the inference extension, follow this [user guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/)
to deploy the sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there are a sufficient number of replicas to make optimal routing decisions.
You are more likely to see the benefits of the inference extension when there are a sufficient number of replicas to make optimal routing decisions, so consider scaling the sample application to more replicas:

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```

### Expose the model server via a k8s service

As the baseline, let's also expose the vLLM deployment as a k8s service:
To serve via a Kubernetes LoadBalancer service as a baseline comparison, you can expose the sample application:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
```

## Run benchmark
### Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.

@@ -60,18 +56,24 @@ to specify what this benchmark is for. For instance, `inference-extension` or `k
the script below will watch for that log line and then start downloading results.

```bash
benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
```

1. After the script finishes, you should see benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. A quick way to inspect them is shown below.
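For a quick sanity check of the downloaded results, here is a minimal sketch (assuming `jq` is installed; the exact file names inside the folder depend on your run):

```bash
# List the downloaded result files (path from the step above).
ls ./tools/benchmark/output/default-run/my-benchmark/results/json

# Pretty-print the JSON result files; the wildcard is illustrative, since the
# actual file names depend on the benchmark run.
jq . ./tools/benchmark/output/default-run/my-benchmark/results/json/*.json | head -50
```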

### Tips
#### Tips

* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
* You can set the `run_id="runX"` environment variable when running the `download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get more statistically meaningful results and to group the results accordingly. See the example after this list.
* Update the `request_rates` to best suit your benchmark environment.
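As an example of the `run_id` tip above, a repeated run might be downloaded and grouped like this (both values are placeholders):

```bash
# `run_id` groups repeated runs of the same benchmark; `benchmark_id` labels this run.
run_id='run2' benchmark_id='my-benchmark' ./tools/benchmark/scripts/download-benchmark-results.bash
```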

### Advanced Benchmark Configurations
## Run benchmarks automatically

The [benchmark automation tool](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark) enables defining benchmarks via a config file and running them
automatically. It is currently experimental. To try it, refer to its [user guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/tools/benchmark).
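As a minimal sketch of the automated flow (assuming you have cloned the repository and satisfied the prerequisites listed in the tool's README):

```bash
# From the repository root, run every benchmark defined in the tool's catalog.
cd tools/benchmark
./scripts/run_all_benchmarks.bash
```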


## Advanced Benchmark Configurations

Please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.

1 change: 1 addition & 0 deletions tools/benchmark/.gitignore
@@ -0,0 +1 @@
output/
195 changes: 194 additions & 1 deletion tools/benchmark/README.md
@@ -1 +1,194 @@
This folder contains resources to run performance benchmarks. Please follow the benchmark guide at https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark.

## Features

1. **Config-driven benchmarks**. Use the `./proto/benchmark.proto` API to write benchmark configurations, without the need to craft complex YAML manifests.
2. **Reproducibility**. The tool will snapshot all the manifests needed for the benchmark run and mark them immutable (unless the user explicitly overrides it).
3. **Benchmark inheritance**. Extend an existing benchmark configuration by overriding a subset of parameters, instead of re-writing everything from scratch.
4. **Benchmark orchestration**. The tool automatically deploys the benchmark environment into a cluster, waits to collect results, and then tears down the environment. It deploys the benchmark resources in new namespaces so each benchmark runs independently.
5. **Auto-generated request rates**. The tool can automatically generate request rates for known models and accelerators to cover a wide range of model server load, from low latency to fully saturated throughput.
6. **Visualization tools**. The results can be analyzed with a Jupyter notebook.
7. **Model server metrics**. The tool uses the Latency Profile Generator benchmark tool to scrape metrics from Google Cloud Monitoring. It also provides a link to a Google Cloud Monitoring dashboard for detailed analysis.

### Future Improvements

1. The benchmark config and results are stored in protobuf format. The results can be persisted in a database such as Google Cloud Spanner to allow complex query and dashboarding use cases.
2. Support running benchmarks in parallel with user configured parallelism.

## Prerequisites

1. [Install helm](https://helm.sh/docs/intro/quickstart/#install-helm)
2. Install InferenceModel and InferencePool [CRDs](https://gateway-api-inference-extension.sigs.k8s.io/guides/#install-the-inference-extension-crds)
3. [Enable Envoy patch policy](https://gateway-api-inference-extension.sigs.k8s.io/guides/#update-envoy-gateway-config-to-enable-patch-policy).
4. Install [RBACs](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/12bcc9a85dad828b146758ad34a69053dca44fa9/config/manifests/inferencepool.yaml#L78) for EPP to read pods.
5. Create a secret in the default namespace containing the HuggingFace token.

```bash
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
```

6. [Optional, GCP only] Create a `gmp-test-sa` service account with the `roles/monitoring.viewer` role to read additional model server metrics from Cloud Monitoring.

```bash
gcloud iam service-accounts create gmp-test-sa && \
gcloud projects add-iam-policy-binding ${BENCHMARK_PROJECT} \
  --member=serviceAccount:gmp-test-sa@${BENCHMARK_PROJECT}.iam.gserviceaccount.com \
  --role=roles/monitoring.viewer
```

## Get started

Run all existing benchmarks:

```bash
# Run all benchmarks in the ./catalog/benchmark folder
./scripts/run_all_benchmarks.bash
```

View the benchmark results:

* To view raw results, watch for a new results folder to be created at `./output/{run_id}/`.
* To visualize the results, use the Jupyter notebook (one way to open it is sketched after this list).
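A minimal sketch for opening the notebook locally (assuming Python and pip are available; the notebook file is not named here, so open whichever notebook this folder provides):

```bash
# Install Jupyter and start it from the tool folder, then open the provided
# analysis notebook from the browser UI.
pip install notebook
jupyter notebook
```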

## Common usage

### Run all benchmarks in a particular benchmark config file and upload results to GCS

```bash
gcs_bucket='my-bucket' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

### Generate benchmark manifests only

```bash
# All available environment variables.
benchmarks=benchmarks ./scripts/generate_manifests.bash
```

### Run particular benchmarks in a benchmark config file, by matching a benchmark name regex

```bash
# Run all benchmarks with Nvidia H100
gcs_bucket='my-bucket' benchmarks=benchmarks benchmark_name_regex='.*h100.*' ./scripts/run_benchmarks_file.bash
```

### Resume a benchmark run from an existing run_id

You may resume benchmarks from previously generated manifests. The tool will skip benchmarks that already have a `results` folder and continue those without results.

```bash
run_id='existing-run-id' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```
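Before resuming, you can check which benchmarks already have results (the run_id value is a placeholder; paths follow the layout described under "How does it work?"):

```bash
# Benchmarks that already contain a results folder will be skipped on resume.
find ./output/existing-run-id -type d -name results
```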

### Keep the benchmark environment after a benchmark completes (for debugging)

```bash
# All available environment variables.
skip_tear_down='true' benchmarks=benchmarks ./scripts/run_benchmarks_file.bash
```

## Command references

```bash
# All available environment variables
regex='my-benchmark-file-name-regex' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_all_benchmarks.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' dry_run='false' gcs_bucket='my-bucket' skip_tear_down='false' benchmarks=benchmarks benchmark_name_regex='my-benchmark-name-regex' ./scripts/run_benchmarks_file.bash
```

```bash
# All available environment variables.
run_id='existing-run-id' benchmarks=benchmarks ./scripts/generate_manifests.bash
```

## How does it work?

The tool automates the following steps:

1. Reads the benchmark config file in `./catalog/{benchmarks_config_file}`. The file contains a list of benchmarks. The config API is defined in `./proto/benchmark.proto`.
2. Generates a new run_id and namespace `{benchmark_name}-{run_id}` to run the benchmarks. If the `run_id` environment variable is provided, the tool reuses it instead of creating a new one. This is useful when resuming a previous benchmark run, or when running multiple sets of benchmarks in parallel (e.g., running benchmarks on different accelerator types in parallel using the same run_id).
3. Based on the config, generates manifests in `./output/{run_id}/{benchmark_name}-{run_id}/manifests`.
4. Applies the manifests to the cluster and waits for resources to be ready.
5. Once the benchmark finishes, downloads benchmark results to `./output/{run_id}/{benchmark}-{run_id}/results` (see the directory sketch after this list).
6. [Optional] If a GCS bucket is specified, uploads the output folder to a GCS bucket.
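Putting the paths from steps 3 and 5 together, here is a sketch of the on-disk layout for a single benchmark (the run_id and benchmark name below are placeholders):

```bash
# Placeholder values for illustration only.
run_id='demo-run'
benchmark='base-benchmark'

# Step 3: generated manifests land here.
ls ./output/${run_id}/${benchmark}-${run_id}/manifests

# Step 5: downloaded results land here once the benchmark completes.
ls ./output/${run_id}/${benchmark}-${run_id}/results
```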

## Create a new benchmark

You can either add new benchmarks to an existing benchmark config file, or create new benchmark config files. Each benchmark config file contains a list of benchmarks.

An example benchmark with all available parameters is as follows:

```
benchmarks {
  name: "base-benchmark"
  config {
    model_server {
      image: "vllm/vllm-openai@sha256:8672d9356d4f4474695fd69ef56531d9e482517da3b31feb9c975689332a4fb0"
      accelerator: "nvidia-h100-80gb"
      replicas: 1
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
      }
    }
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v0.1.0"
          }
        }
      }
    }
    benchmark_tool {
      image: "us-docker.pkg.dev/gke-inference-gateway-dev/benchmark/benchmark-tool@sha256:1fe4991ec1e9379b261a62631e1321b8ea15772a6d9a74357932771cea7b0500"
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "60"
        output_length: "1024"
      }
    }
  }
}
```
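To run a newly created config file, save it under `./catalog/` and pass its name via the `benchmarks` variable. The file name below is hypothetical, and whether the `benchmarks` value includes the file extension is an assumption; mirror how the existing catalog files are referenced.

```bash
# Hypothetical config file ./catalog/my-benchmarks.pbtxt; the exact naming
# convention expected by the scripts is an assumption.
benchmarks=my-benchmarks ./scripts/run_benchmarks_file.bash
```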

### Create a benchmark from a base benchmark

It's recommended to create a benchmark from an existing benchmark by overriding a few parameters. This inheritance feature makes it convenient to create a large number of benchmarks. Below is an example that overrides the replica count of a base benchmark:

```
benchmarks {
  name: "new-benchmark"
  base_benchmark_name: "base-benchmark"
  config {
    model_server {
      replicas: 2
    }
  }
}
```

## Environment configurations

The tool has default configurations (such as the cluster name) in `./scripts/env.sh`. You can tweak those for your own needs.

## The benchmark.proto

The `./proto/benchmark.proto` file is the core of this tool; it drives the generation of the benchmark manifests, as well as the query and dashboarding of the results.

Why do we need it?

* An API to clearly capture the intent, instead of making various assumptions.
* It lets the user focus only on the core parameters of the benchmark itself, rather than on the toil of configuring the environment and crafting the manifests.
* It is the single source of truth that drives the entire lifecycle of the benchmark, including post analysis.

## Contribute

Refer to the [dev guide](./dev.md).
66 changes: 66 additions & 0 deletions tools/benchmark/catalog/base-model.pbtxt
@@ -0,0 +1,66 @@

# proto file: proto/benchmark.proto
# proto message: Benchmarks

benchmarks {
  name: "r8-svc-vllmv1"
  config {
    model_server {
      image: "vllm/vllm-openai:v0.8.1"
      accelerator: "nvidia-h100-80gb"
      replicas: 8
      vllm {
        tensor_parallelism: "1"
        model: "meta-llama/Llama-2-7b-hf"
        v1: "1"
      }
    }
    load_balancer {
      k8s_service {}
    }
    benchmark_tool {
      # The following image was built from this source https://github.com/AI-Hypercomputer/inference-benchmark/tree/07628c9fe01b748f5a4cc9e5c2ee4234aaf47699
      image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
      lpg {
        dataset: "sharegpt_v3_unfiltered_cleaned_split"
        models: "meta-llama/Llama-2-7b-hf"
        tokenizer: "meta-llama/Llama-2-7b-hf"
        ip: "to-be-populated-automatically"
        port: "8081"
        benchmark_time_seconds: "100"
        output_length: "2048"
      }
    }
  }
}

benchmarks {
  name: "r8-epp-vllmv1"
  base_benchmark_name: "r8-svc-vllmv1"
  config {
    load_balancer {
      gateway {
        envoy {
          epp {
            image: "us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main"
            refresh_metrics_interval: "50ms"
          }
        }
      }
      full_duplex_streaming_enabled: true
    }
  }
}

benchmarks {
  name: "r8-epp-no-streaming-vllmv1"
  base_benchmark_name: "r8-epp-vllmv1"
  config {
    load_balancer {
      gateway {
        full_duplex_streaming_enabled: false
      }
    }
  }
}
23 changes: 23 additions & 0 deletions tools/benchmark/catalog/charts/BenchmarkTool/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions tools/benchmark/catalog/charts/BenchmarkTool/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: BenchmarkTool
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"