Question
Hello!
What's the recommendation for implementing custom aggregate metrics like precision/recall for evals?
There's an existing `ReportCaseAggregate`, but that seems specific to calculating the average of scores.
There are a few workarounds off the top of my head:
- Implement my own `EvaluationReport` and override the dataset's `evaluate()` function to produce an aggregate report
- Write my own custom script to calculate the metrics from the list of cases, ignoring any `Evaluator` (see the sketch after this list)
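
For the second workaround, I'm imagining something like the rough sketch below. It assumes the finished report exposes a `cases` list whose items carry `output` and `expected_output` attributes; the actual attribute names may differ, and `aggregate_precision_recall` is just a helper I made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score


def aggregate_precision_recall(report) -> dict[str, float]:
    """Dataset-level precision/recall computed over all report cases."""
    # Collect predicted and expected labels case by case
    # (attribute names are assumptions about the report object).
    y_pred = [case.output for case in report.cases]
    y_true = [case.expected_output for case in report.cases]
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
    }


# report = await my_dataset.evaluate(task)  # however the eval is already run
# print(aggregate_precision_recall(report))
```

This works, but it lives entirely outside the evaluator machinery, so the metrics don't show up in the report itself.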
Ideally, it seems like we should be able to have an `Evaluator` with a `compute` function that runs on the full list of predictions and labels, similar to sklearn's `precision_score(y_true, y_pred)` or Hugging Face `evaluate`'s `compute(predictions, references)`.
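
Roughly, the interface I have in mind would look like this (purely a sketch of the idea, not an existing pydantic-evals API; `AggregateEvaluator` and `PrecisionRecall` are made-up names):

```python
from typing import Protocol, Sequence


class AggregateEvaluator(Protocol):
    """Hypothetical: an evaluator that sees the whole dataset at once."""

    def compute(self, predictions: Sequence, references: Sequence) -> dict[str, float]:
        ...


class PrecisionRecall:
    """Example implementation of the hypothetical interface above."""

    def compute(self, predictions: Sequence, references: Sequence) -> dict[str, float]:
        from sklearn.metrics import precision_score, recall_score

        return {
            "precision": precision_score(references, predictions, average="macro"),
            "recall": recall_score(references, predictions, average="macro"),
        }
```

The framework would then call `compute` once per evaluation run with the collected predictions and references, and merge the returned values into the report's aggregate row.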
Additional Context
No response