Question
Hello!
What's the recommendation for implementing custom aggregate metrics like precision/recall for evals?
There's an existing `ReportCaseAggregate`, but that seems specific to calculating the average of scores.
There are a few workarounds off the top of my head:
- Implement my own `EvaluationReport` and override the dataset's `evaluate()` function to produce an aggregate report
- Write my own custom script to calculate the metrics from the list of cases, ignoring any `Evaluator` (see the sketch after this list)
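
For the second workaround, I'm imagining something like the rough sketch below. It assumes the finished report exposes a `cases` list whose items carry `output` and `expected_output` attributes; the actual attribute names may differ, and `aggregate_precision_recall` is just a helper I made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score


def aggregate_precision_recall(report) -> dict[str, float]:
    """Dataset-level precision/recall computed over all report cases."""
    # Collect predicted and expected labels case by case
    # (attribute names are assumptions about the report object).
    y_pred = [case.output for case in report.cases]
    y_true = [case.expected_output for case in report.cases]
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
    }


# report = await my_dataset.evaluate(task)  # however the eval is already run
# print(aggregate_precision_recall(report))
```

This works, but it lives entirely outside the evaluator machinery, so the metrics don't show up in the report itself.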
Ideally, it seems like we should be able to have an `Evaluator` with a `compute` function that runs on the full list of predictions and labels, similar to sklearn's `precision_score(y_true, y_pred)` or Hugging Face `evaluate`'s `compute(predictions, references)`.
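
Roughly, the interface I have in mind would look like this (purely a sketch of the idea, not an existing pydantic-evals API; `AggregateEvaluator` and `PrecisionRecall` are made-up names):

```python
from typing import Protocol, Sequence


class AggregateEvaluator(Protocol):
    """Hypothetical: an evaluator that sees the whole dataset at once."""

    def compute(self, predictions: Sequence, references: Sequence) -> dict[str, float]:
        ...


class PrecisionRecall:
    """Example implementation of the hypothetical interface above."""

    def compute(self, predictions: Sequence, references: Sequence) -> dict[str, float]:
        from sklearn.metrics import precision_score, recall_score

        return {
            "precision": precision_score(references, predictions, average="macro"),
            "recall": recall_score(references, predictions, average="macro"),
        }
```

The framework would then call `compute` once per evaluation run with the collected predictions and references, and merge the returned values into the report's aggregate row.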
Additional Context
No response