Add a section to LMEval doc explaining the results.json output (#49)
* Add a section to LMEval doc explaining the results.json output
* Address review; add details on YAML of LMEvalJob
* Add further details to clarify cancelled/complete state and reason
docs/modules/ROOT/pages/lm-eval-tutorial.adoc
+194 −3
@@ -141,9 +141,9 @@ Here are the example results:
 }
 ----
 
-The `f1_micro`, `f1_macro`, and `accuracy` scores are 0.56, 0.36, and 0.56. The full results are stored in the `.status.results` of the `LMEvalJob` object as a JSON document. The command above only retrieves the `results` field of the JSON document.
+The `f1_micro`, `f1_macro`, and `accuracy` scores are 0.56, 0.36, and 0.56. The full results are stored in the `.status.results` of the `LMEvalJob` object as a JSON document. The command above only retrieves the `results` field of the JSON document. See <<output>> for more details.
 
-== Details of LMEvalJob
+== Details of LMEvalJob [[crd]]
 
 In this section, let's review each property in the LMEvalJob and its usage.
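The "command above" in the changed paragraph refers to an earlier retrieval of `.status.results` that falls outside this hunk. A minimal sketch of a command with that effect, assuming a job named `evaljob-sample` and `jq` installed locally:

[source,shell]
----
# Read the results JSON stored on the job object, keeping only its "results" field
kubectl get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o jsonpath='{.status.results}' | jq '.results'
----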
@@ -270,6 +270,197 @@ Specify extra information for the lm-eval job's pod.
|Mount a PVC as the local storage for models and datasets.
|===

== Output of LMEvalJob [[output]]

The output of an LMEvalJob is a YAML document with several fields. The `status` section provides the relevant information about the job's current status and, if the job completes successfully, the results of the evaluation.
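A minimal sketch of the `status` section of a completed job is shown below. The field names follow the LMEvalJob status; the values are illustrative and the `results` string is abbreviated. The callouts are explained after the listing.

[source,yaml]
----
status:
  completeTime: "2024-02-22T09:25:40Z"
  lastScheduleTime: "2024-02-22T09:20:51Z"
  message: job completed # <1>
  podName: evaljob-sample
  reason: Succeeded # <2>
  results: '{"results": {"tr_0": {"f1_micro,none": 0.56}}}' # <3>
  state: Complete # <4>
----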
<1> A `message` provides an explanation related to the current or final status of an LMEvalJob. If the job's reason is `Failed`, the related error message is shown here.
<2> A one-word `reason` that corresponds to the job's current `state`. Possible values are:

* `NoReason`: The job is still running
* `Succeeded`: The job finished successfully
* `Failed`: The job failed
* `Cancelled`: The job was cancelled

<3> The `results` field is the direct output of an `lm-evaluation-harness` run. It has been omitted here to avoid repetition. The link:#output[next section] gives an example of its contents. This field is empty until the job completes.
<4> The current `state` of this job. The `reason` for a particular state is given in the `reason` field. Possible values are:

* `New`: The job was just created
* `Scheduled`: The job is scheduled and waiting for available resources to run
* `Running`: The job is currently running
* `Complete`: The job is complete. This may correspond to the `Succeeded`, `Failed`, or `Cancelled` reason.
* `Cancelled`: Job cancellation has been initiated. The `state` will update to `Complete` once the cancellation has been processed by the job controller.
* `Suspended`: The job has been suspended
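To check where a job stands without dumping the whole object, the `state`, `reason`, and `message` fields can be read directly. A sketch, again assuming a job named `evaljob-sample`:

[source,shell]
----
# Print the current state, the one-word reason, and any explanatory message
kubectl get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o jsonpath='{.status.state}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}'
----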
=== `results` section

The `results` field is the direct output of an `lm-evaluation-harness` run. Below is an example of the file that is returned after an `lm-evaluation-harness` evaluation run and, consequently, of the contents of the `results` dictionary in the LMEvalJob output YAML. This file may look slightly different depending on which link:#crd[options] are passed.

The example shown here is of a Unitxt task called `tr_0` that corresponds to the custom Unitxt task shown in link:#custom_card[this section].
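The full document is long; the abbreviated sketch below keeps only its overall shape. The values are illustrative (chosen to be consistent with the scores and the `limit` of 10 discussed in this tutorial, and with a hypothetical `google/flan-t5-base` run), and the callouts match the notes that follow:

[source,json]
----
{
  "results": {                                // <1>
    "tr_0": {
      "alias": "tr_0",
      "f1_micro,none": 0.56,
      "f1_macro,none": 0.36,
      "accuracy,none": 0.56
    }
  },
  "group_subtasks": {                         // <2>
    "tr_0": []
  },
  "configs": {                                // <3>
    "tr_0": {
      "task": "tr_0",
      "num_fewshot": 5,
      "output_type": "generate_until"
    }
  },
  "versions": { "tr_0": 0 },                  // <4>
  "n-shot": { "tr_0": 5 },
  "higher_is_better": {                       // <5>
    "tr_0": { "f1_micro": true, "f1_macro": true, "accuracy": true }
  },
  "n-samples": {
    "tr_0": { "original": 100, "effective": 10 }
  },
  "config": {                                 // <6>
    "model": "hf",
    "model_args": "pretrained=google/flan-t5-base",
    "batch_size": 1,
    "limit": 10.0
  },
  "start_time": 2104.16,                      // <7>
  "end_time": 2424.86,
  "total_evaluation_time_seconds": "320.70"
}
----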
<1> `results` is a dictionary of tasks keyed by task name. For each task, the calculated metrics are shown. These metrics are dependent on the task definition. `results` is a flat dictionary, so if a task has subtasks, they are not nested under a parent task but rather appear as their own entries.
<2> `group_subtasks` is a dictionary of tasks keyed by name, with the value for each being a list of strings corresponding to subtasks of this task. `group_subtasks` is empty in this example because there are no subtasks.
<3> `configs` is a dictionary of tasks keyed by task name that shows the configuration options for each task run. These key-value pairs are provided by the task definition (or default values) and will vary depending on the type of task run.
<4> `versions` and `n-shot` are flat dictionaries with one key for each task run. The value in the `versions` dictionary is the version of the given task (or 0 by default). The value in the `n-shot` dictionary is the number of few-shot examples that were placed in context when running the task. This information is also available in the `configs` dictionary.
<5> `higher_is_better` and `n-samples` are dictionaries with one key-dictionary pair for each task run. The former indicates, for each metric evaluated for that task, whether a higher score is considered better. The latter gives, for each task, the number of samples used during evaluation. In this example, the `limit` property was set to 10, making the `effective` number of samples 10.
<6> `config` is a dictionary of key-value pairs describing the evaluation job as a whole. This includes information on the type of model run, the `model_args`, and link:#crd[other settings] used for the run. Many of the values in this dictionary in this example are the default values defined by `lm-evaluation-harness`.
<7> The final three fields give the start time, end time, and total evaluation time for this job.

The remaining key-value pairs define a variety of environment settings used for this evaluation job.
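Individual metrics can be pulled out of this document without reading the whole thing. A sketch using `jq`, assuming the same `evaljob-sample` job and the `tr_0` task name:

[source,shell]
----
# Extract a single metric (micro F1 for task tr_0) from the stored results document
kubectl get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o jsonpath='{.status.results}' | jq '.results.tr_0["f1_micro,none"]'
----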
== Examples

=== Environment Variables
@@ -313,7 +504,7 @@ Or you can create a secret to store the token and refer the key from the secret