
Commit 0464feb

Add a section to LMEval doc explaining the results.json output (#49)
* Add a section to LMEval doc explaining the results.json output
* Address review; add details on YAML of LMEvalJob
* Add further details to clarify cancelled/complete state and reason
1 parent d1d220c commit 0464feb

1 file changed: +194 -3 lines changed

docs/modules/ROOT/pages/lm-eval-tutorial.adoc

The `f1_micro`, `f1_macro`, and `accuracy` scores are 0.56, 0.36, and 0.56. The full results are stored in the `.status.results` field of the `LMEvalJob` object as a JSON document. The command above retrieves only the `results` field of that JSON document. See <<output>> for more details.
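
For example, the stored document can be fetched and filtered with `jq`. A minimal sketch, borrowing the job name and namespace from the example in <<output>> (yours will differ):

[source,shell]
----
# Fetch only the results JSON document from the job's status, then keep the metrics
kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.results}' | jq '.results'
----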

== Details of LMEvalJob [[crd]]

In this section, let's review each property in the LMEvalJob and its usage.

== Output of LMEvalJob [[output]]

The output of an LMEvalJob is a YAML document with several fields. The `status` section provides the relevant information about the current status and, if the job completes successfully, the results of the evaluation.
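
The full document can be retrieved with `kubectl`, for example for the job shown below:

[source,shell]
----
kubectl get lmevaljob lmeval-test -n test -o yaml
----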

[source,yaml]
----
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"trustyai.opendatahub.io/v1alpha1","kind":"LMEvalJob","metadata":{"annotations":{},"name":"lmeval-test","namespace":"test"},"spec":{"allowCodeExecution":true,"allowOnline":true,"logSamples":true,"model":"hf","modelArgs":[{"name":"pretrained","value":"google/flan-t5-base"}],"taskList":{"taskRecipes":[{"card":{"name":"cards.wnli"},"template":"templates.classification.multi_class.relation.default"}]}}}
  creationTimestamp: "2025-02-06T18:13:35Z"
  finalizers:
  - trustyai.opendatahub.io/lmes-finalizer
  generation: 1
  name: lmeval-test
  namespace: test
  resourceVersion: "19604113"
  uid: e1d29da2-bf3e-4f46-8907-6018e5741eb4
spec:
  allowCodeExecution: true
  allowOnline: true
  logSamples: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: cards.wnli
      template: templates.classification.multi_class.relation.default
status:
  completeTime: "2025-02-06T18:31:20Z"
  lastScheduleTime: "2025-02-06T18:13:35Z"
  message: job completed <1>
  podName: lmeval-test
  reason: Succeeded <2>
  results: |- <3>
    {
      ...
    }
  state: Complete <4>
----

<1> A `message` provides an explanation related to the current or final status of an LMEvalJob. If the job's reason is `Failed`, the related error message is shown here.
<2> A one-word `reason` that corresponds to the given `state` of the job at this time. Possible values are:

* `NoReason`: The job is still running
* `Succeeded`: The job finished successfully
* `Failed`: The job failed
* `Cancelled`: The job was cancelled

<3> The `results` field is the direct output of an `lm-evaluation-harness` run. It has been omitted here to avoid repetition; the link:#output[next section] gives an example of its contents. This field remains empty until the job completes.
<4> The current `state` of this job. The `reason` for a particular state is given in the `reason` field; the two can be polled together to track a job, as shown in the sketch after this list. Possible values are:

* `New`: The job was just created
* `Scheduled`: The job is scheduled and waiting for available resources to run
* `Running`: The job is currently running
* `Complete`: The job is complete. This may correspond to the `Succeeded`, `Failed`, or `Cancelled` reason.
* `Cancelled`: Job cancellation has been initiated. The `state` updates to `Complete` once the cancellation has been processed by the job controller.
* `Suspended`: The job has been suspended
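
Taken together, `state`, `reason`, and `message` are enough to script a simple wait-for-completion check. A minimal sketch, reusing the illustrative job name and namespace from above:

[source,shell]
----
# Poll until the job reaches a terminal state, then report how it ended
while [ "$(kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.state}')" != "Complete" ]; do
  sleep 10
done
kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.reason}: {.status.message}'
----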

=== `results` section

The `results` field is the direct output of an `lm-evaluation-harness` run. Below is an example of the file returned after an `lm-evaluation-harness` evaluation run and, consequently, of the contents of the `results` field in the LMEvalJob output YAML. This file may look slightly different depending on which link:#crd[options] are passed.

The example shown here is of a Unitxt task called `tr_0`, corresponding to the custom Unitxt task shown in link:#custom_card[this section].

[source,json]
----
{
  "results": { <1>
    "tr_0": {
      "alias": "tr_0",
      "f1_micro,none": 0.5,
      "f1_micro_stderr,none": "N/A",
      "accuracy,none": 0.5,
      "accuracy_stderr,none": "N/A",
      "f1_macro,none": 0.3333333333333333,
      "f1_macro_stderr,none": "N/A"
    }
  },
  "group_subtasks": { <2>
    "tr_0": []
  },
  "configs": { <3>
    "tr_0": {
      "task": "tr_0",
      "dataset_name": "card=cards.wnli,template=templates.classification.multi_class.relation.default",
      "unsafe_code": false,
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "\n\n"
        ],
        "do_sample": false
      },
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0
      }
    }
  },
  "versions": { <4>
    "tr_0": 0
  },
  "n-shot": { <4>
    "tr_0": 0
  },
  "higher_is_better": { <5>
    "tr_0": {
      "f1_micro": true,
      "accuracy": true,
      "f1_macro": true
    }
  },
  "n-samples": { <5>
    "tr_0": {
      "original": 71,
      "effective": 10
    }
  },
  "config": { <6>
    "model": "hf",
    "model_args": "pretrained=hf_home/flan-t5-base",
    "model_num_parameters": 247577856,
    "model_dtype": "torch.float32",
    "model_revision": "main",
    "model_sha": "",
    "batch_size": 1,
    "batch_sizes": [],
    "use_cache": null,
    "limit": 10.0,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "af2d2f3e",
  "date": 1740763246.8746712,
  "pretty_env_info": "PyTorch version: 2.5.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: macOS 15.3.1 (arm64)\nGCC version: Could not collect\nClang version: 16.0.0 (clang-1600.0.26.3)\nCMake version: Could not collect\nLibc version: N/A\n\nPython version: 3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ] (64-bit runtime)\nPython platform: macOS-15.3.1-arm64-arm-64bit\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nApple M1 Max\n\nVersions of relevant libraries:\n[pip3] mypy==1.15.0\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==2.2.2\n[pip3] torch==2.5.1\n[conda] numpy 2.2.2 pypi_0 pypi\n[conda] torch 2.5.1 pypi_0 pypi",
  "transformers_version": "4.48.1",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<pad>",
    "0"
  ],
  "tokenizer_eos_token": [
    "</s>",
    "1"
  ],
  "tokenizer_bos_token": [
    null,
    "None"
  ],
  "eot_token_id": 1,
  "max_length": 512,
  "task_hashes": {},
  "model_source": "hf",
  "model_name": "hf_home/flan-t5-base",
  "model_name_sanitized": "hf_home__flan-t5-base",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 84598.410512833, <7>
  "end_time": 84647.782769875,
  "total_evaluation_time_seconds": "49.37225704200682"
}
----

<1> `results` is a dictionary of tasks keyed by task name. For each task, the calculated metrics are shown. These metrics are dependent on the task definition. `results` is a flat dictionary, so if a task has subtasks, they are not nested under the parent task but appear as entries of their own.
<2> `group_subtasks` is a dictionary of tasks keyed by name, where each value is a list of strings naming the subtasks of that task. `group_subtasks` is empty in this example because there are no subtasks.
<3> `configs` is a dictionary of tasks keyed by task name that shows the configuration options for each task run. These key-value pairs are provided by the task definition (or default values) and will vary depending on the type of task run.
<4> `versions` and `n-shot` are flat dictionaries with one key for each task run. The value in the `versions` dictionary is the version of the given task (or 0 by default). The value in the `n-shot` dictionary is the number of few-shot examples that were placed in context when running the task. This information is also available in the `configs` dictionary.
<5> `higher_is_better` and `n-samples` are dictionaries with one key-dictionary pair for each task run. The former indicates, for each metric evaluated on that task, whether a higher score is considered better. The latter gives, for each task, the number of samples used during evaluation. In this example, the `limit` option was set to 10, making the `effective` number of samples 10.
<6> `config` is a dictionary of key-value pairs describing the evaluation job as a whole. This includes the type of model run, the `model_args`, and link:#crd[other settings] used for the run. Many of the values in this example are the default values defined by `lm-evaluation-harness`.
<7> The final three fields give the start, end, and total evaluation time for this job.

The remaining key-value pairs define a variety of environment settings used for this evaluation job.
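
Because the document is plain JSON, any of the fields above can be read out with standard tooling once it has been saved. A small sketch with `jq` (the file name is illustrative):

[source,shell]
----
# Per-task metrics for the tr_0 task
jq '.results.tr_0' results.json

# Effective sample counts and metric directions for tr_0
jq '.["n-samples"].tr_0, .higher_is_better.tr_0' results.json
----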
=== Custom Unitxt Card [[custom_card]]

Pass a custom Unitxt Card in JSON format:
