
Commit 0464feb

Add a section to LMEval doc explaining the results.json output (#49)
* Add a section to LMEval doc explaining the results.json output
* Address review; add details on YAML of LMEvalJob
* Add further details to clarify cancelled/complete state and reason
1 parent d1d220c commit 0464feb

1 file changed: +194 -3 lines changed

docs/modules/ROOT/pages/lm-eval-tutorial.adoc

The `f1_micro`, `f1_macro`, and `accuracy` scores are 0.56, 0.36, and 0.56. The full results are stored in the `.status.results` field of the `LMEvalJob` object as a JSON document. The command above retrieves only the `results` field of that JSON document. See <<output>> for more details.
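
For example, the stored document can be fetched and filtered with `jq`. A minimal sketch, borrowing the job name and namespace from the example in <<output>> (yours will differ):

[source,shell]
----
# Fetch only the results JSON document from the job's status, then keep the metrics
kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.results}' | jq '.results'
----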

== Details of LMEvalJob [[crd]]

In this section, let's review each property in the LMEvalJob and its usage.

== Output of LMEvalJob [[output]]

The output of an LMEvalJob is a YAML document with several fields. The `status` section provides the relevant information about the current status and, if the job completes successfully, the results of the evaluation.
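
The full document can be retrieved with `kubectl`, for example for the job shown below:

[source,shell]
----
kubectl get lmevaljob lmeval-test -n test -o yaml
----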

[source,yaml]
----
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"trustyai.opendatahub.io/v1alpha1","kind":"LMEvalJob","metadata":{"annotations":{},"name":"lmeval-test","namespace":"test"},"spec":{"allowCodeExecution":true,"allowOnline":true,"logSamples":true,"model":"hf","modelArgs":[{"name":"pretrained","value":"google/flan-t5-base"}],"taskList":{"taskRecipes":[{"card":{"name":"cards.wnli"},"template":"templates.classification.multi_class.relation.default"}]}}}
  creationTimestamp: "2025-02-06T18:13:35Z"
  finalizers:
  - trustyai.opendatahub.io/lmes-finalizer
  generation: 1
  name: lmeval-test
  namespace: test
  resourceVersion: "19604113"
  uid: e1d29da2-bf3e-4f46-8907-6018e5741eb4
spec:
  allowCodeExecution: true
  allowOnline: true
  logSamples: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: cards.wnli
      template: templates.classification.multi_class.relation.default
status:
  completeTime: "2025-02-06T18:31:20Z"
  lastScheduleTime: "2025-02-06T18:13:35Z"
  message: job completed <1>
  podName: lmeval-test
  reason: Succeeded <2>
  results: |- <3>
    {
      ...
    }
  state: Complete <4>
----

<1> A `message` provides an explanation related to the current or final status of an LMEvalJob. If the job's reason is `Failed`, the related error message is shown here.
<2> A one-word `reason` that corresponds to the given `state` of the job at this time. Possible values are:

* `NoReason`: The job is still running
* `Succeeded`: The job finished successfully
* `Failed`: The job failed
* `Cancelled`: The job was cancelled

<3> The `results` field is the direct output of an `lm-evaluation-harness` run. It has been omitted here to avoid repetition; the link:#output[next section] gives an example of its contents. This field remains empty until the job completes.
<4> The current `state` of this job. The `reason` for a particular state is given in the `reason` field; the two can be polled together to track a job, as shown in the sketch after this list. Possible values are:

* `New`: The job was just created
* `Scheduled`: The job is scheduled and waiting for available resources to run
* `Running`: The job is currently running
* `Complete`: The job is complete. This may correspond to the `Succeeded`, `Failed`, or `Cancelled` reason.
* `Cancelled`: Job cancellation has been initiated. The `state` updates to `Complete` once the cancellation has been processed by the job controller.
* `Suspended`: The job has been suspended
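
Taken together, `state`, `reason`, and `message` are enough to script a simple wait-for-completion check. A minimal sketch, reusing the illustrative job name and namespace from above:

[source,shell]
----
# Poll until the job reaches a terminal state, then report how it ended
while [ "$(kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.state}')" != "Complete" ]; do
  sleep 10
done
kubectl get lmevaljob lmeval-test -n test -o jsonpath='{.status.reason}: {.status.message}'
----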

=== `results` section

The `results` field is the direct output of an `lm-evaluation-harness` run. Below is an example of the file returned after an `lm-evaluation-harness` evaluation run and, consequently, of the contents of the `results` field in the LMEvalJob output YAML. This file may look slightly different depending on which link:#crd[options] are passed.

The example shown here is of a Unitxt task called `tr_0`, corresponding to the custom Unitxt task shown in link:#custom_card[this section].

[source,json]
----
{
  "results": { <1>
    "tr_0": {
      "alias": "tr_0",
      "f1_micro,none": 0.5,
      "f1_micro_stderr,none": "N/A",
      "accuracy,none": 0.5,
      "accuracy_stderr,none": "N/A",
      "f1_macro,none": 0.3333333333333333,
      "f1_macro_stderr,none": "N/A"
    }
  },
  "group_subtasks": { <2>
    "tr_0": []
  },
  "configs": { <3>
    "tr_0": {
      "task": "tr_0",
      "dataset_name": "card=cards.wnli,template=templates.classification.multi_class.relation.default",
      "unsafe_code": false,
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "\n\n"
        ],
        "do_sample": false
      },
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0
      }
    }
  },
  "versions": { <4>
    "tr_0": 0
  },
  "n-shot": { <4>
    "tr_0": 0
  },
  "higher_is_better": { <5>
    "tr_0": {
      "f1_micro": true,
      "accuracy": true,
      "f1_macro": true
    }
  },
  "n-samples": { <5>
    "tr_0": {
      "original": 71,
      "effective": 10
    }
  },
  "config": { <6>
    "model": "hf",
    "model_args": "pretrained=hf_home/flan-t5-base",
    "model_num_parameters": 247577856,
    "model_dtype": "torch.float32",
    "model_revision": "main",
    "model_sha": "",
    "batch_size": 1,
    "batch_sizes": [],
    "use_cache": null,
    "limit": 10.0,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "af2d2f3e",
  "date": 1740763246.8746712,
  "pretty_env_info": "PyTorch version: 2.5.1\nIs debug build: False\nCUDA used to build PyTorch: None\nROCM used to build PyTorch: N/A\n\nOS: macOS 15.3.1 (arm64)\nGCC version: Could not collect\nClang version: 16.0.0 (clang-1600.0.26.3)\nCMake version: Could not collect\nLibc version: N/A\n\nPython version: 3.11.11 (main, Dec 11 2024, 10:25:04) [Clang 14.0.6 ] (64-bit runtime)\nPython platform: macOS-15.3.1-arm64-arm-64bit\nIs CUDA available: False\nCUDA runtime version: No CUDA\nCUDA_MODULE_LOADING set to: N/A\nGPU models and configuration: No CUDA\nNvidia driver version: No CUDA\ncuDNN version: No CUDA\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nApple M1 Max\n\nVersions of relevant libraries:\n[pip3] mypy==1.15.0\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==2.2.2\n[pip3] torch==2.5.1\n[conda] numpy 2.2.2 pypi_0 pypi\n[conda] torch 2.5.1 pypi_0 pypi",
  "transformers_version": "4.48.1",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<pad>",
    "0"
  ],
  "tokenizer_eos_token": [
    "</s>",
    "1"
  ],
  "tokenizer_bos_token": [
    null,
    "None"
  ],
  "eot_token_id": 1,
  "max_length": 512,
  "task_hashes": {},
  "model_source": "hf",
  "model_name": "hf_home/flan-t5-base",
  "model_name_sanitized": "hf_home__flan-t5-base",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 84598.410512833, <7>
  "end_time": 84647.782769875,
  "total_evaluation_time_seconds": "49.37225704200682"
}
----

<1> `results` is a dictionary of tasks keyed by task name. For each task, the calculated metrics are shown. These metrics are dependent on the task definition. `results` is a flat dictionary, so if a task has subtasks, they are not nested under the parent task but appear as entries of their own.
<2> `group_subtasks` is a dictionary of tasks keyed by name, where each value is a list of strings naming the subtasks of that task. `group_subtasks` is empty in this example because there are no subtasks.
<3> `configs` is a dictionary of tasks keyed by task name that shows the configuration options for each task run. These key-value pairs are provided by the task definition (or default values) and will vary depending on the type of task run.
<4> `versions` and `n-shot` are flat dictionaries with one key for each task run. The value in the `versions` dictionary is the version of the given task (or 0 by default). The value in the `n-shot` dictionary is the number of few-shot examples that were placed in context when running the task. This information is also available in the `configs` dictionary.
<5> `higher_is_better` and `n-samples` are dictionaries with one key-dictionary pair for each task run. The former indicates, for each metric evaluated on that task, whether a higher score is considered better. The latter gives, for each task, the number of samples used during evaluation. In this example, the `limit` option was set to 10, making the `effective` number of samples 10.
<6> `config` is a dictionary of key-value pairs describing the evaluation job as a whole. This includes the type of model run, the `model_args`, and link:#crd[other settings] used for the run. Many of the values in this example are the default values defined by `lm-evaluation-harness`.
<7> The final three fields give the start, end, and total evaluation time for this job.

The remaining key-value pairs define a variety of environment settings used for this evaluation job.
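
Because the document is plain JSON, any of the fields above can be read out with standard tooling once it has been saved. A small sketch with `jq` (the file name is illustrative):

[source,shell]
----
# Per-task metrics for the tr_0 task
jq '.results.tr_0' results.json

# Effective sample counts and metric directions for tr_0
jq '.["n-samples"].tr_0, .higher_is_better.tr_0' results.json
----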
=== Custom Unitxt Card [[custom_card]]

Pass a custom Unitxt Card in JSON format:
