
Commit 5ea7515: Update index.md

1 parent d737900

File tree

1 file changed: +25, -14 lines changed

docs/index.md (+25, -14)
```diff
@@ -27,39 +27,50 @@
 <div class="grid cards" markdown>
 
 
-- :material-book:{ .lg .middle } __Leaderboard__
+- :material-play:{ .lg .middle } __Github Repo__
 
----
+---
 
-How good are LMs at science, really?
+Learn how to evaluate your model
 
-[:octicons-arrow-right-24: Browse the results](leaderboard.md)
+[:octicons-arrow-right-24: Installation & usage](https://github.com/scicode-bench/SciCode)
 
-- :material-book:{ .lg .middle } __Paper__
 
----
+- :material-book:{ .lg .middle } __Paper__
 
-Learn all the details
+---
 
-[:octicons-arrow-right-24: Read the paper](https://arxiv.com)
-</div>
+Learn all the details
+
+[:octicons-arrow-right-24: Read the paper](https://arxiv.com)
+</div>
 
 
 
 <div class="grid cards" markdown>
 
 
 
-- :material-play:{ .lg .middle } __Installation & usage__
+- :material-book:{ .lg .middle } __Dataset__
+
+---
+
+Dataset
+
+[:octicons-arrow-right-24: Download Dataset](leaderboard.md)
+
 
----
+- :material-book:{ .lg .middle } __Leaderboard__
 
-Learn how to evaluate your model
+---
 
-[:octicons-arrow-right-24: Read the docs](docs/index.md)
+How good are LMs at science, really?
+(Coming soon...)
 
-</div>
+[:octicons-arrow-right-24: Browse the results](leaderboard.md)
+
+</div>
 
+
 
```
## Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers **16** subdomains from **6** domains, including Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions of useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude 3.5 Sonnet, the best-performing model among those tested, solves only **4.6%** of the problems in the most realistic setting. Broadly, SciCode mirrors scientists' everyday workflow of identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also helps shed light on the future building and evaluation of scientific AI.
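To make the evaluation setup concrete, here is a minimal sketch of how a decomposed problem might be scored against scientist-annotated test cases. The names `Subproblem`, `evaluate`, and the toy `kinetic_energy` task are hypothetical illustrations, not SciCode's actual harness or API.

```python
# Illustrative sketch only: SciCode's real harness differs.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Subproblem:
    """One step of a decomposed main problem."""
    prompt: str
    # Each gold test takes a candidate function and returns pass/fail.
    gold_tests: List[Callable[[Callable], bool]] = field(default_factory=list)

def evaluate(solution: Callable, sub: Subproblem) -> float:
    """Fraction of annotated test cases the generated code passes."""
    passed = sum(1 for test in sub.gold_tests if test(solution))
    return passed / len(sub.gold_tests)

# Toy subproblem: kinetic energy, E = 1/2 m v^2.
sub = Subproblem(
    prompt="Write kinetic_energy(m, v) returning 0.5 * m * v**2.",
    gold_tests=[
        lambda f: f(2.0, 3.0) == 9.0,
        lambda f: f(0.0, 10.0) == 0.0,
    ],
)

def kinetic_energy(m, v):  # stand-in for model-generated code
    return 0.5 * m * v ** 2

print(evaluate(kinetic_energy, sub))  # 1.0
```

In this toy setting the candidate passes both tests; a benchmark problem is solved only when every subproblem's generated code passes all of its test cases.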
