## llama2.rs

A Rust port of [llama2.c](https://huggingface.co/karpathy/llama2.c).

The goal of `llama2.rs` is to provide a Rust port of llama2.c,
primarily targeting a cross-platform implementation for on-device inference.

Features to highlight:
- Similar to `llama2.c` built with OpenMP, `llama2.rs` parallelizes the model computation across threads.
- Memory-maps the model weights to save runtime memory (enabled with the `--is_mmap` flag); see the sketch after this list.
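
Below is a minimal, illustrative sketch of these two ideas, not the actual `llama2.rs` code: it assumes the `rayon` crate for thread-level parallelism and the `memmap2` crate for memory-mapping, and the file name and matrix sizes are placeholders.

```rust
// Illustrative sketch only (not the llama2.rs implementation).
// Assumed crates: rayon = "1", memmap2 = "0.9".
use std::fs::File;

use memmap2::Mmap;
use rayon::prelude::*;

/// out = W * x with W stored row-major (`out.len()` rows, `cols` columns).
/// Each output element is computed by its own Rayon task, roughly what
/// llama2.c gets from an OpenMP `parallel for` over rows.
fn matmul(out: &mut [f32], w: &[f32], x: &[f32], cols: usize) {
    out.par_iter_mut().enumerate().for_each(|(i, o)| {
        let row = &w[i * cols..(i + 1) * cols];
        *o = row.iter().zip(x).map(|(a, b)| a * b).sum();
    });
}

fn main() -> std::io::Result<()> {
    // What `--is_mmap` enables conceptually: map the checkpoint instead of
    // reading it into RAM, so pages are faulted in lazily as weights are
    // touched and the resident set stays small.
    let file = File::open("stories15M.bin")?;
    let mapped = unsafe { Mmap::map(&file)? };
    println!("mapped {} bytes without copying them into memory", mapped.len());

    // Toy data just to exercise the parallel kernel.
    let (rows, cols) = (4usize, 3usize);
    let w: Vec<f32> = (0..rows * cols).map(|i| i as f32).collect();
    let x = vec![1.0f32; cols];
    let mut out = vec![0.0f32; rows];
    matmul(&mut out, &w, &x, cols);
    println!("{out:?}");
    Ok(())
}
```

With a mapped checkpoint, weight pages are only brought into memory as they are used, which is where the `--is_mmap` memory saving comes from.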

### How to build and run inference.

**Prerequisite**: Download the pretrained tinyllamas models.

```bash
# stories15M is used for tests and stories110M is used for the benchmark.
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
```

You can use `cargo` to build and run inference for the `stories15M` model:

```bash
cargo run --release -- --model_path=./stories15M.bin
```

See `cargo run --release -- --help` for the full help doc.

You can run the unit tests with the command below, provided `stories15M.bin` has been downloaded in advance:

```bash
cargo test
```

The command to run the benchmark with `stories110M.bin` is:

```bash
cargo run --release -- --model_path=./stories110M.bin --is_benchmark
```
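
For context, the benchmark figure reported below is the usual tokens-per-second metric (generated tokens divided by wall-clock time). A minimal sketch of such a measurement, where `generate_next_token` is a hypothetical stand-in for one forward pass plus sampling and not a function in this repo:

```rust
use std::time::Instant;

// Hypothetical timing helper, not part of llama2.rs: runs `steps` generation
// steps and returns tokens per second.
fn tokens_per_second(steps: usize, mut generate_next_token: impl FnMut() -> u32) -> f64 {
    let start = Instant::now();
    for _ in 0..steps {
        let _token = generate_next_token();
    }
    steps as f64 / start.elapsed().as_secs_f64()
}
```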

### Performance comparison.

We run the inference benchmark on `stories110M.bin`
and compare against llama2.c and Hugging Face's [candle](https://github.com/huggingface/candle) library.

The numbers are based on 10 repeated runs on my MacBook,
reporting the mean and standard deviation (computed as in the sketch after the spec list). Here is my spec:

- 2.6 GHz 6-Core Intel Core i7, L2/L3 cache: 256 KB/12 MB.
- Memory: 16 GB 2667 MHz DDR4. Disk: Apple SSD.
- OS: macOS 13.5.
- CC: Apple clang version 14.0.0.
- Rust: rustc 1.71.1.
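
Each table entry is the mean of the 10 per-run tokens/s values together with their sample standard deviation; a minimal sketch of that aggregation (the sample numbers are placeholders, not the actual measurements):

```rust
// Mean and sample standard deviation over per-run tokens/s figures.
fn mean_std(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

fn main() {
    // Placeholder per-run values, not the actual benchmark data.
    let runs = [40.1, 41.3, 39.8, 42.0, 38.9, 40.7, 41.9, 37.5, 40.0, 39.6];
    let (mean, std) = mean_std(&runs);
    println!("{mean:.3} (+-{std:.3})");
}
```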

| Experiments      | #Token/s: mean (+- std) |
|------------------|-------------------------|
| llama2.rs        | 40.228 (+-1.691)        |
| llama2.rs (mmap) | 37.736 (+-1.864)        |
| llama2.c         | 27.585 (+-2.003)        |
| candle           | 12.534 (+-0.417)        |

Notes:
- mmap: run with the `--is_mmap` flag. Peak memory cost drops from 480MB to 9MB.

- [llama2.c](https://huggingface.co/karpathy/llama2.c) is built and run with the OpenMP and `-Ofast` options:

```bash
clang -Ofast -fopenmp -march=native run.c -lm -o run
./run stories110M.bin
```

  (You may need LLVM and OpenMP installed.)

- [candle](https://github.com/huggingface/candle) is built with the `accelerate` feature:

```bash
cargo run --release --features accelerate --package candle-examples inference --which-model=stories110M.bin
```

## README.md of the original [llama2.c](https://github.com/karpathy/llama2.c)

<p align="center">
  <img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">