# bitnet.cpp
[License: MIT](https://opensource.org/licenses/MIT)


bitnet.cpp is the official inference framework for BitNet models (e.g., BitNet b1.58), optimized for CPU devices. It offers a suite of optimized kernels that support lossless inference of 1.58-bit models on both x86 and ARM architectures.

## Demo

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

https://github.com/user-attachments/assets/7f46b736-edec-4828-b809-4be780a3e5b1

## Timeline

- 10/17/2024 bitnet.cpp 1.0 released.
- 02/27/2024 [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)
- 10/17/2023 [BitNet: Scaling 1-bit Transformers for Large Language Models](https://arxiv.org/abs/2310.11453)

## Supported Models

bitnet.cpp supports a list of 1-bit models available on [Hugging Face](https://huggingface.co/):


<table>
    <tr>
        <th rowspan="2">Model</th>
        <th rowspan="2">Parameters</th>
        <th rowspan="2">CPU</th>
        <th colspan="3">Kernel</th>
    </tr>
    <tr>
        <th>I2_S</th>
        <th>TL1</th>
        <th>TL2</th>
    </tr>
    <tr>
        <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-large">bitnet_b1_58-large</a></td>
        <td rowspan="2">0.7B</td>
        <td>x86</td>
        <td>✔</td>
        <td>✘</td>
        <td>✔</td>
    </tr>
    <tr>
        <td>ARM</td>
        <td>✔</td>
        <td>✔</td>
        <td>✘</td>
    </tr>
    <tr>
        <td rowspan="2"><a href="https://huggingface.co/1bitLLM/bitnet_b1_58-3B">bitnet_b1_58-3B</a></td>
        <td rowspan="2">3.3B</td>
        <td>x86</td>
        <td>✘</td>
        <td>✘</td>
        <td>✔</td>
    </tr>
    <tr>
        <td>ARM</td>
        <td>✘</td>
        <td>✔</td>
        <td>✘</td>
    </tr>
    <tr>
        <td rowspan="2"><a href="https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens">Llama3-8B-1.58-100B-tokens</a></td>
        <td rowspan="2">8.0B</td>
        <td>x86</td>
        <td>✔</td>
        <td>✘</td>
        <td>✔</td>
    </tr>
    <tr>
        <td>ARM</td>
        <td>✔</td>
        <td>✔</td>
        <td>✘</td>
    </tr>
</table>
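
Any of these models can also be downloaded ahead of time with `huggingface-cli`, the same approach used in the build steps below; the local directory name here is just an example:

```bash
# Example: pre-download a supported model from Hugging Face
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir models/bitnet_b1_58-3B
```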


## Installation

### Requirements
- python>=3.9
- cmake>=3.22
- clang>=18
    - For Windows users, install [Visual Studio 2022](https://visualstudio.microsoft.com/downloads/). In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
        - Desktop development with C++
        - C++ CMake tools for Windows
        - Git for Windows
        - C++ Clang compiler for Windows
        - MS-Build support for LLVM toolset (clang)
    - For Debian/Ubuntu users, you can install clang with the [automatic installation script](https://apt.llvm.org/):

      `bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"`
- conda (highly recommended)
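
Before building, it may help to confirm that the installed toolchain meets these minimum versions. A quick sanity check (assuming the tools are on your PATH; the clang binary may be versioned, e.g. `clang-18`, when installed via apt.llvm.org):

```bash
# Verify the required tool versions (python >= 3.9, cmake >= 3.22, clang >= 18)
python3 --version
cmake --version
clang --version   # or e.g. clang-18 --version
```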

### Build from source

> [!IMPORTANT]
> If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands.
|
1. Clone the repo
```bash
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
```
2. Install the dependencies
```bash
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt
```
3. Build the project
```bash
# Download the model from Hugging Face, convert it to quantized gguf format, and build the project
python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s

# Or you can manually download the model and run with a local path
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s
```
<pre>
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
                    [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
</pre>
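
As a further illustration of these flags (a sketch, assuming an ARM machine; see the Supported Models table for valid kernel/architecture combinations), the 0.7B model can be set up with the TL1 kernel and pretuned kernel parameters:

```bash
# Example: TL1 kernel with pretuned parameters for the 0.7B model (ARM)
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large -q tl1 --use-pretuned
```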
## Usage
### Basic usage
```bash
# Run inference with the quantized model
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0

# Output:
# Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?
# Answer: Mary is in the garden.

```
<pre>
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
</pre>
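
For instance, the thread count, context size, and temperature can all be set explicitly; the prompt and values below are purely illustrative:

```bash
# Example: 8 threads, a 2048-token context, and a non-zero sampling temperature
python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "What is 1.58-bit quantization?\nAnswer:" -n 64 -t 8 -c 2048 -temp 0.8
```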

### Benchmark
We provide scripts to run the inference benchmark with a given model.

```
usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help
                        Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.
```

Here's a brief explanation of each argument:

- `-m`, `--model`: The path to the model file. This is a required argument that must be provided when running the script.
- `-n`, `--n-token`: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
- `-p`, `--n-prompt`: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
- `-t`, `--threads`: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
- `-h`, `--help`: Show the help message and exit. Use this argument to display usage information.

For example:

```sh
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
```

This command would run the inference benchmark using the model located at `/path/to/model`, generating 200 tokens from a 256-token prompt, using 4 threads.

For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

```bash
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M

# Run the benchmark with the generated model; use -m to specify the model path, -p to specify the number of prompt tokens processed, and -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
```
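
Because the benchmark is driven entirely by these CLI flags, it is easy to script comparisons; here is a minimal sketch (thread counts chosen arbitrarily) that sweeps the thread count with the dummy model generated above:

```bash
# Example: compare throughput at different thread counts
for t in 1 2 4 8; do
    python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128 -t "$t"
done
```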

## Acknowledgements

This project is based on the [llama.cpp](https://github.com/ggerganov/llama.cpp) framework. We would like to thank all the authors for their contributions to the open-source community. We also thank the [T-MAC](https://github.com/microsoft/T-MAC/) team for the helpful discussion on the LUT method for low-bit LLM inference.