<!-- <p align="center">
<img src='logo.png' width='200'>
</p> -->

# SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
[Paper](https://put-here-your-paper.com)
[Model on Hugging Face](https://huggingface.co/UKPLab/Llama-3-8b-spare-prm-math)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
[Python](https://www.python.org/)
[CI](https://github.com/UKPLab/arxiv2025-repa-prm/actions/workflows/main.yml)

## Description

This repository contains the training, inference, and evaluation code for our arXiv 2025 paper, [SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling]().

> **Abstract:** Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce **S**ingle-**P**ass **A**nnotation with **R**eference-Guided **E**valuation (**SPARE**), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that *SPARE*, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, *SPARE* achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation.

## Installation

Create a `conda` / `mamba` / `venv` virtual environment and install the dependencies listed in `requirements.txt`, e.g.:

```bash
mamba create -n spare python  # "python" ensures pip is available inside the new env
mamba activate spare
pip install -r requirements.txt
```

## Running the experiments

The parameters of the experiments are specified in their respective `config` files:

```
config/
├── eval-config.yaml
├── infer-config.yaml
├── infer-rm-config.yaml
├── private-config.yaml
├── train-po-config.yaml
├── train-sft-config.yaml
└── train-tc-rm-config.yaml
```

Private API keys, such as for using OpenAI models or for logging through the Neptune API, can be provided in the `private-config.yaml` file.
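
For reference, a minimal `private-config.yaml` might look like the sketch below. The key names are illustrative assumptions only, not the repository's actual schema; consult the file shipped in `config/` for the expected fields:

```yaml
# Hypothetical layout -- the actual key names in private-config.yaml may differ.
openai_api_key: "sk-..."         # for annotation/inference with OpenAI models
neptune_api_token: "eyJhcGk..."  # for experiment logging through Neptune
neptune_project: "my-workspace/spare"
```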

To run a desired task, e.g. the token-classification-based reward model (`tc-rm`), execute the following command:

```bash
python train_rm.py  # to use the default location of the train-tc-rm-config
# OR alternatively
python train_rm.py --config my-train-tc-rm-config.yaml
```

A trained SPARE-PRM model based on Llama-3-8b is provided for direct use at [UKPLab/Llama-3-8b-spare-prm-math](https://huggingface.co/UKPLab/Llama-3-8b-spare-prm-math). Sample code to use it is given below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

incorrect_token = "-"
correct_token = "+"
step_tag = " ки"  # the leading space is required for correct Llama tokenization

tokenizer = AutoTokenizer.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math")

step_target_ids = tokenizer.convert_tokens_to_ids([incorrect_token, correct_token])
step_tag_id = tokenizer.encode(step_tag)[-1]

device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math").to(device).eval()

# Include this instruction verbatim: it was kept unchanged during PRM training.
instruction = "You are an expert at solving challenging math problems spanning across various categories and difficulties such as Algebra, Number Theory, Geometry, Counting and Probability, Precalculus etc. For a given math problem, your task is to generate a step-by-step reasoning-based solution providing an answer to the question. Identify the correct concepts, formulas and heuristics that needs to be applied and then derive the contents of the reasoning steps from the given contexts and accurate calculations from the previous reasoning steps."
question = "Yann and Camille go to a restaurant. </S>\nIf there are 10 items on the menu, and each orders one dish, how many different combinations of meals can Yann and Camille order if they refuse to order the same dish? (It does matter who orders what---Yann ordering chicken and Camille ordering fish is different from Yann ordering fish and Camille ordering chicken.)"
correct_generation = "Let's think step by step.\nYann can order 1 of the 10 dishes. ки\nWhen he picks a dish, there are 9 left for Camille to choose from. ки\nThus, there are $10\\cdot 9=\\boxed{90}$ possible combinations.\nHence, the answer is 90. ки\n"
incorrect_generation = "Let's think step by step.\nWithout any restrictions, Yann and Camille could both order the same dish out of the 10 options, for a total of $10 \\cdot 9$ dishes. ки\nHowever, since Yann orders one of the 9 dishes that Camille didn't order (and vice versa), the number of possible combinations becomes $10 \\cdot 9 - 8 = \\boxed{72}$.\nHence, the answer is 72. ки\n"

for generation in (correct_generation, incorrect_generation):
    message = [
        dict(role="system", content=instruction),
        dict(role="user", content=question),
        dict(role="user", content=generation),
    ]

    input_ids = tokenizer.apply_chat_template(message, tokenize=True, return_tensors="pt").to(device)

    with torch.no_grad():
        # keep only the logits of the incorrect/correct target tokens
        logits = model(input_ids).logits[:, :, step_target_ids]
        scores = logits.softmax(dim=-1)[:, :, 1]  # correct_token is at index 1 in step_target_ids
        step_scores = scores[input_ids == step_tag_id]
        print(step_scores)

# tensor([0.9561, 0.9496, 0.9527]) - correct_generation
# tensor([0.6638, 0.6755]) - incorrect_generation
```
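
The per-step scores can also be aggregated into a single solution-level score, e.g. for reranking multiple sampled generations as in the reward-modelling experiments of the paper. The aggregation below (minimum over steps; product or last-step scores are common alternatives) is an illustrative choice, not one prescribed by this repository:

```python
import torch

def solution_score(step_scores: torch.Tensor) -> float:
    """Aggregate per-step PRM scores into one score for a full solution."""
    return step_scores.min().item()

# e.g. with the step_scores printed above:
# solution_score(torch.tensor([0.9561, 0.9496, 0.9527]))  # -> 0.9496 (correct solution)
# solution_score(torch.tensor([0.6638, 0.6755]))          # -> 0.6638 (incorrect solution)
```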

Contact person: [Md Imbesat Hassan Rizvi](mailto:[email protected])

[UKP Lab](https://www.ukp.tu-darmstadt.de/) | [TU Darmstadt](https://www.tu-darmstadt.de/)

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
88 | 108 |
|
89 |
| -* ... |
| 109 | +<!-- ## Getting Started |
90 | 110 |
|
91 |
| -* `z, --zzzz`: This parameter does something even nicer |
| 111 | +> **DO NOT CLONE OR FORK** |
92 | 112 |
|
93 |
| -## Development |
| 113 | +If you want to set up this template: |
94 | 114 |
|
95 |
| -Read the FAQs in [ABOUT_THIS_TEMPLATE.md](ABOUT_THIS_TEMPLATE.md) to learn more about how this template works and where you should put your classes & methods. Make sure you've correctly installed `requirements-dev.txt` dependencies |
| 115 | +1. Request a repository on UKP Lab's GitHub by following the standard procedure on the wiki. It will install the template directly. Alternatively, set it up in your personal GitHub account by clicking **[Use this template](https://github.com/rochacbruno/python-project-template/generate)**. |
| 116 | +2. Wait until the first run of CI finishes. Github Actions will commit to your new repo with a "✅ Ready to clone and code" message. |
| 117 | +3. Delete optional files: |
| 118 | + - If you don't need automatic documentation generation, you can delete folder `docs`, file `.github\workflows\docs.yml` and `mkdocs.yml` |
| 119 | + - If you don't want automatic testing, you can delete folder `tests` and file `.github\workflows\tests.yml` |
| 120 | + - If you do not wish to have a project page, delete folder `static` and files `.nojekyll`, `index.html` |
| 121 | +4. Read the file [ABOUT_THIS_TEMPLATE.md](ABOUT_THIS_TEMPLATE.md) for more information about development. --> |
96 | 122 |

## Cite

If you use this repository, our trained SPARE-PRM model, or our work, please cite:

```
@misc{rizvi2024spare,