Commit 4a70b2e

Docs and parameter groups

1 parent fc983ca commit 4a70b2e

3 files changed (+148, -35 lines)

README.md

Lines changed: 20 additions & 17 deletions
@@ -256,11 +256,9 @@ python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
 
 ```bash
 python -m olmocr.pipeline --help
-usage: pipeline.py [-h] [--pdfs [PDFS ...]] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
-                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--markdown] [--model MODEL] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
-                   [--max_model_len MAX_MODEL_LEN] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
-                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS]
-                   [--beaker_priority BEAKER_PRIORITY] [--port PORT] [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--data-parallel-size DATA_PARALLEL_SIZE]
+usage: pipeline.py [-h] [--pdfs [PDFS ...]] [--model MODEL] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS]
+                   [--apply_filter] [--stats] [--markdown] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--guided_decoding] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max_model_len MAX_MODEL_LEN]
+                   [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--data-parallel-size DATA_PARALLEL_SIZE] [--port PORT] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
                    workspace
 
 Manager for running millions of PDFs through a batch inference pipeline
@@ -270,7 +268,8 @@ positional arguments:
 
 options:
   -h, --help            show this help message and exit
-  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --pdfs [PDFS ...]     Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --model MODEL         Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.
   --workspace_profile WORKSPACE_PROFILE
                         S3 configuration profile for accessing the workspace
   --pdf_profile PDF_PROFILE
@@ -285,20 +284,24 @@ options:
   --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
   --stats               Instead of running any job, reports some statistics about the current workspace
   --markdown            Also write natural text to markdown files preserving the folder structure of the input pdfs
-  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
-                        one which is fastest to access
-  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
-                        Fraction of VRAM vLLM may pre-allocate for KV-cache (passed through to vllm serve).
-  --max_model_len MAX_MODEL_LEN
-                        Upper bound (tokens) vLLM will allocate KV-cache for; passed through to vllm serve as --max-model-len.
-  --model_max_context MODEL_MAX_CONTEXT
-                        Maximum context length that the model was fine tuned under
-  --model_chat_template MODEL_CHAT_TEMPLATE
-                        Chat template to pass to sglang server
   --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                         Dimension on longest side to use for rendering the pdf pages
   --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
-                        Maximum amount of anchor text to use (characters)
+                        Maximum amount of anchor text to use (characters), not used for new models
+  --guided_decoding     Enable guided decoding for model YAML type outputs
+
+VLLM Forwarded arguments:
+  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+                        Fraction of VRAM vLLM may pre-allocate for KV-cache (passed through to vllm serve).
+  --max_model_len MAX_MODEL_LEN
+                        Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start
+  --tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE
+                        Tensor parallel size for vLLM
+  --data-parallel-size DATA_PARALLEL_SIZE, -dp DATA_PARALLEL_SIZE
+                        Data parallel size for vLLM
+  --port PORT           Port to use for the VLLM server
+
+beaker/cluster execution:
   --beaker              Submit this job to beaker instead of running locally
   --beaker_workspace BEAKER_WORKSPACE
                         Beaker workspace to submit to

olmocr/pipeline.py

Lines changed: 22 additions & 18 deletions
@@ -1009,6 +1009,14 @@ async def main():
         help="Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths",
         default=None,
     )
+    parser.add_argument(
+        "--model",
+        help="Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.",
+        default="allenai/olmOCR-7B-0725-FP8",
+    )
+
+
+    # More detailed config options, usually you shouldn't have to change these
     parser.add_argument("--workspace_profile", help="S3 configuration profile for accessing the workspace", default=None)
     parser.add_argument("--pdf_profile", help="S3 configuration profile for accessing the raw pdf documents", default=None)
     parser.add_argument("--pages_per_group", type=int, default=500, help="Aiming for this many pdf pages per work item group")
@@ -1019,33 +1027,29 @@ async def main():
     parser.add_argument("--stats", action="store_true", help="Instead of running any job, reports some statistics about the current workspace")
     parser.add_argument("--markdown", action="store_true", help="Also write natural text to markdown files preserving the folder structure of the input pdfs")

-    # Model parameters
-    parser.add_argument(
-        "--model",
-        help="List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access",
-        default="allenai/olmOCR-7B-0725-FP8",
-    )
-
-    parser.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache " "(passed through to vllm serve).")
-    parser.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
-
     parser.add_argument("--target_longest_image_dim", type=int, help="Dimension on longest side to use for rendering the pdf pages", default=1288)
     parser.add_argument("--target_anchor_text_len", type=int, help="Maximum amount of anchor text to use (characters), not used for new models", default=-1)
     parser.add_argument("--guided_decoding", action="store_true", help="Enable guided decoding for model YAML type outputs")

+    vllm_group = parser.add_argument_group("VLLM Forwarded arguments")
+    vllm_group.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache " "(passed through to vllm serve).")
+    vllm_group.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
+    vllm_group.add_argument("--tensor-parallel-size", "-tp", type=int, default=1, help="Tensor parallel size for vLLM")
+    vllm_group.add_argument("--data-parallel-size", "-dp", type=int, default=1, help="Data parallel size for vLLM")
+    vllm_group.add_argument("--port", type=int, default=30024, help="Port to use for the VLLM server")
+
     # Beaker/job running stuff
-    parser.add_argument("--beaker", action="store_true", help="Submit this job to beaker instead of running locally")
-    parser.add_argument("--beaker_workspace", help="Beaker workspace to submit to", default="ai2/olmocr")
-    parser.add_argument(
+    beaker_group = parser.add_argument_group("beaker/cluster execution")
+    beaker_group.add_argument("--beaker", action="store_true", help="Submit this job to beaker instead of running locally")
+    beaker_group.add_argument("--beaker_workspace", help="Beaker workspace to submit to", default="ai2/olmocr")
+    beaker_group.add_argument(
         "--beaker_cluster",
         help="Beaker clusters you want to run on",
         default=["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", "ai2/neptune-cirrascale", "ai2/saturn-cirrascale", "ai2/augusta-google-1"],
     )
-    parser.add_argument("--beaker_gpus", type=int, default=1, help="Number of gpu replicas to run")
-    parser.add_argument("--beaker_priority", type=str, default="normal", help="Beaker priority level for the job")
-    parser.add_argument("--port", type=int, default=30024, help="Port to use for the VLLM server")
-    parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1, help="Tensor parallel size for vLLM")
-    parser.add_argument("--data-parallel-size", "-dp", type=int, default=1, help="Data parallel size for vLLM")
+    beaker_group.add_argument("--beaker_gpus", type=int, default=1, help="Number of gpu replicas to run")
+    beaker_group.add_argument("--beaker_priority", type=str, default="normal", help="Beaker priority level for the job")
+
     args = parser.parse_args()

     logger.info(
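
For readers less familiar with `argparse`, here is a minimal, self-contained sketch of the grouping pattern this commit introduces. The flags and values below are illustrative only, not olmocr's full CLI: options registered on an argument group still parse into the same flat namespace, and the group only changes how `--help` output is laid out.

```python
import argparse

# Minimal sketch of argparse argument groups (hypothetical flags, not the full olmocr CLI).
parser = argparse.ArgumentParser(description="Demo of grouped CLI options")
parser.add_argument("--markdown", action="store_true", help="A plain top-level option")

# Grouped options: grouping only affects the --help layout, not parsing.
vllm_group = parser.add_argument_group("VLLM Forwarded arguments")
vllm_group.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM to pre-allocate")
vllm_group.add_argument("--port", type=int, default=30024, help="Port for the inference server")

beaker_group = parser.add_argument_group("beaker/cluster execution")
beaker_group.add_argument("--beaker", action="store_true", help="Submit to a cluster instead of running locally")

# All options land in the same flat namespace regardless of group.
args = parser.parse_args(["--gpu-memory-utilization", "0.8"])
print(args.gpu_memory_utilization, args.port, args.beaker)  # 0.8 30024 False
```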

olmocr/train/README.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# olmOCR Training Guide

This guide provides comprehensive instructions for training olmOCR models, including what you need to reproduce https://huggingface.co/allenai/olmOCR-7B-0725-FP8 on your own hardware.

## Environment setup

The first step is to set up your python/conda environment the same way as for running olmocr.

Then install the extra training requirements:

```bash
pip install -r gantry-train-requirements.txt
pip install transformers==4.52.4
pip install flash-attn==2.8.0.post2 --no-build-isolation
```

### Dataset Format

The training data should be organized as pairs of PDF files and their corresponding markdown annotations:

**Important: Each PDF needs to be a single page only!**

```
data/
├── document1.pdf
├── document1.md
├── document2.pdf
├── document2.md
└── ...
```

Each markdown file should contain:
1. YAML front matter with metadata
2. The extracted text content

Example markdown format:
```markdown
---
primary_language: en
is_rotation_valid: True
rotation_correction: 0
is_table: False
is_diagram: False
---
Document text goes here...
```
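
As a quick sanity check on files in this format, the front matter can be split from the body with a few lines of Python. This is only an illustrative sketch, not part of the olmOCR training code; it assumes PyYAML is installed and that each `.md` file sits next to its matching single-page `.pdf`:

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be installed


def load_training_pair(md_path: str):
    """Split a training .md file into its YAML front matter and document text."""
    raw = Path(md_path).read_text(encoding="utf-8")
    _, front, body = raw.split("---", 2)          # text before the first '---' is empty
    metadata = yaml.safe_load(front)              # e.g. {'primary_language': 'en', 'is_table': False, ...}
    pdf_path = Path(md_path).with_suffix(".pdf")  # matching single-page PDF next to the .md
    return pdf_path, metadata, body.strip()


pdf, meta, text = load_training_pair("data/document1.md")
print(pdf, meta["primary_language"], len(text))
```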

The easiest way to grab a lot of files in this format is to use `prepare_olmocrmix.py`, which will automatically download and prepare [olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) for your environment.

```bash
# Caution, requires ~200GB of disk space
python olmocr/train/prepare_olmocrmix.py --subset 01_books --split eval_iabooks --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 01_books --split train_iabooks --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 00_documents --split eval_s2pdf --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 00_documents --split train_s2pdf --destination ~/olmOCR-mix-0225/
```

### Set up your config

[olmOCR-7B-0725-FP8](https://huggingface.co/allenai/olmOCR-7B-0725-FP8) was trained with [qwen25_vl_olmocrv2_2epoch.yaml](/olmocr/train/configs/qwen25_vl_olmocrv2_2epoch.yaml).

This config is set up to train on a single B200 GPU, and training will take around 48 hours (~$300 if renting).
A single-epoch run takes half the time and loses only ~1 point on olmOCR-bench.

Keep in mind that this trains on ~250,000 pages per epoch, so it is quite a big endeavour. We hope to add more options that make further fine-tuning of your own small model simpler.

### Launch training

```bash
python -m olmocr.train.train --config olmocr/train/configs/qwen25_vl_olmocrv2_2epoch.yaml
```

### Prepare Checkpoints and Quantize

After training is done, you will need to call `prepare_olmocr_checkpoint.py` to take the saved checkpoints and get them ready for use with vLLM.

```bash
python -m olmocr.train.prepare_olmocr_checkpoint [source dir]/checkpoint-7648 [destination]
```

Finally, we recommend an FP8 quantization step: its performance is solidly within the error bars of the raw bfloat16 model, but it uses less memory and runs inference around 12% faster.

```bash
python -m olmocr.train.compress_checkpoint --config olmocr/train/quantization_configs/qwen2_5vl_w8a8_fp8.yaml [destination] [destination-FP8]
```

### Notes for AI2

If you are a collaborator at AI2, you can use the following scripts to run training and inference:

```bash
# Run training using Beaker
scripts/train/newtrainer-beaker.sh --config [config file]

# Prepare checkpoint from an interactive session with WEKA
python -m olmocr.train.prepare_olmocr_checkpoint [source] [destination]

# Compress the prepared model checkpoint to FP8
scripts/train/compress_model.sh <recipe_path> <input_model_path> <output_model_path> [--calibration-pdfs PATTERN]

# Run olmOCR bench
scripts/run_benchmark.sh --model [destination]
```
