Commit 4a70b2e

Docs and parameter groups

1 parent fc983ca commit 4a70b2e

3 files changed (+148, -35 lines)

README.md

Lines changed: 20 additions & 17 deletions
@@ -256,11 +256,9 @@ python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
 
 ```bash
 python -m olmocr.pipeline --help
-usage: pipeline.py [-h] [--pdfs [PDFS ...]] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
-                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--markdown] [--model MODEL] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
-                   [--max_model_len MAX_MODEL_LEN] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
-                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS]
-                   [--beaker_priority BEAKER_PRIORITY] [--port PORT] [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--data-parallel-size DATA_PARALLEL_SIZE]
+usage: pipeline.py [-h] [--pdfs [PDFS ...]] [--model MODEL] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS]
+                   [--apply_filter] [--stats] [--markdown] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--guided_decoding] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max_model_len MAX_MODEL_LEN]
+                   [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--data-parallel-size DATA_PARALLEL_SIZE] [--port PORT] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
                    workspace
 
 Manager for running millions of PDFs through a batch inference pipeline
@@ -270,7 +268,8 @@ positional arguments:
 
 options:
   -h, --help            show this help message and exit
-  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --pdfs [PDFS ...]     Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --model MODEL         Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.
   --workspace_profile WORKSPACE_PROFILE
                         S3 configuration profile for accessing the workspace
   --pdf_profile PDF_PROFILE
@@ -285,20 +284,24 @@ options:
   --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
   --stats               Instead of running any job, reports some statistics about the current workspace
   --markdown            Also write natural text to markdown files preserving the folder structure of the input pdfs
-  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
-                        one which is fastest to access
-  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
-                        Fraction of VRAM vLLM may pre-allocate for KV-cache (passed through to vllm serve).
-  --max_model_len MAX_MODEL_LEN
-                        Upper bound (tokens) vLLM will allocate KV-cache for; passed through to vllm serve as --max-model-len.
-  --model_max_context MODEL_MAX_CONTEXT
-                        Maximum context length that the model was fine tuned under
-  --model_chat_template MODEL_CHAT_TEMPLATE
-                        Chat template to pass to sglang server
   --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                         Dimension on longest side to use for rendering the pdf pages
   --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
-                        Maximum amount of anchor text to use (characters)
+                        Maximum amount of anchor text to use (characters), not used for new models
+  --guided_decoding     Enable guided decoding for model YAML type outputs
+
+VLLM Forwarded arguments:
+  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
+                        Fraction of VRAM vLLM may pre-allocate for KV-cache (passed through to vllm serve).
+  --max_model_len MAX_MODEL_LEN
+                        Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start
+  --tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE
+                        Tensor parallel size for vLLM
+  --data-parallel-size DATA_PARALLEL_SIZE, -dp DATA_PARALLEL_SIZE
+                        Data parallel size for vLLM
+  --port PORT           Port to use for the VLLM server
+
+beaker/cluster execution:
   --beaker              Submit this job to beaker instead of running locally
   --beaker_workspace BEAKER_WORKSPACE
                         Beaker workspace to submit to

olmocr/pipeline.py

Lines changed: 22 additions & 18 deletions
@@ -1009,6 +1009,14 @@ async def main():
         help="Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths",
         default=None,
     )
+    parser.add_argument(
+        "--model",
+        help="Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.",
+        default="allenai/olmOCR-7B-0725-FP8",
+    )
+
+
+    # More detailed config options, usually you shouldn't have to change these
     parser.add_argument("--workspace_profile", help="S3 configuration profile for accessing the workspace", default=None)
     parser.add_argument("--pdf_profile", help="S3 configuration profile for accessing the raw pdf documents", default=None)
     parser.add_argument("--pages_per_group", type=int, default=500, help="Aiming for this many pdf pages per work item group")
@@ -1019,33 +1027,29 @@ async def main():
     parser.add_argument("--stats", action="store_true", help="Instead of running any job, reports some statistics about the current workspace")
     parser.add_argument("--markdown", action="store_true", help="Also write natural text to markdown files preserving the folder structure of the input pdfs")

-    # Model parameters
-    parser.add_argument(
-        "--model",
-        help="List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access",
-        default="allenai/olmOCR-7B-0725-FP8",
-    )
-
-    parser.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache " "(passed through to vllm serve).")
-    parser.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
-
     parser.add_argument("--target_longest_image_dim", type=int, help="Dimension on longest side to use for rendering the pdf pages", default=1288)
     parser.add_argument("--target_anchor_text_len", type=int, help="Maximum amount of anchor text to use (characters), not used for new models", default=-1)
     parser.add_argument("--guided_decoding", action="store_true", help="Enable guided decoding for model YAML type outputs")

+    vllm_group = parser.add_argument_group("VLLM Forwarded arguments")
+    vllm_group.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache " "(passed through to vllm serve).")
+    vllm_group.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
+    vllm_group.add_argument("--tensor-parallel-size", "-tp", type=int, default=1, help="Tensor parallel size for vLLM")
+    vllm_group.add_argument("--data-parallel-size", "-dp", type=int, default=1, help="Data parallel size for vLLM")
+    vllm_group.add_argument("--port", type=int, default=30024, help="Port to use for the VLLM server")
+
     # Beaker/job running stuff
-    parser.add_argument("--beaker", action="store_true", help="Submit this job to beaker instead of running locally")
-    parser.add_argument("--beaker_workspace", help="Beaker workspace to submit to", default="ai2/olmocr")
-    parser.add_argument(
+    beaker_group = parser.add_argument_group("beaker/cluster execution")
+    beaker_group.add_argument("--beaker", action="store_true", help="Submit this job to beaker instead of running locally")
+    beaker_group.add_argument("--beaker_workspace", help="Beaker workspace to submit to", default="ai2/olmocr")
+    beaker_group.add_argument(
         "--beaker_cluster",
         help="Beaker clusters you want to run on",
         default=["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", "ai2/neptune-cirrascale", "ai2/saturn-cirrascale", "ai2/augusta-google-1"],
     )
-    parser.add_argument("--beaker_gpus", type=int, default=1, help="Number of gpu replicas to run")
-    parser.add_argument("--beaker_priority", type=str, default="normal", help="Beaker priority level for the job")
-    parser.add_argument("--port", type=int, default=30024, help="Port to use for the VLLM server")
-    parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1, help="Tensor parallel size for vLLM")
-    parser.add_argument("--data-parallel-size", "-dp", type=int, default=1, help="Data parallel size for vLLM")
+    beaker_group.add_argument("--beaker_gpus", type=int, default=1, help="Number of gpu replicas to run")
+    beaker_group.add_argument("--beaker_priority", type=str, default="normal", help="Beaker priority level for the job")
+
     args = parser.parse_args()

     logger.info(
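
For readers less familiar with `argparse`, here is a minimal, self-contained sketch of the grouping pattern this commit introduces. The flags and values below are illustrative only, not olmocr's full CLI: options registered on an argument group still parse into the same flat namespace, and the group only changes how `--help` output is laid out.

```python
import argparse

# Minimal sketch of argparse argument groups (hypothetical flags, not the full olmocr CLI).
parser = argparse.ArgumentParser(description="Demo of grouped CLI options")
parser.add_argument("--markdown", action="store_true", help="A plain top-level option")

# Grouped options: grouping only affects the --help layout, not parsing.
vllm_group = parser.add_argument_group("VLLM Forwarded arguments")
vllm_group.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM to pre-allocate")
vllm_group.add_argument("--port", type=int, default=30024, help="Port for the inference server")

beaker_group = parser.add_argument_group("beaker/cluster execution")
beaker_group.add_argument("--beaker", action="store_true", help="Submit to a cluster instead of running locally")

# All options land in the same flat namespace regardless of group.
args = parser.parse_args(["--gpu-memory-utilization", "0.8"])
print(args.gpu_memory_utilization, args.port, args.beaker)  # 0.8 30024 False
```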

olmocr/train/README.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# olmOCR Training Guide

This guide provides comprehensive instructions for training olmOCR models, including what you need to reproduce https://huggingface.co/allenai/olmOCR-7B-0725-FP8 on your own hardware.

## Environment setup

The first step is to set up your python/conda environment the same way as for running olmocr.

Then install the extra training requirements:

```bash
pip install -r gantry-train-requirements.txt
pip install transformers==4.52.4
pip install flash-attn==2.8.0.post2 --no-build-isolation
```

### Dataset Format

The training data should be organized as pairs of PDF files and their corresponding markdown annotations:

**Important: Each PDF needs to be a single page only!**

```
data/
├── document1.pdf
├── document1.md
├── document2.pdf
├── document2.md
└── ...
```

Each markdown file should contain:
1. YAML front matter with metadata
2. The extracted text content

Example markdown format:
```markdown
---
primary_language: en
is_rotation_valid: True
rotation_correction: 0
is_table: False
is_diagram: False
---
Document text goes here...
```
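
As a quick sanity check on files in this format, the front matter can be split from the body with a few lines of Python. This is only an illustrative sketch, not part of the olmOCR training code; it assumes PyYAML is installed and that each `.md` file sits next to its matching single-page `.pdf`:

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be installed


def load_training_pair(md_path: str):
    """Split a training .md file into its YAML front matter and document text."""
    raw = Path(md_path).read_text(encoding="utf-8")
    _, front, body = raw.split("---", 2)          # text before the first '---' is empty
    metadata = yaml.safe_load(front)              # e.g. {'primary_language': 'en', 'is_table': False, ...}
    pdf_path = Path(md_path).with_suffix(".pdf")  # matching single-page PDF next to the .md
    return pdf_path, metadata, body.strip()


pdf, meta, text = load_training_pair("data/document1.md")
print(pdf, meta["primary_language"], len(text))
```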

The easiest way to grab a lot of files in this format is to use `prepare_olmocrmix.py`, which will automatically download and prepare [olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) for your environment.

```bash
# Caution, requires ~200GB of disk space
python olmocr/train/prepare_olmocrmix.py --subset 01_books --split eval_iabooks --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 01_books --split train_iabooks --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 00_documents --split eval_s2pdf --destination ~/olmOCR-mix-0225/
python olmocr/train/prepare_olmocrmix.py --subset 00_documents --split train_s2pdf --destination ~/olmOCR-mix-0225/
```

### Set up your config

[olmOCR-7B-0725-FP8](https://huggingface.co/allenai/olmOCR-7B-0725-FP8) was trained with [qwen25_vl_olmocrv2_2epoch.yaml](/olmocr/train/configs/qwen25_vl_olmocrv2_2epoch.yaml).

This config is set up to train on a single B200 GPU, and training will take around 48 hours (~$300 if renting).
A single-epoch run takes half the time and loses only ~1 point on olmOCR-bench.

Keep in mind that this trains on ~250,000 pages per epoch, so it is quite a big endeavour. We hope to add more options that make further fine-tuning of your own small model simpler.

### Launch training

```bash
python -m olmocr.train.train --config olmocr/train/configs/qwen25_vl_olmocrv2_2epoch.yaml
```

### Prepare Checkpoints and Quantize

After training is done, you will need to call `prepare_olmocr_checkpoint.py` to take the saved checkpoints and get them ready for use with vLLM.

```bash
python -m olmocr.train.prepare_olmocr_checkpoint [source dir]/checkpoint-7648 [destination]
```

Finally, we recommend an FP8 quantization step: its performance is solidly within the error bars of the raw bfloat16 model, but it uses less memory and runs inference around 12% faster.

```bash
python -m olmocr.train.compress_checkpoint --config olmocr/train/quantization_configs/qwen2_5vl_w8a8_fp8.yaml [destination] [destination-FP8]
```

### Notes for AI2

If you are a collaborator at AI2, you can use the following scripts to run training and inference:

```bash
# Run training using Beaker
scripts/train/newtrainer-beaker.sh --config [config file]

# Prepare checkpoint from an interactive session with WEKA
python -m olmocr.train.prepare_olmocr_checkpoint [source] [destination]

# Compress the prepared model checkpoint to FP8
scripts/train/compress_model.sh <recipe_path> <input_model_path> <output_model_path> [--calibration-pdfs PATTERN]

# Run olmOCR bench
scripts/run_benchmark.sh --model [destination]
```
