Manager for running millions of PDFs through a batch inference pipeline
@@ -270,7 +268,8 @@ positional arguments:
 options:
   -h, --help            show this help message and exit
-  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --pdfs [PDFS ...]     Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --model MODEL         Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.
   --workspace_profile WORKSPACE_PROFILE
                         S3 configuration profile for accessing the workspace
   --pdf_profile PDF_PROFILE
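As context for the usage change above (a minimal sketch, not olmocr's actual source): the help string moving from `--pdfs PDFS` to `--pdfs [PDFS ...]` is exactly what argparse renders when an option gains `nargs="*"`, so multiple paths or globs can be passed in one invocation:

```python
import argparse

# Sketch only: a flag declared with nargs="*" collects zero or more values
# into a list, and argparse renders its usage as "--pdfs [PDFS ...]".
parser = argparse.ArgumentParser()
parser.add_argument(
    "--pdfs",
    nargs="*",
    default=None,
    help="Path to add pdfs stored in s3 to the workspace, can be a glob path "
    "s3://bucket/prefix/*.pdf or path to file containing list of pdf paths",
)

args = parser.parse_args(["--pdfs", "s3://bucket/a/*.pdf", "pdf_list.txt"])
print(args.pdfs)  # ['s3://bucket/a/*.pdf', 'pdf_list.txt']
```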
@@ -285,20 +284,24 @@ options:
   --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
   --stats               Instead of running any job, reports some statistics about the current workspace
   --markdown            Also write natural text to markdown files preserving the folder structure of the input pdfs
-  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
-                        one which is fastest to access
-  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
-                        Fraction of VRAM vLLM may pre-allocate for KV-cache (passed through to vllm serve).
-  --max_model_len MAX_MODEL_LEN
-                        Upper bound (tokens) vLLM will allocate KV-cache for; passed through to vllm serve as --max-model-len.
-  --model_max_context MODEL_MAX_CONTEXT
-                        Maximum context length that the model was fine tuned under
olmocr/pipeline.py (22 additions, 18 deletions)
@@ -1009,6 +1009,14 @@ async def main():
         help="Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths",
         default=None,
     )
+    parser.add_argument(
+        "--model",
+        help="Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.",
+        default="allenai/olmOCR-7B-0725-FP8",
+    )
+
     # More detailed config options, usually you shouldn't have to change these
     parser.add_argument("--workspace_profile", help="S3 configuration profile for accessing the workspace", default=None)
     parser.add_argument("--pdf_profile", help="S3 configuration profile for accessing the raw pdf documents", default=None)
     parser.add_argument("--pages_per_group", type=int, default=500, help="Aiming for this many pdf pages per work item group")
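The hunk above moves `--model` into the main option group with a default checkpoint. A minimal standalone sketch (mirroring the added argument, not the full pipeline) of how the default resolves when the flag is omitted or overridden:

```python
import argparse

# Sketch mirroring the added --model argument: parse_args falls back to the
# declared default checkpoint when the flag is not given on the command line.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--model",
    help="Path where the model is located, allenai/olmOCR-7B-0725-FP8 is the default, can be local, s3, or hugging face.",
    default="allenai/olmOCR-7B-0725-FP8",
)

default_model = parser.parse_args([]).model
override_model = parser.parse_args(["--model", "/local/ckpt"]).model
print(default_model)   # allenai/olmOCR-7B-0725-FP8
print(override_model)  # /local/ckpt
```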
@@ -1019,33 +1027,29 @@ async def main():
     parser.add_argument("--stats", action="store_true", help="Instead of running any job, reports some statistics about the current workspace")
     parser.add_argument("--markdown", action="store_true", help="Also write natural text to markdown files preserving the folder structure of the input pdfs")
 
-    # Model parameters
-    parser.add_argument(
-        "--model",
-        help="List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access",
-        default="allenai/olmOCR-7B-0725-FP8",
-    )
-
-    parser.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache ""(passed through to vllm serve).")
-    parser.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
-
     parser.add_argument("--target_longest_image_dim", type=int, help="Dimension on longest side to use for rendering the pdf pages", default=1288)
     parser.add_argument("--target_anchor_text_len", type=int, help="Maximum amount of anchor text to use (characters), not used for new models", default=-1)
     parser.add_argument("--guided_decoding", action="store_true", help="Enable guided decoding for model YAML type outputs")
+    vllm_group.add_argument("--gpu-memory-utilization", type=float, help="Fraction of VRAM vLLM may pre-allocate for KV-cache ""(passed through to vllm serve).")
+    vllm_group.add_argument("--max_model_len", type=int, default=16384, help="Upper bound (tokens) vLLM will allocate KV-cache for, lower if VLLM won't start")
+    vllm_group.add_argument("--tensor-parallel-size", "-tp", type=int, default=1, help="Tensor parallel size for vLLM")
+    vllm_group.add_argument("--data-parallel-size", "-dp", type=int, default=1, help="Data parallel size for vLLM")
+    vllm_group.add_argument("--port", type=int, default=30024, help="Port to use for the VLLM server")
 
     # Beaker/job running stuff
-    parser.add_argument("--beaker", action="store_true", help="Submit this job to beaker instead of running locally")
-    parser.add_argument("--beaker_workspace", help="Beaker workspace to submit to", default="ai2/olmocr")
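The added lines call `vllm_group.add_argument`, so a `vllm_group` must be created elsewhere in `main()` (not visible in this hunk). A minimal sketch of how such a group is typically built with `argparse.add_argument_group` (the group title and description here are assumptions, not the PR's actual code):

```python
import argparse

# Sketch: grouping the vLLM pass-through flags under an argument group so
# they render together in --help. Group title/description are assumed.
parser = argparse.ArgumentParser()
vllm_group = parser.add_argument_group("vllm", "Options passed through to vllm serve")
vllm_group.add_argument("--gpu-memory-utilization", type=float)
vllm_group.add_argument("--max_model_len", type=int, default=16384)
vllm_group.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
vllm_group.add_argument("--data-parallel-size", "-dp", type=int, default=1)
vllm_group.add_argument("--port", type=int, default=30024)

# argparse converts dashes in the first long option name to underscores
# when deriving the attribute name (tensor_parallel_size, etc.).
args = parser.parse_args(["-tp", "2", "--port", "30025"])
print(args.tensor_parallel_size, args.port)  # 2 30025
```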
This guide provides comprehensive instructions for training olmOCR models, including what you need to reproduce https://huggingface.co/allenai/olmOCR-7B-0725-FP8 on your own hardware.

## Environment setup

The first step is to set up your python/conda environment the same way as for running olmocr.

[olmOCR-7B-0725-FP8](https://huggingface.co/allenai/olmOCR-7B-0725-FP8) was trained with [qwen25_vl_olmocrv2_2epoch.yaml](/olmcr/train/configs/qwen25_vl_olmocrv2_2epoch.yaml).

This config is set up to train on a single B200 GPU, and training will take around 48 hours (~$300 if renting). Single-epoch runs take half the time and lose only about 1 point on olmOCR-bench.

But this means training on ~250,000 pages per epoch, so it's quite a big endeavour. We hope to add more options that make further finetuning your own small model simpler.
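A quick back-of-envelope check of the figures above (48 hours, ~$300, 2 epochs of ~250,000 pages; the derived per-hour and per-page rates are arithmetic, not quoted from the docs):

```python
# Figures from the text; derived rates are back-of-envelope estimates.
total_hours = 48
total_cost = 300        # approximate rental cost in USD
epochs = 2
pages_per_epoch = 250_000

cost_per_hour = total_cost / total_hours                          # ~$6.25/h
single_epoch_cost = total_cost / epochs                           # half the time -> ~$150
cost_per_1k_pages = total_cost / (epochs * pages_per_epoch) * 1000

print(round(cost_per_hour, 2), round(single_epoch_cost), round(cost_per_1k_pages, 2))
# 6.25 150 0.6
```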