Description
By default, ollama will use the num_ctx set in the modelfile parameters, or fall back to a low value between 1k and 8k. I think the default depends on how ollama is used (CLI vs API). In a chat, I can change the context window with /set parameter num_ctx 131072 to get the full context of llama3.2, at the cost of much higher memory usage.
In the API, the options object can take a num_ctx (https://ollama.readthedocs.io/en/api/#request_7).
For some tasks, we want a much higher context window.
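For example, a one-off generate request with a larger window might look like this (assuming a local server on the default port 11434; the prompt is just a placeholder):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this very long document ...",
  "options": { "num_ctx": 131072 }
}'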
Workaround
The currently available option is to create a new model with the desired num_ctx baked in, either via a Modelfile or by running /set parameter num_ctx 20000 followed by /save llama3.2-20k_ctx in a chat.
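For the Modelfile route, a minimal sketch (the model and file names are just examples):
FROM llama3.2
PARAMETER num_ctx 20000
Then build it with:
ollama create llama3.2-20k_ctx -f Modelfile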
Or set the global default when starting ollama, with the environment variable OLLAMA_CONTEXT_LENGTH=20000.
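For example, to make 20k the server-wide default:
OLLAMA_CONTEXT_LENGTH=20000 ollama serve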
ollama logs (here with num_ctx set to 131072):
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
Details and notes
Enable debug logging:
OLLAMA_DEBUG=1 ollama serve
Ollama will log during model loading; pay attention to runner.num_ctx=8192:
time=2025-06-17T15:46:27.259+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:latest runner.inference=metal runner.devices=1 runner.size="3.3 GiB" runner.vram="3.3 GiB" runner.parallel=2 runner.pid=36317 runner.model=/Users/kristian/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=8192
llama_context: constructing llama_context
llama_context: n_seq_max = 2
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 1024
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
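The n_ctx_train value (131072) is the model's trained maximum context; if I remember correctly, ollama show also prints it in the model info:
ollama show llama3.2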
If I do /set parameter num_ctx 5 in an ollama run llama3.2:latest chat, I get a stupid assistant and this log (note runner.num_ctx=10 below: it looks like the requested value is multiplied by runner.parallel=2, and n_ctx_per_seq = 5 is what each sequence actually gets):
time=2025-06-17T15:50:28.050+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:latest runner.inference=metal runner.devices=1 runner.size="2.8 GiB" runner.vram="2.8 GiB" runner.parallel=2 runner.pid=37551 runner.model=/Users/kristian/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=10
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max = 2
llama_context: n_ctx = 10
llama_context: n_ctx_per_seq = 5
llama_context: n_batch = 64
llama_context: n_ubatch = 64
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (5) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Also lots of warnings as the tiny context keeps overflowing and ollama shifts out old tokens:
time=2025-06-17T15:50:29.721+02:00 level=DEBUG source=cache.go:240 msg="context limit hit - shifting" id=0 limit=5 input=5 keep=4 discard=1
See also ollama/ollama#2714