
[Suggestion] Ability to send num_ctx parameter to Ollama to change Context Window size #295

@krissrex

Description


By default, ollama will use the num_ctx set in the Modelfile parameters, or fall back to a low value between 1k and 8k. I think the default depends on how ollama is used (CLI vs API).

In a chat, I can change the context window with /set parameter num_ctx 131072, which uses much more memory but gives llama3.2 its full context.
In the API, the options object can take a num_ctx field: https://ollama.readthedocs.io/en/api/#request_7
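For reference, sending num_ctx through the API looks roughly like this (a sketch based on the linked docs; the host, model, and prompt are just example values):

# POST to the generate endpoint; num_ctx goes in the options object
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 131072 }
}'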

For some tasks, we want a much higher context window.

Workaround

The currently available workaround is to create a new model with num_ctx baked in, either by writing a new Modelfile or by running /set parameter num_ctx 20000 followed by /save llama3.2-20k_ctx in a chat (see the sketch below).
Alternatively, set the global default when starting ollama with the environment variable OLLAMA_CONTEXT_LENGTH=20000.
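The Modelfile variant of that workaround would look something like this (a sketch; the model name and context size are just the examples from above):

# Modelfile that bakes in a larger context window
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 20000
EOF

# create the derived model, then use it as usual
ollama create llama3.2-20k_ctx -f Modelfile
ollama run llama3.2-20k_ctx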

ollama logs:

llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1

Details and notes

Enable debug logging

OLLAMA_DEBUG=1 ollama serve

Ollama will log during model loading; pay attention to runner.num_ctx=8192:

time=2025-06-17T15:46:27.259+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:latest runner.inference=metal runner.devices=1 runner.size="3.3 GiB" runner.vram="3.3 GiB" runner.parallel=2 runner.pid=36317 runner.model=/Users/kristian/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=8192
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

If I do /set parameter num_ctx 5 in an ollama run llama3.2:latest chat, I get a stupid assistant and this log:

time=2025-06-17T15:50:28.050+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/llama3.2:latest runner.inference=metal runner.devices=1 runner.size="2.8 GiB" runner.vram="2.8 GiB" runner.parallel=2 runner.pid=37551 runner.model=/Users/kristian/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff runner.num_ctx=10
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 10
llama_context: n_ctx_per_seq = 5
llama_context: n_batch       = 64
llama_context: n_ubatch      = 64
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (5) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

Also lots of warnings:

time=2025-06-17T15:50:29.721+02:00 level=DEBUG source=cache.go:240 msg="context limit hit - shifting" id=0 limit=5 input=5 keep=4 discard=1
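Putting the pieces together, the experiment above can be reproduced roughly like this (two terminals, same commands as described):

# terminal 1: start the server with debug logging
OLLAMA_DEBUG=1 ollama serve

# terminal 2: open a chat and shrink the context window
ollama run llama3.2:latest
>>> /set parameter num_ctx 5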

See also ollama/ollama#2714
