Allow adjusting context window sizes for Ollama dynamically #335

Draft
ptitjes wants to merge 3 commits into develop from feature/ollama-request-customization

Conversation

ptitjes
Contributor

@ptitjes ptitjes commented Jun 25, 2025

This PR adds the ability to customize Ollama chat requests, as discussed in #295.

@aozherelyeva I had to make two modifications to the OllamaTestFixture, and I hope that's OK:

  • Added a check to pull the llama3.2 model in the container, because the image suggested in TESTING.md doesn't come with the llama3.2 model by default. Now it works without touching anything (provided you have Docker, obviously).
  • Exposed the baseUrl to allow some tests to initialize a custom OllamaClient if required.

Type of the change

  • New feature
  • Bug fix
  • Documentation fix

Checklist for all pull requests

  • The pull request has a description of the proposed change
  • I read the Contributing Guidelines before opening the pull request
  • The pull request uses develop as the base branch
  • Tests for the changes have been added
  • All new and existing tests passed

Additional steps for pull requests adding a new feature

  • An issue describing the proposed change exists
  • The pull request includes a link to the issue
  • The change was discussed and approved in the issue
  • Docs have been added / updated

@ptitjes ptitjes changed the title Feature/ollama request customization Ollama request customization (#295) Jun 25, 2025
@ptitjes ptitjes changed the title Ollama request customization (#295) Ollama request customization Jun 25, 2025
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 252f749 to 6334bcf Compare June 25, 2025 18:35
@ptitjes
Contributor Author

ptitjes commented Jun 25, 2025

And I had forgotten to add KDocs. That's now fixed.

Comment on lines 79 to 70
private val requestBuilderAction: (OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit)? = null,
private val clock: Clock = Clock.System,
Collaborator

Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {} ?

also why do you need the prompt here as a parameter?

Collaborator

Also if you make this lambda the last parameter -- you would be able to write

val client = OllamaClient(...) {
   seed = 0
   numCtx = 100
}

Also, could you please clarify once again -- why do you need to configure it with a lambda and not just pass a data object with this parameter, ex:

val client = OllamaClient(..., requestOptions = OllamaRequestOptions(seed = 0, numCtx = 100))

Contributor Author

@ptitjes ptitjes Jun 26, 2025

Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {} ?

You're right, making it nullable doesn't bring any real optimization.

Also if you make this lambda the last parameter

I was actually trying to avoid putting it as the last parameter, because these are sensitive parameters and I wanted the user to consciously decide to implement the lambda, not add a trailing lambda just because IDEA suggests it.

also why do you need the prompt here as a parameter?
could you please clarify once again -- why do you need to configure it with a lambda and not just pass a data object with this parameter

So, Ollama needs to allocate some VRAM to handle your context. num_ctx tells Ollama that you want it to be able to take num_ctx tokens into account when generating the next token. If you do not set num_ctx, it will use the constant value configured for the server (OLLAMA_CONTEXT_LENGTH, which is 4096 by default). Any tokens that don't fit in this context window will be arbitrarily truncated!
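
For reference, this is what setting num_ctx looks like at the level of the raw Ollama REST API (a minimal Ktor sketch of a standard /api/chat call; the model name and message are just example data, and this is not Koog code):

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

// Minimal sketch: a raw Ollama chat request that asks for an 8192-token context window.
// The /api/chat endpoint and the options.num_ctx field are part of Ollama's public API.
fun main() = runBlocking {
    val client = HttpClient(CIO)
    val response = client.post("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody(
            """
            {
              "model": "llama3.2",
              "messages": [{"role": "user", "content": "Hello!"}],
              "stream": false,
              "options": { "num_ctx": 8192 }
            }
            """.trimIndent()
        )
    }
    println(response.bodyAsText())
    client.close()
}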

When working with local models, you want the user to be able to tune this finely, because you might be in a constrained environment. Also, you cannot ask the server admin to set a big value for OLLAMA_CONTEXT_LENGTH, because most requests don't need a big context.

The size of the context window needs to be decided at each request if there is a big context in the prompt. If the user can estimate (either by looking at lastTokenUsage or by using a tokenizer) that the prompt contains more than 4096 tokens, then they need to allocate more context with num_ctx.

Proprietary LLMs behind APIs do the same. At some point, they have to decide how much VRAM they allocate for your request (or rather, on what node with preallocated models they will forward your request). You just don't see it.

Unfortunately, with Ollama, it is the responsibility of the caller to define the allocated context size (within the limits of the actual maximum context length of the model, obviously).

When handling long-running conversations or doing RAG with lots of augmented data, a typical implementation would look something like this:

  requestBuilderAction = { prompt, model ->
    val tokenCount = prompt.lastTokenUsage
    val maxContextSize = model.contextSize // <-- custom extension property here
    // Leave ~500 tokens of headroom, round up to the next multiple of 1024,
    // keep at least 2048, and never exceed what the model supports.
    numCtx = minOf(maxOf(2048, ((tokenCount + 500) / 1024 + 1) * 1024), maxContextSize)
  }

BTW, it seems nothing has been done yet on the AI Assistant side: https://youtrack.jetbrains.com/issue/LLM-13677/Lift-context-size-restrictions-for-local-Ollama-model-or-make-it-configurable

Contributor Author

Ideally, the numCtx would be an LLMParam in the prompt. But there is currently no way to have provider-specific LLMParams.


The size of the context window needs to be decided at each request, if there is a big context in the prompt.

I think estimating the size based on the prompt may be tricky. You can't take into account the future output of the LLM itself. It would be sufficient for me to just set num_ctx to the maximum possible. But perhaps there are scenarios where this dynamic num_ctx is useful.

I assume this requestBuilderAction also runs before tool call results? So the LLM can upsize in case of a big tool result.


Yes, it should be sufficient for my case of using a static context value. Anyone with specific needs could heuristically add some more overhead for tools and templates. LGTM 👍

Collaborator

Hi @ptitjes! I actually looked through this conversation again, and it looks like adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)

Collaborator

Also, it would be great to reuse the Tokenizer here for prompt estimation.

Collaborator

Other than that, it might be a great idea to make the main KDoc above OllamaClient much longer and explain all these details, and even provide a few code samples showing how to use this new configuration lambda (feel free to check the KDoc of AgentMemory.Feature or EventHandler.Feature, for instance -- we already have quite long explanations in KDocs :)). It's also nice for LLMs to learn these things once they read them :)

Contributor Author

@ptitjes ptitjes Jul 13, 2025

adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)

make the main KDoc above OllamaClient much longer and explain all these details

Yeah, definitely.

@Ololoshechkin Ololoshechkin requested a review from devcrocod June 25, 2025 23:02
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 421dba4 to 028a335 Compare June 27, 2025 12:01
@ptitjes
Contributor Author

ptitjes commented Jun 27, 2025

Rebased on develop and applied requested changes:

  • made requestBuilderAction the last (non-nullable) constructor parameter
  • checked that the response is the same when seed=0

@ptitjes
Contributor Author

ptitjes commented Jul 1, 2025

I'll have to rebase because the llama3.2 pull of 1012015 was merged in #371.

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 028a335 to 32f2dea Compare July 2, 2025 16:58
@ptitjes
Contributor Author

ptitjes commented Jul 2, 2025

Rebased on develop and removed the commit which @aozherelyeva also implemented in develop.

@ptitjes
Contributor Author

ptitjes commented Jul 4, 2025

@Ololoshechkin @devcrocod @Rizzen I don't know if any of you have had time to take a second look at this? I was hoping we could maybe get this into the 0.3 release.

Contributor

@devcrocod devcrocod left a comment

I’m worried that Ollama’s implementation is drifting further and further from the others.

Also, I think it would be better to add additional functions or constructors for creating the client, so that I can store the request parameters in a variable and pass them that way. It's more convenient when writing custom wrapper functions on top of Koog.

@@ -71,6 +71,9 @@ internal data class OllamaChatRequestDTO(
@Serializable
internal data class Options(
val temperature: Double? = null,
val seed: Int? = null,
@SerialName("num_ctx") val numCtx: Int? = null,
Contributor

Doesn't the Ollama JSON config assume snake_case throughout, so that we could use a naming strategy and not have to use @SerialName?

Contributor Author

@ptitjes ptitjes Jul 4, 2025

Well, there are @SerialName(...) annotations everywhere in the Ollama DTO definitions.

To be honest, I didn't even know about the kotlinx-serialization namingStrategy option. So thanks for making me discover that.

However, it seems that using global naming strategies may not be the best idea. To quote JsonNamingStrategy's KDocs:

Controversy about using global naming strategies

Global naming strategies have one key trait that makes them a debatable and controversial topic: They are very implicit. It means that by looking only at the definition of the class, it is impossible to say which names it will have in the serialized form. As a consequence, naming strategies are not friendly to refactorings. Programmer renaming myId to userId may forget to rename my_id, and vice versa. Generally, any tools one can imagine work poorly with global naming strategies: Find Usages/Rename in IDE, full-text search by grep, etc. For them, the original name and the transformed are two different things; changing one without the other may introduce bugs in many unexpected ways. The lack of a single place of definition, the inability to use automated tools, and more error-prone code lead to greater maintenance efforts for code with global naming strategies. However, there are cases where usage of naming strategies is inevitable, such as interop with an existing API or migrating a large codebase. Therefore, one should carefully weigh the pros and cons before considering adding global naming strategies to an application.
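
For completeness, here is what the two approaches boil down to -- explicit @SerialName versus a global naming strategy (a standalone illustration, not code from this PR):

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonNamingStrategy

// Explicit per-property names, as the Ollama DTOs currently do:
@Serializable
data class OptionsExplicit(@SerialName("num_ctx") val numCtx: Int? = null)

// Global naming strategy; the snake_case mapping is implicit:
@Serializable
data class OptionsImplicit(val numCtx: Int? = null)

@OptIn(ExperimentalSerializationApi::class)
val snakeCaseJson = Json { namingStrategy = JsonNamingStrategy.SnakeCase }

fun main() {
    // Both print {"num_ctx":8192}, but only the first spells the wire name out in the source.
    println(Json.encodeToString(OptionsExplicit.serializer(), OptionsExplicit(numCtx = 8192)))
    println(snakeCaseJson.encodeToString(OptionsImplicit.serializer(), OptionsImplicit(numCtx = 8192)))
}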

Contributor

Ideally, all of this should be caught by tests, so I don't see a big issue with it. Besides, we already use namingStrategy in other clients. So far, the only real advantage is a slight reduction in code.

I don't think it's critical, since we can remove it or add it to the other clients at any time if needed.

@ptitjes
Contributor Author

ptitjes commented Jul 4, 2025

I’m worried that Ollama’s implementation is drifting further and further from the others.

I understand, but isn't that the goal of having different LLMClient implementations? I.e. to accommodate the differences and take advantage of the specificities of each LLM server? Granted, in an ideal world we would have a single LLMClient implementation.

Ollama is really different in behaviour from proprietary models behind APIs. Either we embrace it, or we will always have subpar support for Ollama.

I have faith that in the future small LLMs with great performance will become more and more common, and that we won't need proprietary models from private corporations as much to run quality AI agents. That is good for people and good for the climate. IMO Koog has to support these use cases.

Also, I think it would be better to add additional functions or constructors for creating the client, so that I can store the request parameters in a variable and pass them that way. It's more convenient when writing custom wrapper functions on top of Koog.

I am really sorry, but I don't understand at all what you mean here. Would you mind elaborating, please?

@devcrocod
Contributor

devcrocod commented Jul 8, 2025

I understand, but isn't that the goal of having different LLMClient implementations? I.e. to accommodate the differences and take advantage of the specificities of each LLM server? Granted, in an ideal world we would have a single LLMClient implementation.

I meant that if we add this for Ollama, it would be good to have something similar in the other clients as well, and ideally all of it should go into the 0.3.0 release. This isn't something you necessarily have to do right now; it's more like the definition of a new task.
Alternatively, we could mark it as Experimental.
@Ololoshechkin do we have an Experimental annotation for Koog?

I am really sorry, but I don't understand at all what you mean here. Would you mind elaborating, please?

What I’m getting at is that if we write our own function that creates the client internally and we want to pass parameters to it, we’ll have to expose RequestBuilder in the function signature, like this:

fun createOllamaClient(
    ...,
    requestBuilder: OllamaRequestBuilder.(Prompt, LLModel) -> Unit = { _, _ -> }
)

It's better when we can use a simpler approach: store our parameters in a variable and just pass them around.
But for that, we need alternative constructors or factory functions, and to rewrite OllamaRequestBuilder like this:

public class OllamaRequestBuilder(
    public var seed: Int? = null,
    public var numCtx: Int? = null,
    public var numPredict: Int? = null,
) {
    fun validate()
    fun build(): OllamaChatRequestDTO.Options
}
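
Then caller code could look like this (the createOllamaClient factory and the requestOptions parameter are hypothetical, following the OllamaRequestOptions idea suggested earlier in the thread):

// Store the parameters in a variable and pass them around, no lambda needed.
val options = OllamaRequestBuilder(seed = 0, numCtx = 8192)
val client = createOllamaClient(requestOptions = options) // hypothetical factory from the sketch above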

@ptitjes
Contributor Author

ptitjes commented Jul 11, 2025

I understand your points. Let's put this on hold for now and take some time to discuss a better solution. I can live with configuring bigger context sizes directly on the Ollama server for some time. I just hope we can agree on an acceptable design in one of the next Koog releases, because a server-wide context setting comes at the expense of memory and performance.

@ptitjes ptitjes marked this pull request as draft July 11, 2025 09:39
@Ololoshechkin
Collaborator

I have faith that in the future small LLMs with great performance will become more and more common. That is good for people and good for the climate. IMO Koog has to support these use cases.

Totally agree with you here, @ptitjes !

@Ololoshechkin
Collaborator

I meant that if we add this for Ollama, it would be good to have something similar in the other clients as well, and ideally all of it should go into the 0.3.0 release. This isn't something you necessarily have to do right now; it's more like the definition of a new task.

I don't think we can make it for 0.3.0, unfortunately :(

But in fact -- adding the Ollama support first, as experimental, in 0.3.* is fine -- we don't have to wait for 0.4.0.

An Experimental annotation -- no, we currently don't have one. So it might be a great time to add it.

@Ololoshechkin Ololoshechkin changed the title Ollama request customization Allow adjusting context window sizes for Ollama dynamically Jul 12, 2025
@ptitjes
Contributor Author

ptitjes commented Jul 13, 2025

I don't think we can make it for 0.3.0, unfortunately :(

But in fact -- adding the Ollama support first, as experimental, in 0.3.* is fine -- we don't have to wait for 0.4.0.

Yes, I'd rather we get the design almost right than rush it.

An Experimental annotation -- no, we currently don't have one. So it might be a great time to add it.

I definitely think that we should have such an annotation and start being more conservative (maybe not from 0.3 or 0.4, but soon) on the new APIs we introduce. Otherwise that might bite us later.

@ptitjes
Contributor Author

ptitjes commented Jul 13, 2025

@Ololoshechkin I really like your ContextWindowStrategy interface proposal, coming with some built-in implementations and a sensible default.

I will rework this PR in this way. I need a few days to think about the appropriate built-in implementations to cover the main use-cases.

Some initial thoughts though:

  1. I would remove seed there, as it is completely unrelated. But at the same time, I think it's only useful for tests, so I wouldn't mind having it as a simple constructor parameter of OllamaClient. If you are OK with that, I will do a separate PR with only this and the corresponding integration test.

  2. This ContextWindowStrategy would be something specific to Ollama, as I believe none of the other clients need anything similar. As far as I can see, all the proprietary models behind APIs handle this by themselves. So that wouldn't make much sense for those, right?

  3. I am convinced we should add new attributes to LLModel, namely contextLength: Long and embeddingLength: Long. That would be really helpful to inform the context summarization strategies (and also this ContextWindowStrategy as a result). WDYT?

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 32f2dea to 3ed055c Compare July 14, 2025 09:25
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch 2 times, most recently from 69e2881 to f8e4fbb Compare August 7, 2025 14:01
@ptitjes
Contributor Author

ptitjes commented Aug 7, 2025

So, finally coming back to this. I rewrote everything following @Ololoshechkin's advice.

I defined a ContextWindowStrategy with a single method that is responsible for computing the context length value to send to Ollama as num_ctx. This computed value can be null in case the user wishes to let the Ollama server deal with that. (This is fully documented in the KDocs of ContextWindowStrategy.)

I implemented three basic strategies (roughly sketched in code right after this list):

  • None, which simply returns null to let the Ollama server deal with this.
  • Fixed, which returns a constant value. If this value is greater than what the current model supports, then we fall back to the maximum context length supported by the model (and log a warning).
  • FitPrompt, which computes a context length value so that the current prompt fits in the context window. This strategy uses either the given prompt tokenizer or the last reported token usage in the prompt. The strategy also ensures that the computed context length is a multiple of the given granularity (in order to avoid too many context length changes, and thus model reloads by Ollama) and is coerced between a given minimum context length and the maximum supported context length of the model.
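
To make this concrete, here is a simplified, self-contained sketch of these three strategies (real Koog types such as Prompt, PromptTokenizer, and LLModel are replaced by plain token counts here; the actual interface and signatures in the PR may differ):

interface ContextWindowStrategy {
    // Returns the num_ctx value to send to Ollama, or null to let the server decide.
    fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int?
}

// Let the Ollama server fall back to its configured OLLAMA_CONTEXT_LENGTH.
object NoneStrategy : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int? = null
}

// Always request the same context length, capped at what the model supports.
class FixedStrategy(private val length: Int) : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int =
        minOf(length, maxModelContextLength)
}

// Size the window so the current prompt fits, rounded up to a granularity so that
// small prompt growth doesn't force Ollama to reload the model.
class FitPromptStrategy(
    private val minContextLength: Int = 2048,
    private val granularity: Int = 1024,
) : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int {
        val rounded = ((promptTokenCount + granularity - 1) / granularity) * granularity
        return rounded.coerceIn(minContextLength, maxModelContextLength)
    }
}

fun main() {
    // A 5000-token prompt gets a 5120-token window (next multiple of 1024).
    println(FitPromptStrategy().contextLength(promptTokenCount = 5000, maxModelContextLength = 131072))
}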

I had to move PromptTokenizer and its implementation to the prompt-tokenizer module. (I don't know why it was left stranded next to the Tokenizer agent feature.)

I added unit tests that check that the correct num_ctx value is sent to Ollama by using a MockHttpClient. I considered that integration tests would be really difficult here (given we would have to produce very long prompts) and so not worth the cost.
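
For illustration, such a test could look roughly like this with Ktor's MockEngine (the MockHttpClient in this PR may be a different helper, and the request body here is hand-written rather than produced by OllamaClient):

import io.ktor.client.*
import io.ktor.client.engine.mock.*
import io.ktor.client.request.*
import io.ktor.http.*
import io.ktor.http.content.TextContent
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    var capturedBody: String? = null
    // The mock engine records whatever is sent to /api/chat instead of talking to a real server.
    val engine = MockEngine { request ->
        capturedBody = (request.body as TextContent).text
        respond(
            content = """{"message":{"role":"assistant","content":"ok"},"done":true}""",
            status = HttpStatusCode.OK,
            headers = headersOf(HttpHeaders.ContentType, "application/json"),
        )
    }
    val client = HttpClient(engine)
    client.post("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody("""{"model":"llama3.2","options":{"num_ctx":8192}}""")
    }
    // The assertion a real test would make: the outgoing request carries the expected num_ctx.
    check(capturedBody!!.contains("\"num_ctx\":8192"))
    client.close()
}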

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from f8e4fbb to 6b13765 Compare August 7, 2025 14:19