Allow adjusting context window sizes for Ollama dynamically #335

Draft
ptitjes wants to merge 3 commits into develop from feature/ollama-request-customization

Conversation

ptitjes
Contributor

@ptitjes ptitjes commented Jun 25, 2025

This PR adds the ability to customize Ollama chat requests, as discussed in #295.

@aozherelyeva I had to make two modifications to the OllamaTestFixture, and I hope that's OK:

  • Added a check to pull the llama3.2 model in the container, because the image suggested in TESTING.md doesn't come with the llama3.2 model by default. Now it works without touching anything (provided you have Docker, obviously).
  • Exposed the baseUrl to allow some tests to initialize a custom OllamaClient if required.

Type of the change

  • New feature
  • Bug fix
  • Documentation fix

Checklist for all pull requests

  • The pull request has a description of the proposed change
  • I read the Contributing Guidelines before opening the pull request
  • The pull request uses develop as the base branch
  • Tests for the changes have been added
  • All new and existing tests passed

Additional steps for pull requests adding a new feature

  • An issue describing the proposed change exists
  • The pull request includes a link to the issue
  • The change was discussed and approved in the issue
  • Docs have been added / updated

@ptitjes ptitjes changed the title Feature/ollama request customization Ollama request customization (#295) Jun 25, 2025
@ptitjes ptitjes changed the title Ollama request customization (#295) Ollama request customization Jun 25, 2025
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 252f749 to 6334bcf Compare June 25, 2025 18:35
@ptitjes
Contributor Author

ptitjes commented Jun 25, 2025

And I had forgotten to add KDocs. That's now fixed.

Comment on lines 79 to 70
private val requestBuilderAction: (OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit)? = null,
private val clock: Clock = Clock.System,
Collaborator

Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {} ?

also why do you need the prompt here as a parameter?

Collaborator

Also if you make this lambda the last parameter -- you would be able to write

val client = OllamaClient(...) {
   seed = 0
   numCtx = 100
}

Also, could you please clarify once again -- why do you need to configure it with a lambda and not just pass a data object with this parameter, ex:

val client = OllamaClient(..., requestOptions = OllamaRequestOptions(seed = 0, numCtx = 100))

Contributor Author

@ptitjes ptitjes Jun 26, 2025

Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {} ?

You're right, making it nullable doesn't bring any real optimization.

Also if you make this lambda the last parameter

I was actually trying to avoid putting it as the last parameter, because these are sensitive parameters and I wanted the user to consciously decide to implement the lambda, not add a trailing lambda just because IDEA suggests it.

also why do you need the prompt here as a parameter?
could you please clarify once again -- why do you need to configure it with a lambda and not just pass a data object with this parameter

So, Ollama needs to allocate some VRAM to handle your context. num_ctx tells Ollama that you want it to be able to take num_ctx tokens into account when generating the next token. If you do not set num_ctx, it will use the constant value configured for the server (OLLAMA_CONTEXT_LENGTH, which is 4096 by default). Any tokens that don't fit in this context window will be arbitrarily truncated!
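
For reference, this is what setting num_ctx looks like at the level of the raw Ollama REST API (a minimal Ktor sketch of a standard /api/chat call; the model name and message are just example data, and this is not Koog code):

import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import io.ktor.http.*
import kotlinx.coroutines.runBlocking

// Minimal sketch: a raw Ollama chat request that asks for an 8192-token context window.
// The /api/chat endpoint and the options.num_ctx field are part of Ollama's public API.
fun main() = runBlocking {
    val client = HttpClient(CIO)
    val response = client.post("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody(
            """
            {
              "model": "llama3.2",
              "messages": [{"role": "user", "content": "Hello!"}],
              "stream": false,
              "options": { "num_ctx": 8192 }
            }
            """.trimIndent()
        )
    }
    println(response.bodyAsText())
    client.close()
}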

When working with local models, you want the user to be able to tune this finely, because you might be in a constrained environment. Also, you cannot ask the server admin to set a big value for OLLAMA_CONTEXT_LENGTH, because most requests don't need a big context.

The size of the context window needs to be decided at each request if there is a big context in the prompt. If the user can estimate (either by looking at lastTokenUsage or by using a tokenizer) that the prompt contains more than 4096 tokens, then they need to allocate more context with num_ctx.

Proprietary LLMs behind APIs do the same. At some point, they have to decide how much VRAM they allocate for your request (or rather, on what node with preallocated models they will forward your request). You just don't see it.

Unfortunately, with Ollama, it is the responsibility of the caller to define the allocated context size (within the limits of the actual maximum context length of the model, obviously).

When handling long-running conversations or doing RAG with lots of augmented data, a typical implementation would look something like this:

  requestBuilderAction = { prompt, model ->
    val tokenCount = prompt.lastTokenUsage
    val maxContextSize = model.contextSize // <-- custom extension property here
    // Leave ~500 tokens of headroom, round up to the next multiple of 1024,
    // keep at least 2048, and never exceed what the model supports.
    numCtx = minOf(maxOf(2048, ((tokenCount + 500) / 1024 + 1) * 1024), maxContextSize)
  }

BTW, it seems nothing has been done yet on the AI Assistant side: https://youtrack.jetbrains.com/issue/LLM-13677/Lift-context-size-restrictions-for-local-Ollama-model-or-make-it-configurable

Contributor Author

Ideally, the numCtx would be an LLMParam in the prompt. But there is currently no way to have provider-specific LLMParams.


The size of the context window needs to be decided at each request, if there is a big context in the prompt.

I think estimating the size based on the prompt may be tricky. You can't take into account the future output of the LLM itself. It would be sufficient for me to just set num_ctx to the maximum possible. But perhaps there are scenarios where this dynamic num_ctx is useful.

I assume this requestBuilderAction also runs before tool call results? So the LLM can upsize in case of a big tool result.


Yes, it should be sufficient for my case of using a static context value. Anyone with specific needs could heuristically add some more overhead for tools and templates. LGTM 👍

Collaborator

Hi @ptitjes! I actually looked through this conversation again, and it looks like adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)

Collaborator

Also, it would be great to reuse the Tokenizer here for prompt estimation.

Collaborator

Other than that, it might be a great idea to make the main KDoc above OllamaClient much longer and explain all these details, and even provide a few code samples showing how to use this new configuration lambda (feel free to check the KDoc of AgentMemory.Feature or EventHandler.Feature, for instance -- we already have quite long explanations in KDocs :)). It's also nice for LLMs to learn these things once they read them :)

Contributor Author

@ptitjes ptitjes Jul 13, 2025

adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)

make the main KDoc above OllamaClient much longer and explain all these details

Yeah, definitely.

@Ololoshechkin Ololoshechkin requested a review from devcrocod June 25, 2025 23:02
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 421dba4 to 028a335 Compare June 27, 2025 12:01
@ptitjes
Contributor Author

ptitjes commented Jun 27, 2025

Rebased on develop and applied requested changes:

  • made requestBuilderAction the last (non-nullable) constructor parameter
  • checked that the response is the same when seed=0

@ptitjes
Contributor Author

ptitjes commented Jul 1, 2025

I'll have to rebase because the llama3.2 pull of 1012015 was merged in #371.

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 028a335 to 32f2dea Compare July 2, 2025 16:58
@ptitjes
Contributor Author

ptitjes commented Jul 2, 2025

Rebased on develop and removed the commit which @aozherelyeva also implemented in develop.

@ptitjes
Contributor Author

ptitjes commented Jul 4, 2025

@Ololoshechkin @devcrocod @Rizzen I don't know if any of you have had time to take a second look at this? I was hoping we could maybe get this into the 0.3 release.

Contributor

@devcrocod devcrocod left a comment

I’m worried that Ollama’s implementation is drifting further and further from the others.

Also, I think it would be better to add additional functions or constructors for creating the client, so that I can store the request parameters in a variable and pass them that way. It's more convenient when writing custom wrapper functions on top of Koog.

@@ -71,6 +71,9 @@ internal data class OllamaChatRequestDTO(
@Serializable
internal data class Options(
val temperature: Double? = null,
val seed: Int? = null,
@SerialName("num_ctx") val numCtx: Int? = null,
Contributor

Doesn't the Ollama JSON config assume snake_case throughout, so that we could use a naming strategy and not have to use @SerialName?

Contributor Author

@ptitjes ptitjes Jul 4, 2025

Well, there are @SerialName(...) annotations everywhere in the Ollama DTO definitions.

To be honest, I didn't even know about the kotlinx-serialization namingStrategy option. So thanks for making me discover that.

However, it seems that using global naming strategies may not be the best idea. To quote JsonNamingStrategy's KDocs:

Controversy about using global naming strategies

Global naming strategies have one key trait that makes them a debatable and controversial topic: They are very implicit. It means that by looking only at the definition of the class, it is impossible to say which names it will have in the serialized form. As a consequence, naming strategies are not friendly to refactorings. Programmer renaming myId to userId may forget to rename my_id, and vice versa. Generally, any tools one can imagine work poorly with global naming strategies: Find Usages/Rename in IDE, full-text search by grep, etc. For them, the original name and the transformed are two different things; changing one without the other may introduce bugs in many unexpected ways. The lack of a single place of definition, the inability to use automated tools, and more error-prone code lead to greater maintenance efforts for code with global naming strategies. However, there are cases where usage of naming strategies is inevitable, such as interop with an existing API or migrating a large codebase. Therefore, one should carefully weigh the pros and cons before considering adding global naming strategies to an application.
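
For completeness, here is what the two approaches boil down to -- explicit @SerialName versus a global naming strategy (a standalone illustration, not code from this PR):

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonNamingStrategy

// Explicit per-property names, as the Ollama DTOs currently do:
@Serializable
data class OptionsExplicit(@SerialName("num_ctx") val numCtx: Int? = null)

// Global naming strategy; the snake_case mapping is implicit:
@Serializable
data class OptionsImplicit(val numCtx: Int? = null)

@OptIn(ExperimentalSerializationApi::class)
val snakeCaseJson = Json { namingStrategy = JsonNamingStrategy.SnakeCase }

fun main() {
    // Both print {"num_ctx":8192}, but only the first spells the wire name out in the source.
    println(Json.encodeToString(OptionsExplicit.serializer(), OptionsExplicit(numCtx = 8192)))
    println(snakeCaseJson.encodeToString(OptionsImplicit.serializer(), OptionsImplicit(numCtx = 8192)))
}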

Contributor

Ideally, all of this should be caught by tests, so I don't see a big issue with it. Besides, we already use namingStrategy in other clients. So far, the only real advantage is a slight reduction in code.

I don't think it's critical, since we can remove it or add it to the other clients at any time if needed.

@ptitjes
Contributor Author

ptitjes commented Jul 4, 2025

I’m worried that Ollama’s implementation is drifting further and further from the others.

I understand, but isn't that the goal of having different LLMClient implementations? I.e. to accommodate the differences and take advantage of the specificities of each LLM server? Granted, in an ideal world we would have a single LLMClient implementation.

Ollama is really different in behaviour from proprietary models behind APIs. Either we embrace it, or we will always have subpar support for Ollama.

I have faith that in the future small LLMs with great performance will become more and more common, and that we won't need proprietary models from private corporations as much to run quality AI agents. That is good for people and good for the climate. IMO Koog has to support these use cases.

Also, I think it would be better to add additional functions or constructors for creating the client, so that I can store the request parameters in a variable and pass them that way. It's more convenient when writing custom wrapper functions on top of Koog.

I am really sorry, but I don't understand at all what you mean here. Would you mind elaborating, please?

@devcrocod
Contributor

devcrocod commented Jul 8, 2025

I understand, but isn't that the goal of having different LLMClient implementations? I.e. to accommodate the differences and take advantage of the specificities of each LLM server? Granted, in an ideal world we would have a single LLMClient implementation.

I meant that if we add this for Ollama, it would be good to have something similar in the other clients as well, and ideally all of it should go into the 0.3.0 release. This isn't something you necessarily have to do right now; it's more like the definition of a new task.
Alternatively, we could mark it as Experimental.
@Ololoshechkin do we have an Experimental annotation for Koog?

I am really sorry, but I don't understand at all what you mean here. Would you mind elaborating, please?

What I’m getting at is that if we write our own function that creates the client internally and we want to pass parameters to it, we’ll have to expose RequestBuilder in the function signature, like this:

fun createOllamaClient(
    ...,
    requestBuilder: OllamaRequestBuilder.(Prompt, LLModel) -> Unit = { _, _ -> }
)

It's better when we can use a simpler approach: store our parameters in a variable and just pass them around.
But for that, we need alternative constructors or factory functions, and to rewrite OllamaRequestBuilder like this:

public class OllamaRequestBuilder(
    public var seed: Int? = null,
    public var numCtx: Int? = null,
    public var numPredict: Int? = null,
) {
    fun validate()
    fun build(): OllamaChatRequestDTO.Options
}
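
Then caller code could look like this (the createOllamaClient factory and the requestOptions parameter are hypothetical, following the OllamaRequestOptions idea suggested earlier in the thread):

// Store the parameters in a variable and pass them around, no lambda needed.
val options = OllamaRequestBuilder(seed = 0, numCtx = 8192)
val client = createOllamaClient(requestOptions = options) // hypothetical factory from the sketch above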

@ptitjes
Contributor Author

ptitjes commented Jul 11, 2025

I understand your points. Let's put this on hold for now and take some time to discuss a better solution. I can live with configuring bigger context sizes directly on the Ollama server for some time. I just hope we can agree on an acceptable design in one of the next Koog releases, because a server-wide context setting comes at the expense of memory and performance.

@ptitjes ptitjes marked this pull request as draft July 11, 2025 09:39
@Ololoshechkin
Collaborator

I have faith that in the future small LLMs with great performance will become more and more common. That is good for people and good for the climate. IMO Koog has to support these use cases.

Totally agree with you here, @ptitjes !

@Ololoshechkin
Collaborator

I meant that if we add this for Ollama, it would be good to have something similar in the other clients as well, and ideally all of it should go into the 0.3.0 release. This isn't something you necessarily have to do right now; it's more like the definition of a new task.

I don't think we can make it for 0.3.0, unfortunately :(

But in fact -- adding the Ollama support first, as experimental, in 0.3.* is fine -- we don't have to wait for 0.4.0.

An Experimental annotation -- no, we currently don't have one. So it might be a great time to add it.

@Ololoshechkin Ololoshechkin changed the title Ollama request customization Allow adjusting context window sizes for Ollama dynamically Jul 12, 2025
@ptitjes
Contributor Author

ptitjes commented Jul 13, 2025

I don't think we can make it for 0.3.0, unfortunately :(

But in fact -- adding the Ollama support first, as experimental, in 0.3.* is fine -- we don't have to wait for 0.4.0.

Yes, I'd rather we get the design almost right than rush it.

An Experimental annotation -- no, we currently don't have one. So it might be a great time to add it.

I definitely think that we should have such an annotation and start being more conservative (maybe not from 0.3 or 0.4, but soon) on the new APIs we introduce. Otherwise that might bite us later.

@ptitjes
Contributor Author

ptitjes commented Jul 13, 2025

@Ololoshechkin I really like your ContextWindowStrategy interface proposal, coming with some built-in implementations and a sensible default.

I will rework this PR in this way. I need a few days to think about the appropriate built-in implementations to cover the main use-cases.

Some initial thoughts though:

  1. I would remove seed there, as it is completely unrelated. But at the same time, I think it's only useful for tests, so I wouldn't mind having it as a simple constructor parameter of OllamaClient. If you are OK with that, I will do a separate PR with only this and the corresponding integration test.

  2. This ContextWindowStrategy would be something specific to Ollama, as I believe none of the other clients need anything similar. As far as I can see, all the proprietary models behind APIs handle this by themselves. So that wouldn't make much sense for those, right?

  3. I am convinced we should add new attributes to LLModel, namely contextLength: Long and embeddingLength: Long. That would be really helpful to inform the context summarization strategies (and also this ContextWindowStrategy as a result). WDYT?

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from 32f2dea to 3ed055c Compare July 14, 2025 09:25
@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch 2 times, most recently from 69e2881 to f8e4fbb Compare August 7, 2025 14:01
@ptitjes
Contributor Author

ptitjes commented Aug 7, 2025

So, finally coming back to this. I rewrote everything following @Ololoshechkin's advice.

I defined a ContextWindowStrategy with a single method that is responsible for computing the context length value to send to Ollama as num_ctx. This computed value can be null in case the user wishes to let the Ollama server deal with that. (This is fully documented in the KDocs of ContextWindowStrategy.)

I implemented three basic strategies (roughly sketched in code right after this list):

  • None, which simply returns null to let the Ollama server deal with this.
  • Fixed, which returns a constant value. If this value is greater than what the current model supports, then we fall back to the maximum context length supported by the model (and log a warning).
  • FitPrompt, which computes a context length value so that the current prompt fits in the context window. This strategy uses either the given prompt tokenizer or the last reported token usage in the prompt. The strategy also ensures that the computed context length is a multiple of the given granularity (in order to avoid too many context length changes, and thus model reloads by Ollama) and is coerced between a given minimum context length and the maximum supported context length of the model.
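
To make this concrete, here is a simplified, self-contained sketch of these three strategies (real Koog types such as Prompt, PromptTokenizer, and LLModel are replaced by plain token counts here; the actual interface and signatures in the PR may differ):

interface ContextWindowStrategy {
    // Returns the num_ctx value to send to Ollama, or null to let the server decide.
    fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int?
}

// Let the Ollama server fall back to its configured OLLAMA_CONTEXT_LENGTH.
object NoneStrategy : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int? = null
}

// Always request the same context length, capped at what the model supports.
class FixedStrategy(private val length: Int) : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int =
        minOf(length, maxModelContextLength)
}

// Size the window so the current prompt fits, rounded up to a granularity so that
// small prompt growth doesn't force Ollama to reload the model.
class FitPromptStrategy(
    private val minContextLength: Int = 2048,
    private val granularity: Int = 1024,
) : ContextWindowStrategy {
    override fun contextLength(promptTokenCount: Int, maxModelContextLength: Int): Int {
        val rounded = ((promptTokenCount + granularity - 1) / granularity) * granularity
        return rounded.coerceIn(minContextLength, maxModelContextLength)
    }
}

fun main() {
    // A 5000-token prompt gets a 5120-token window (next multiple of 1024).
    println(FitPromptStrategy().contextLength(promptTokenCount = 5000, maxModelContextLength = 131072))
}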

I had to move PromptTokenizer and its implementation to the prompt-tokenizer module. (I don't know why it was left stranded next to the Tokenizer agent feature.)

I added unit tests that check that the correct num_ctx value is sent to Ollama by using a MockHttpClient. I considered that integration tests would be really difficult here (given we would have to produce very long prompts) and so not worth the cost.
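
For illustration, such a test could look roughly like this with Ktor's MockEngine (the MockHttpClient in this PR may be a different helper, and the request body here is hand-written rather than produced by OllamaClient):

import io.ktor.client.*
import io.ktor.client.engine.mock.*
import io.ktor.client.request.*
import io.ktor.http.*
import io.ktor.http.content.TextContent
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    var capturedBody: String? = null
    // The mock engine records whatever is sent to /api/chat instead of talking to a real server.
    val engine = MockEngine { request ->
        capturedBody = (request.body as TextContent).text
        respond(
            content = """{"message":{"role":"assistant","content":"ok"},"done":true}""",
            status = HttpStatusCode.OK,
            headers = headersOf(HttpHeaders.ContentType, "application/json"),
        )
    }
    val client = HttpClient(engine)
    client.post("http://localhost:11434/api/chat") {
        contentType(ContentType.Application.Json)
        setBody("""{"model":"llama3.2","options":{"num_ctx":8192}}""")
    }
    // The assertion a real test would make: the outgoing request carries the expected num_ctx.
    check(capturedBody!!.contains("\"num_ctx\":8192"))
    client.close()
}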

@ptitjes ptitjes force-pushed the feature/ollama-request-customization branch from f8e4fbb to 6b13765 Compare August 7, 2025 14:19