Allow adjusting context window sizes for Ollama dynamically #335
Conversation
Force-pushed from 252f749 to 6334bcf.
And I had forgotten to add KDocs. That's now fixed.
private val requestBuilderAction: (OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit)? = null,
private val clock: Clock = Clock.System,
Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {}?
Also, why do you need the prompt here as a parameter?
Also, if you make this lambda the last parameter, you would be able to write:

val client = OllamaClient(...) {
    seed = 0
    numCtx = 100
}

Also, could you please clarify once again: why do you need to configure it with a lambda and not just pass a data object with this parameter, e.g.:

val client = OllamaClient(..., requestOptions = OllamaRequestOptions(seed = 0, numCtx = 100))
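For illustration, a minimal sketch of what such a data object could look like; OllamaRequestOptions is not an existing Koog type here, just an assumption to make the proposal concrete:

data class OllamaRequestOptions(
    // Both fields are optional: unset values fall back to the server defaults.
    val seed: Int? = null,
    val numCtx: Int? = null,
)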
Why not requestBuilderAction: OllamaRequestBuilder.(prompt: Prompt, model: LLModel) -> Unit = {}?

You're right, making it nullable doesn't bring any real optimization.

Also if you make this lambda the last parameter

I was actually trying to avoid putting it as the last parameter, because those are sensitive parameters and I wanted the user to consciously decide to implement the lambda, not to add a trailing lambda just because IDEA suggests it.

also why do you need the prompt here as a parameter?
could you please clarify once again -- why do you need to configure it with a lambda and not just pass a data object with this parameter
So, Ollama needs to allocate some VRAM to handle your context. num_ctx tells Ollama that you want it to be able to take num_ctx tokens into account when generating the next token. If you do not set num_ctx, it will use the constant value configured for the server (OLLAMA_CONTEXT_LENGTH, which is 4096 by default). Any token that doesn't fit in this context window will be arbitrarily truncated!
When working with local models, you want the user to be able to finely tune this, because you might be in a constrained environment. Also, you cannot ask the server admin to set a big value for OLLAMA_CONTEXT_LENGTH, because a big context is often not needed by most requests.
The size of the context window needs to be decided at each request, if there is a big context in the prompt. If the user can estimate (either by looking at lastTokenUsage or using a tokenizer) that the prompt contains more than 4096 tokens, then they need to allocate more context with num_ctx.
Proprietary LLMs behind APIs do the same. At some point, they have to decide how much VRAM they allocate for your request (or rather, on what node with preallocated models they will forward your request). You just don't see it.
Unfortunately, with Ollama, it is the responsibility of the caller to define the allocated context size (within the limits of the actual maximum context length of the model, obviously).
When handling long-running conversations or maybe doing RAG with lots of augmented data, a typical implementation would look something like this:

requestBuilderAction = { prompt, model ->
    val tokenCount = prompt.lastTokenUsage
    val maxContextSize = model.contextSize // <-- custom extension property here
    numCtx = minOf(maxOf(2048, ((tokenCount + 500) / 1024 + 1) * 1024), maxContextSize)
}
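To make the rounding heuristic above concrete, here is a worked example with hypothetical numbers (a prompt estimated at 5000 tokens, sent to a model capped at 8192):

// Hypothetical numbers, only to illustrate the heuristic above.
val tokenCount = 5000
val maxContextSize = 8192
// (5000 + 500) / 1024 = 5 (integer division), + 1 = 6, * 1024 = 6144,
// then clamped to the [2048, maxContextSize] range, which leaves 6144.
val numCtx = minOf(maxOf(2048, ((tokenCount + 500) / 1024 + 1) * 1024), maxContextSize)
// numCtx == 6144: the prompt fits, with roughly a thousand tokens of headroom.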
BTW, it seems nothing is done still on the AI Assistant side: https://youtrack.jetbrains.com/issue/LLM-13677/Lift-context-size-restrictions-for-local-Ollama-model-or-make-it-configurable
Ideally, the numCtx would be an LLMParam in the prompt. But there is currently no way to have provider-specific LLMParams.
The size of the context window needs to be decided at each request, if there is a big context in the prompt.

I think estimating the size based on the prompt may be tricky. You can't take into account the future output of the LLM itself. It would be sufficient for me to just set num_ctx to the maximum possible. But perhaps there are scenarios where this dynamic num_ctx is useful.
I assume this requestBuilderAction also runs before tool call results? So the LLM can upsize in case of a big tool result.
Yes, it should be sufficient for my case of using a static context value. Anyone with specific needs could heuristically add some more overhead for tools and templates. LGTM 👍
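For the static case discussed here, a minimal sketch (reusing the hypothetical contextSize extension property from the earlier example) could simply pin num_ctx to the model's maximum:

// Static configuration: always allocate the model's maximum context window.
// model.contextSize is the same hypothetical extension property as above.
requestBuilderAction = { _, model ->
    numCtx = model.contextSize
}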
Hi @ptitjes ! I actually looked through this conversation again, and it looks like adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)
It would also be great to reuse the Tokenizer here for prompt estimation.
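A rough sketch of what that could look like, assuming a hypothetical tokenizer with a countTokens(text) helper and summing over the prompt messages (Koog's actual Tokenizer and Prompt APIs may differ):

// Hypothetical: estimate the prompt size with a tokenizer instead of lastTokenUsage.
// tokenizer.countTokens(...) and prompt.messages are assumptions for illustration.
requestBuilderAction = { prompt, model ->
    val estimated = prompt.messages.sumOf { tokenizer.countTokens(it.content) }
    numCtx = minOf(maxOf(2048, ((estimated + 500) / 1024 + 1) * 1024), model.contextSize)
}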
Other than that, it might be a great idea to make the main KDoc above the OllamaClient much longer and explain all these details, and even provide a few code samples of how to use this new configuration lambda (feel free to check the KDoc of AgentMemory.Feature or EventHandler.Feature for instance -- we already have quite long explanations in KDocs :)). Also, it's nice for LLMs to learn these things once they read them :)
adding an example to the examples module that showcases Ollama with a fine-tuned dynamic context length based on the prompt and model would be ideal. Plus, if you could put some textual explanations there in this example, that would benefit everyone's understanding and enrich the knowledge base around Koog. WDYT? :)

make the main KDoc above the OllamaClient much longer and explain all these details

Yeah, definitely.
integration-tests/src/jvmTest/kotlin/ai/koog/integration/tests/OllamaClientIntegrationTest.kt (outdated review comment, resolved)
Force-pushed from 421dba4 to 028a335.
Rebased on develop and applied the requested changes.
Force-pushed from 028a335 to 32f2dea.
Rebased on develop and removed the commit that @aozherelyeva also implemented in develop.
@Ololoshechkin @devcrocod @Rizzen I don't know if any of you have had time to take a second look at this? I was hoping we could maybe have this in the 0.3 release.
I'm worried that Ollama's implementation is drifting further and further from the others.
Also, I think it would be better to add additional functions or constructors for creating the client, so that I can store the request parameters in a variable and pass them that way. It's more convenient when writing custom wrapper functions on top of Koog.
integration-tests/src/jvmMain/kotlin/ai/koog/integration/tests/OllamaTestFixture.kt (outdated review comment, resolved)
...or-ollama-client/src/commonMain/kotlin/ai/koog/prompt/executor/ollama/client/OllamaClient.kt (outdated review comment, resolved)
@@ -71,6 +71,9 @@ internal data class OllamaChatRequestDTO(
@Serializable
internal data class Options(
    val temperature: Double? = null,
    val seed: Int? = null,
    @SerialName("num_ctx") val numCtx: Int? = null,
Doesn't the Ollama JSON config assume the use of a snake_case naming flag, so that we don't have to use @SerialName?
Well, there are @SerialName(...) annotations everywhere in the Ollama DTO definitions.
To be honest, I didn't even know about the kotlinx-serialization namingStrategy option. So thanks for making me discover that.
However, it seems that using global naming strategies may not be the best idea. To quote JsonNamingStrategy's KDocs:
Controversy about using global naming strategies
Global naming strategies have one key trait that makes them a debatable and controversial topic: They are very implicit. It means that by looking only at the definition of the class, it is impossible to say which names it will have in the serialized form. As a consequence, naming strategies are not friendly to refactorings. Programmer renaming myId to userId may forget to rename my_id, and vice versa. Generally, any tools one can imagine work poorly with global naming strategies: Find Usages/Rename in IDE, full-text search by grep, etc. For them, the original name and the transformed are two different things; changing one without the other may introduce bugs in many unexpected ways. The lack of a single place of definition, the inability to use automated tools, and more error-prone code lead to greater maintenance efforts for code with global naming strategies. However, there are cases where usage of naming strategies is inevitable, such as interop with an existing API or migrating a large codebase. Therefore, one should carefully weigh the pros and cons before considering adding global naming strategies to an application.
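For reference, the global option being discussed looks like this in kotlinx.serialization (it is still marked experimental); with it, num_ctx would map to numCtx without any @SerialName annotation:

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonNamingStrategy

// Global snake_case mapping: every property name is converted on (de)serialization.
@OptIn(ExperimentalSerializationApi::class)
val ollamaJson = Json {
    namingStrategy = JsonNamingStrategy.SnakeCase
}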
Ideally, all of this should be caught by tests, so I don't see a big issue with it.
Besides, we already use namingStrategy in other clients.
So far, the only real advantage is a slight reduction in code.
I don't think it's critical, since we can remove it or add it to the other clients at any time if needed.
I understand, but isn't that the goal of having different clients? Ollama is really different in behaviour from proprietary models behind APIs. Either we embrace it, or we will always have subpar support for Ollama. I have faith that in the future small LLMs with great performance will become more and more common, and that we won't need proprietary models from private corporations as much to run quality AI agents. That is good for people and good for the climate. IMO Koog has to support these use-cases.
I am really sorry, but I don't understand at all what you mean here. Would you mind elaborating, please?
I meant that if we add this for Ollama, it would be good to have something similar in the other clients as well. And ideally, all of it should go into the 0.3.0 release. This isn't something you necessarily have to do right now; it's more like the definition of a new task.
What I'm getting at is that if we write our own function that creates the client internally and we want to pass parameters to it, we'll have to expose the OllamaRequestBuilder in the function signature.
It's better when we can use a simpler approach: store our parameters in a variable and just pass them around.

public class OllamaRequestBuilder(
    public var seed: Int? = null,
    public var numCtx: Int? = null,
    public var numPredict: Int? = null
) {
    fun validate()
    fun build(): OllamaChatRequestDTO.Options
}
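A hedged sketch of the usage being argued for here, assuming the requestOptions constructor parameter floated earlier in this thread (none of this is existing Koog API yet):

// Hypothetical: request parameters held in a plain variable and passed around,
// instead of being captured inside a configuration lambda.
val options = OllamaRequestBuilder(seed = 0, numCtx = 8192)
val client = OllamaClient(/* ...usual arguments... */ requestOptions = options)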
I understand your points. Let's put this on hold for now and take some time discussing a better solution. I can live with configuring bigger context sizes directly on Ollama for some time. I just hope we can agree on an acceptable design in the next Koog releases, because this is at the expense of memory and performance.
Totally agree with you here, @ptitjes !
I don't think we can make it for 0.3.0 already, unfortunately :( But in fact, adding Ollama as a first thing, and as experimental, in 0.3.* is fine; we don't have to wait for 0.4.0. As for an Experimental annotation: no, we currently don't have one, so it might be a great time to add it.
Yes, I'd prefer we get an almost-right design rather than rush it.
I definitely think that we should have such an annotation and start being more conservative (maybe not from 0.3 or 0.4, but soon) about the new APIs we introduce. Otherwise that might bite us later.
@Ololoshechkin I really like your suggestion; I will rework this PR in this way. I need a few days to think about the appropriate built-in implementations to cover the main use-cases, and I have some initial thoughts already.
Force-pushed from 32f2dea to 3ed055c.
Force-pushed from 69e2881 to f8e4fbb.
So, finally coming back to this. I rewrote everything following @Ololoshechkin's advice. I defined a new abstraction and implemented three basic strategies.
I had to move some code around, and I added unit tests that check that the correct options are sent.
Force-pushed from f8e4fbb to 6b13765.
This PR adds the ability to customize Ollama chat requests, as discussed in #295.
@aozherelyeva I had to make two modifications to the OllamaTestFixture and I hope it's OK:
- baseUrl, to allow some tests to initialize a custom OllamaClient if required.

Type of the change

Checklist for all pull requests
- develop as the base branch

Additional steps for pull requests adding a new feature