How LLM Settings Affect Prompt Engineering

When designing and testing prompts, you typically interact with an LLM through an API. A few parameters can be configured to get different results from your prompts. Finding the right settings for your use case may require some trial and error, but tuning them is crucial to making responses more reliable and better suited to your needs. The common settings you’ll encounter across LLM providers are described below; the sketch that follows shows where they fit into a typical API request.
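To make this concrete, here is a minimal sketch of where these settings are passed in a chat-completion request. It assumes the OpenAI Python SDK (openai >= 1.0) and uses a placeholder model name; parameter names may differ slightly between providers.

```python
# A minimal sketch, assuming the OpenAI Python SDK (openai >= 1.0).
# The model name is a placeholder; parameter names vary slightly across providers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": "Explain what an API is in one sentence."}],
    temperature=0.7,       # randomness of sampling
    top_p=1.0,             # nucleus sampling cutoff
    max_tokens=150,        # upper bound on generated tokens
    stop=None,             # optional stop sequences
    frequency_penalty=0.0, # penalize tokens by how often they have appeared
    presence_penalty=0.0,  # flat penalty for tokens that have appeared at all
)

print(response.choices[0].message.content)
```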
Temperature: The lower the temperature, the more deterministic the results, since the most likely next token is always chosen. Raising the temperature introduces more randomness, which encourages more varied or imaginative outputs; in effect, you are increasing the weights of the other possible tokens. For tasks like fact-based QA (question answering), you might want to apply a lower temperature value to encourage more factual and succinct responses. For writing poems or other creative tasks, raising the temperature value can be useful.
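As a rough illustration (reusing the client from the sketch above, with the same placeholder model name), the snippet below sends the same prompt at a low and a high temperature to compare how deterministic the outputs are.

```python
# Compare a low and a high temperature on the same prompt (client defined above).
prompt = "Write a one-line tagline for a coffee shop."

for temp in (0.0, 1.2):
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
    )
    # temperature=0.0 tends to return nearly the same tagline on every run;
    # temperature=1.2 tends to produce more varied, creative taglines.
    print(f"temperature={temp}: {result.choices[0].message.content}")
```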

Top P: A sampling technique used together with temperature, known as nucleus sampling, that lets you control how deterministic the model is. With Top P, only the tokens making up the top_p probability mass are considered for the next token. A low top_p value means only the most confident tokens are chosen, so keep it low if you’re after exact, factual answers. A higher top_p value lets the model consider a wider range of words, including less likely ones, which produces more varied outputs.
Generally speaking, it is advised to change either Top P or temperature, but not both.
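Following that advice, the hypothetical example below tunes only top_p for a factual question and leaves temperature at its default (same assumed SDK and placeholder model as above).

```python
# Tune only top_p for a factual question; temperature is left at its default.
result = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "In what year was the Python language first released?"}],
    top_p=0.1,  # only the tokens in the top 10% of probability mass are considered
)
print(result.choices[0].message.content)
```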
Max Length: By modifying the max length, you can control how many tokens the model produces. You can avoid lengthy or irrelevant responses and keep costs under control by setting a maximum length.
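A short sketch of capping output length, again assuming the same SDK and placeholder model:

```python
# Cap response length (and cost) with a maximum number of generated tokens.
result = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    max_tokens=60,  # generation stops once 60 tokens have been produced
)
print(result.choices[0].message.content)
```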
Stop Sequences: A stop sequence is a string that causes the model to stop generating tokens. Specifying stop sequences is another way to control the length and structure of the model’s response. For example, by adding “11” as a stop sequence, you can instruct the model to produce lists with no more than ten items.
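The sketch below applies that list example, assuming the same SDK and placeholder model; the exact stop string you need may depend on how the model formats its list.

```python
# Use a stop sequence to keep a numbered list to at most ten items.
result = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "List ideas for blog posts about prompt engineering as a numbered list (1., 2., 3., ...)."}],
    stop=["11."],  # generation halts as soon as the model starts writing item 11
)
print(result.choices[0].message.content)
```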
Frequency Penalty: The frequency penalty applies a penalty to the next token in proportion to how many times that token has already appeared in the response and prompt. The higher the frequency penalty, the less likely a word is to appear again. By assigning a larger penalty to tokens that appear more often, this setting reduces word repetition in the model’s response.
Presence Penalty: Like the frequency penalty, the presence penalty applies a penalty to repeated tokens, but the penalty is the same for every repeated token: a token that appears twice is penalized no more than one that appears ten times. This setting prevents the model from repeating the same phrases too often in its response. Use a higher presence penalty if you want the model to produce creative or diverse text, or a lower one if you need the model to stay focused.
As with temperature and top_p, it is generally advised to alter either the frequency penalty or the presence penalty, but not both.
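To illustrate, the sketch below sets each penalty on its own, in line with the advice above; the values are illustrative assumptions, not recommendations.

```python
# Set one penalty at a time, per the advice above; values are illustrative.

# Frequency penalty: the more often a token has appeared, the larger its penalty.
result = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Describe a sunset over the ocean."}],
    frequency_penalty=0.8,
)
print(result.choices[0].message.content)

# Presence penalty: a flat penalty for any token that has already appeared,
# whether it showed up twice or ten times.
result = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Describe a sunset over the ocean."}],
    presence_penalty=0.8,
)
print(result.choices[0].message.content)
```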
Before diving into some basic examples, keep in mind that your results will vary depending on the version of the LLM you use.