Documentation

Parallel Requests

You can make multiple API requests at the same time; the only limit is your plan's requests per minute.

Estimated maximum tokens-per-second throughput for single and parallel requests is listed on the Models page. Speeds will vary with current user load, as we dynamically balance speed allocation across users.
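As a rough sketch, you can issue requests in parallel from Python with concurrent.futures. The API key, model name, and the OpenAI-style response shape (choices[0].message.content) are placeholders and assumptions, and max_workers should stay within your plan's requests-per-minute limit.

import concurrent.futures
import requests

AWANLLM_API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://api.awanllm.com/v1/chat/completions"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {AWANLLM_API_KEY}",
}

def ask(question):
    # One chat completion request; "{MODEL_NAME}" is a placeholder.
    payload = {
        "model": "{MODEL_NAME}",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 256,
    }
    response = requests.post(URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    return response.json()

questions = [
    "What is an LLM?",
    "Explain top_p in one sentence.",
    "Write a haiku about GPUs.",
]

# Send the requests concurrently; keep max_workers within your plan's
# requests-per-minute limit.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(ask, questions))

for question, result in zip(questions, results):
    print(question, "->", result["choices"][0]["message"]["content"])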

Example API Request Parameters (DO NOT COPY PASTE THIS)

For working, copy-pastable examples, see the Quick-Start page. Copy only the parameters you need from here.

This example API request shows how to use the parameters; some options conflict with each other and the values are arbitrary.

import requests
import json

url = "https://api.awanllm.com/v1/chat/completions"  # Can also use the /v1/completions endpoint

payload = json.dumps({
  "model": "{MODEL_NAME}",

  # Use messages for /chat/completions
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"}
  ],
  # NOTE: Some models might not accept system prompts.

  # Use prompt for /completions
  "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an assistant AI.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello there!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  # NOTE: Make sure to use the suggested prompt format for each model when using completions. The example shown is the Llama 3 Instruct format.

  # Most important parameters
  "repetition_penalty": 1.1,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "max_tokens": 1024,
  "stream": True,

  # Extra parameters
  "seed": 0,
  "presence_penalty": 0.6,
  "frequency_penalty": 0.6,
  "dynatemp_range": 0.5,
  "dynatemp_exponent": 1,
  "smoothing_factor": 0.0,
  "smoothing_curve": 1.0,
  "top_a": 0,
  "min_p": 0,
  "tfs": 1,  # Tail-Free Sampling.
  "eta_cutoff": 1e-4,  # Eta Sampling. Adapts the cutoff threshold based on the entropy of the token probabilities.
  "epsilon_cutoff": 1e-4,  # Epsilon Sampling. Sets a simple probability threshold for token selection.
  "typical_p": 1,
  "mirostat_mode": 0,  # The mirostat mode to use. Only 2 is currently supported.
  "mirostat_tau": 1,  # The target "surprise" value that Mirostat works towards.
  "mirostat_eta": 1,  # Learning rate for mirostat.
  "use_beam_search": False,  # Whether to use beam search.
  "length_penalty": 1.0,  # Penalize sequences based on their length. Used in beam search.
  "early_stopping": False,  # Controls the stopping condition for beam search.
  "stop": [],
  "stop_token_ids": [],
  "include_stop_str_in_output": False,
  "ignore_eos": False,
  "logprobs": 5,
  "prompt_logprobs": 0,
  "custom_token_bans": [],
  "skip_special_tokens": True,
  "spaces_between_special_tokens": True,
  "logits_processors": []
})

headers = {
  'Content-Type': 'application/json',
  'Authorization': f"Bearer {AWANLLM_API_KEY}"  # AWANLLM_API_KEY is a placeholder for your API key
}

response = requests.request("POST", url, headers=headers, data=payload)
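Because the example above sets "stream": True, the response arrives as a stream of chunks rather than a single JSON body. Below is a minimal sketch of consuming the stream; it assumes the API sends OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"), which is an assumption rather than a documented guarantee.

import json
import requests

AWANLLM_API_KEY = "YOUR_API_KEY"  # placeholder
url = "https://api.awanllm.com/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {AWANLLM_API_KEY}",
}
payload = {
    "model": "{MODEL_NAME}",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256,
    "stream": True,
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    for raw_line in response.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        # Assumed OpenAI-style SSE framing: each chunk is prefixed with "data: ".
        if line.startswith("data: "):
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)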

Options Explanation

seed
The random seed to use for generation.

presence_penalty
Penalize new tokens based on whether they appear in the generated text so far. Setting it higher than 0 encourages the model to use new tokens, while setting it lower than 0 encourages the model to repeat tokens. Disabled: 0.

frequency_penalty
Penalize new tokens based on their frequency in the generated text so far. Setting it higher than 0 encourages the model to use new tokens, while setting it lower than 0 encourages the model to repeat tokens. Disabled: 0.

repetition_penalty
Penalize new tokens based on whether they appear in the prompt and the generated text so far. Values higher than 1 encourage the model to use new tokens, while lower than 1 encourage the model to repeat tokens. Disabled: 1.
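As a rough illustration, a request that mildly discourages repetition might combine the three penalties like this; the values are arbitrary and worth tuning per model.

# Illustrative anti-repetition settings; values are examples, not recommendations.
payload = {
    "model": "{MODEL_NAME}",
    "messages": [{"role": "user", "content": "List five unusual hobbies."}],
    "repetition_penalty": 1.1,   # >1 penalizes tokens seen in the prompt and output so far
    "presence_penalty": 0.3,     # >0 penalizes tokens that already appear in the output
    "frequency_penalty": 0.3,    # >0 penalizes tokens in proportion to how often they appear
    "max_tokens": 256,
}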

temperature
Control the randomness of the output. Lower values make the model more deterministic, while higher values make the model more random. Disabled: 1.

dynatemp_range
Enables Dynamic Temperature, which scales the temperature based on the entropy of the token probabilities (normalized by the maximum possible entropy for the distribution, so it scales well across different K values). This controls the variability of token probabilities. Dynamic Temperature uses a minimum and a maximum temperature: the minimum is calculated as temperature - dynatemp_range and the maximum as temperature + dynatemp_range. Disabled: 0.
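For example, with temperature 0.7 and dynatemp_range 0.5, the effective temperature varies between 0.2 and 1.2:

temperature = 0.7
dynatemp_range = 0.5

min_temperature = temperature - dynatemp_range  # 0.7 - 0.5 = 0.2
max_temperature = temperature + dynatemp_range  # 0.7 + 0.5 = 1.2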

dynatemp_exponent
The exponent value for dynamic temperature. Defaults to 1. Higher values trend toward lower temperatures; lower values trend toward higher temperatures.

smoothing_factor
The smoothing factor to use for Quadratic Sampling. Disabled: 0.0.

smoothing_curve
The smoothing curve to use for Cubic Sampling. Disabled: 1.0.

top_p
Control the cumulative probability of the top tokens to consider. Disabled: 1.

top_k
Control the number of top tokens to consider. Disabled: -1.

top_a
Controls the threshold probability for tokens, reducing randomness when AI certainty is high. Does not significantly affect output creativity. Disabled: 0.

min_p
Controls the minimum probability for a token to be considered, relative to the probability of the most likely token. Disabled: 0.
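A rough sketch of how these truncation samplers can be combined; in practice you usually enable only one or two of them, and the values below are arbitrary.

# Illustrative truncation-sampler settings; usually only one or two are enabled at once.
payload = {
    "model": "{MODEL_NAME}",
    "messages": [{"role": "user", "content": "Write a short story opening."}],
    "temperature": 0.8,
    "top_p": 0.9,    # keep tokens until their cumulative probability reaches 0.9 (1 disables)
    "top_k": 40,     # keep only the 40 most likely tokens (-1 disables)
    "top_a": 0,      # 0 disables
    "min_p": 0.05,   # drop tokens below 5% of the most likely token's probability (0 disables)
    "max_tokens": 256,
}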

tfs
Tail-Free Sampling. Eliminates low probability tokens after identifying a plateau in sorted token probabilities. It minimally affects the creativity of the output and is best used for longer texts. Disabled: 1.

eta_cutoff
Used in Eta sampling, it adapts the cutoff threshold based on the entropy of the token probabilities, optimizing token selection. Value is in units of 1e-4. Disabled: 0.

epsilon_cutoff
Used in Epsilon sampling, it sets a simple probability threshold for token selection. Value is in units of 1e-4. Disabled: 0.

typical_p
This method regulates the information content in the generated text by sorting tokens based on the sum of entropy and the natural logarithm of token probability. It has a strong effect on output content but still maintains creativity even at low settings. Disabled: 1.

mirostat_mode
The mirostat mode to use. Only 2 is currently supported. Mirostat is an adaptive decoding algorithm that generates text with a predetermined perplexity value, providing control over repetitions and thus ensuring high-quality, coherent, and fluent text. Disabled: 0.

mirostat_tau
The target "surprise" value that Mirostat works towards. Range is in 0 to infinity.

mirostat_eta
The learning rate at which Mirostat updates its internal surprise value. Range is 0 to infinity.
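A rough sketch of enabling Mirostat; only mode 2 is supported, and the tau and eta values below are illustrative.

# Illustrative Mirostat settings; mirostat_mode 0 disables it, and only mode 2 is supported.
payload = {
    "model": "{MODEL_NAME}",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "mirostat_mode": 2,   # adaptive decoding toward a target perplexity
    "mirostat_tau": 5.0,  # target "surprise"; lower values give more focused text
    "mirostat_eta": 0.1,  # learning rate for the internal surprise estimate
    "max_tokens": 512,
}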

use_beam_search
Whether to use beam search instead of normal sampling.

length_penalty
Penalize sequences based on their length. Used in beam search.

early_stopping
Controls the stopping condition for beam search. It accepts the following values: True, where generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and generation stops when it is very unlikely that better candidates will be found; and "never", where beam search stops only when there cannot be better candidates (the canonical beam search algorithm).
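A rough sketch of a beam-search request; beam search replaces normal sampling, so the sampling parameters above are typically left at their defaults (values are illustrative).

# Illustrative beam-search settings; beam search replaces normal sampling.
payload = {
    "model": "{MODEL_NAME}",
    "prompt": "Translate to French: Good morning, everyone.",
    "use_beam_search": True,
    "length_penalty": 1.0,     # 1.0 is neutral; adjusts scoring of longer vs. shorter sequences
    "early_stopping": False,   # True / False / "never", as described above
    "max_tokens": 64,
}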

stop
List of strings (words) that stop the generation when they are generated. The returned output will not contain the stop strings.

stop_token_ids
List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens (e.g. EOS).

include_stop_str_in_output
Whether to include the stop strings in output text. Default: False.
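For example, to stop generation at a blank line or at the string "User:" without echoing the stop string back:

# Stop strings are removed from the returned text unless
# include_stop_str_in_output is set to True.
payload = {
    "model": "{MODEL_NAME}",
    "prompt": "Assistant:",
    "stop": ["\n\n", "User:"],
    "stop_token_ids": [],                 # token-ID based stops, e.g. a model-specific end token id
    "include_stop_str_in_output": False,
    "max_tokens": 256,
}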

ignore_eos
Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

max_tokens
The maximum number of tokens to generate per output sequence.

logprobs
Number of log probabilities to return per output token. Note that the implementation follows the OpenAI API: the returned result includes the log probabilities of the logprobs most likely tokens, as well as the chosen tokens. The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
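A rough sketch of requesting log probabilities on the /v1/completions endpoint; the response parsing assumes an OpenAI-style choices[0]["logprobs"] field, which is an assumption rather than documented output.

import requests

AWANLLM_API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "model": "{MODEL_NAME}",
    "prompt": "The capital of France is",
    "max_tokens": 1,
    "logprobs": 5,  # up to 5 most likely tokens, plus the sampled token, per position
}
response = requests.post(
    "https://api.awanllm.com/v1/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {AWANLLM_API_KEY}",
    },
    json=payload,
)
# Assumed OpenAI-style field; inspect response.json() to confirm the exact shape.
print(response.json()["choices"][0].get("logprobs"))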

prompt_logprobs
Number of log probabilities to return per prompt token.

custom_token_bans
List of token IDs to ban from being generated.

skip_special_tokens
Whether to skip special tokens in the output. Default: True.

spaces_between_special_tokens
Whether to add spaces between special tokens in the output. Default: True.

logits_processors
List of LogitsProcessors to change the probability of token prediction at runtime. Aliased to logit_bias in the API request body.
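Since logits_processors is aliased to logit_bias in the request body, a bias request might look like the sketch below; the token IDs are hypothetical and the OpenAI-style mapping format (token ID to bias value) is an assumption.

# Assumed OpenAI-style logit_bias format: {token_id: bias}, where positive values
# make a token more likely and negative values make it less likely.
payload = {
    "model": "{MODEL_NAME}",
    "messages": [{"role": "user", "content": "Pick a number between 1 and 10."}],
    "logit_bias": {"12345": -100, "67890": 5},  # hypothetical token IDs
    "max_tokens": 16,
}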