OpenAI-compatible chat completions endpoint on the public Tensormesh Serverless host.
Base URL: https://serverless.tensormesh.ai
Authorization: Bearer <API_KEY>

If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value in the request body. If you are targeting a different serverless host override, or you only have inference credentials, get the model name from your Tensormesh environment before sending the request. Start with Choose A Serverless Model Name if you do not already know the model name you need.
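A minimal request sketch using only the Python standard library. The `/v1/chat/completions` path is an assumption based on the OpenAI-compatible convention (this page does not state the path), and the API key and model name are placeholders:

```python
import json
import urllib.request

BASE_URL = "https://serverless.tensormesh.ai"
API_KEY = "<API_KEY>"  # placeholder: your serverless API key

payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",  # example model name from this page
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",  # assumed OpenAI-compatible path
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # uncomment with real credentials
```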
Bearer authentication using your serverless API key. Format: Bearer <API_KEY>
Serverless model name to use.
If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value here. If you are targeting a different serverless host override, or you only have inference credentials, get the model name from your Tensormesh environment before sending requests.
Example: "MiniMaxAI/MiniMax-M2.5"
A list of messages comprising the conversation so far.
Sampling temperature. Higher values make the output more random.
Note: temperature=0 is greedy sampling.
Nucleus sampling. We generally recommend altering this or temperature but not both.
The maximum number of tokens to generate in the completion.
If set too low, the model may hit finish_reason="length" before producing useful message.content.
Alternative name for max_tokens (this inference surface accepts both; if both are set, behavior is runtime-dependent).
How many choices to generate.
Note: when using greedy sampling (temperature=0), n must be 1 (otherwise a 400 error).
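The greedy-sampling constraint above can be checked client-side before sending a request. This is a hypothetical helper, not part of the API; it only restates the rule that temperature=0 with n > 1 produces a 400 error:

```python
def validate_sampling(params: dict) -> None:
    """Sketch of the documented constraint: temperature=0 selects greedy
    (deterministic) decoding, so requesting n > 1 choices is rejected."""
    if params.get("temperature") == 0 and params.get("n", 1) != 1:
        raise ValueError("temperature=0 (greedy) requires n=1")

validate_sampling({"temperature": 0.7, "n": 3})  # ok: sampling, multiple choices
validate_sampling({"temperature": 0, "n": 1})    # ok: greedy, single choice
```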
Stop sequence(s) where the API will stop generating further tokens.
Top-k sampling. Filters candidates to the K most likely tokens at each step.
Minimum probability threshold for token selection (alternative to top_p / top_k).
Typical-p sampling parameter.
Random seed for best-effort deterministic sampling (model/runtime dependent).
Applies a penalty to repeated tokens to discourage repetition.
Target perplexity for Mirostat sampling (if supported).
Learning rate for Mirostat sampling (if supported).
Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events (SSE) as they become available, with the stream terminated by a data: [DONE] message.
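A sketch of consuming the data-only SSE stream described above. The chunk shape (`choices[].delta.content`) is assumed from the OpenAI streaming convention; the sample lines are illustrative, not captured output:

```python
import json

def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from 'data:' SSE lines, stopping at the
    'data: [DONE]' sentinel that terminates the stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)

# Illustrative stream (chunk shape assumed, not taken from this page):
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(
    c["choices"][0]["delta"].get("content", "") for c in iter_sse_chunks(stream)
)
# text == "Hello"
```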
Streaming options. Only valid when stream=true.
Forces the model to produce a specific output format.
Supported values:
{ "type": "json_object" } (JSON mode)
{ "type": "text" }
Note: extra keys inside response_format are rejected by this inference surface.
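A request body using JSON mode, keeping response_format to exactly the documented shape (extra keys are rejected). Model name and prompt are placeholders:

```python
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",  # placeholder model name
    "messages": [{"role": "user", "content": "List three primes as JSON."}],
    # Only the documented shapes are accepted; adding extra keys inside
    # response_format would be rejected by this inference surface.
    "response_format": {"type": "json_object"},
}
```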
A list of tools the model may call. Currently, only functions are supported as a tool.
Controls which (if any) tool is called by the model.
none: the model will not call any tool and instead generates a message.
auto: the model can pick between generating a message or calling tools.
required: the model is instructed to call one or more tools (model/runtime dependent).
Allowed values: auto, none, required

Enable parallel tool/function calling (if supported).
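A request body declaring one function tool in the OpenAI-compatible shape (the only tool type this page says is supported). The function name and parameter schema are illustrative, not part of the API:

```python
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",  # placeholder model name
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [
        {
            "type": "function",  # currently the only supported tool type
            "function": {
                "name": "get_weather",  # hypothetical function
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # or "none" / "required"
}
```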
Penalizes new tokens based on whether they appear in the text so far.
Penalizes new tokens based on their existing frequency in the text so far.
Include per-token log probabilities in the response (when supported by the model/runtime).
Number of most likely tokens to return at each position (requires logprobs=true).
Modify the likelihood of specified tokens appearing in the completion.
Maps token id (string) to bias (number).
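A logit_bias sketch matching the documented mapping of token id (string) to bias (number). The token IDs below are made up for illustration; real IDs are tokenizer-specific:

```python
payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",  # placeholder model name
    "messages": [{"role": "user", "content": "Pick a color."}],
    "logit_bias": {
        "1234": 5.0,     # hypothetical token id: make it more likely
        "5678": -100.0,  # hypothetical token id: effectively ban it
    },
}
```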
A unique identifier representing your end-user.
Additional metadata to store with the request for tracing.
Truncate chat prompts (in tokens) to this length by evicting older messages first.
What to do when prompt plus max tokens exceeds context window.
Allowed values: truncate, error

Return token IDs alongside text (populates choices[].token_ids).
Isolation key for prompt caching to separate cache entries (if supported).
Return raw output from the model.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Whether to include performance metrics in the response body.
Note: this inference surface may accept the field but not include any extra metrics in the response body.
Echo back the prompt in addition to the completion.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Echo back the last N tokens of the prompt (if supported).
Whether the model should ignore the EOS token.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Speculative decoding prompt or token IDs (if supported).
OpenAI-compatible predicted output for speculative decoding (if supported).
Controls reasoning behavior for supported models (model/runtime dependent).
Allowed values: low, medium, high, none

Controls how historical assistant reasoning content is included in the prompt (if supported).
Allowed values: disabled, interleaved, preserved

Alternative Anthropic-compatible config for reasoning (if supported).
Deprecated (OpenAI). Use tools instead.
Deprecated (OpenAI). Use tool_choice instead.
Allowed values: auto, none

Successful Response
A unique identifier of the response.
The Unix time in seconds when the response was generated.
The model used for the chat completion.
The list of chat completion choices.
The object type, which is always "chat.completion".
Optional performance metrics (if enabled/supported).
Optional prompt token ids (when enabled/supported).
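An illustrative response body built from the fields documented above, and how a client would pull out the generated text. All values are placeholders; the choices[].message shape is assumed from the OpenAI-compatible convention:

```python
response = {
    "id": "chatcmpl-abc123",       # unique identifier (placeholder)
    "created": 1700000000,         # Unix time in seconds (placeholder)
    "model": "MiniMaxAI/MiniMax-M2.5",
    "object": "chat.completion",   # always "chat.completion"
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
}

assert response["object"] == "chat.completion"
text = response["choices"][0]["message"]["content"]
```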