Near-compatible chat completions endpoint on the routed Tensormesh On-Demand host.
Host: external.nebius.tensormesh.ai
Authorization: Bearer <API_KEY>
Bearer authentication using your On-Demand API key. Format: Bearer <API_KEY>
X-User-Id: <uuid>
Tensormesh user id used for attribution and routing.
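The two request headers above can be assembled as follows; a minimal sketch in Python (the `build_headers` helper and the placeholder values are illustrative, not part of any SDK):

```python
# Build the auth headers for the On-Demand host described above.
# Values passed in below are placeholders, not real credentials.
def build_headers(api_key, user_id):
    return {
        "Authorization": f"Bearer {api_key}",  # Bearer auth with your On-Demand API key
        "X-User-Id": user_id,                  # Tensormesh user id for attribution/routing
        "Content-Type": "application/json",
    }

headers = build_headers("MY_KEY", "123e4567-e89b-12d3-a456-426614174000")
```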
The On-Demand served model name to use.
Tip: discover served model names from the Control Plane model inventory,
for example via tm models list or the published Control Plane API
reference in these docs.
"openai-gpt-oss-120b-gpu-type-h200x1_8nic16"
A list of messages comprising the conversation so far.
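A minimal request body combines the served model name with the message list; a sketch (the model name is the example from this page, and the message contents are made up):

```python
import json

# Minimal chat completions payload: served model name plus conversation so far.
payload = {
    "model": "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}
body = json.dumps(payload)  # serialized JSON request body
```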
Sampling temperature. Higher values make the output more random.
Note: temperature=0 is greedy sampling.
Nucleus sampling. We generally recommend altering this or temperature but not both.
The maximum number of tokens to generate in the completion.
If set too low, the model may hit finish_reason="length" before producing useful message.content.
Alternative name for max_tokens (this inference surface accepts both; if both are set, behavior is runtime-dependent).
How many choices to generate.
Note: when using greedy sampling (temperature=0), n must be 1 (otherwise a 400 error).
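The greedy-sampling constraint above can be pre-checked on the client before sending a request; a sketch (the `validate_sampling` helper is illustrative, not part of any SDK):

```python
# Mirror the server rule client-side: with greedy sampling (temperature=0)
# only a single choice is allowed; otherwise the API returns a 400.
def validate_sampling(temperature, n):
    if temperature == 0 and n != 1:
        raise ValueError("temperature=0 (greedy) requires n=1")

validate_sampling(0.7, 3)  # fine: non-greedy sampling with multiple choices
```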
Stop sequence(s) where the API will stop generating further tokens.
Top-k sampling. Filters candidates to the K most likely tokens at each step.
Minimum probability threshold for token selection (alternative to top_p / top_k).
Typical-p sampling parameter.
Random seed for best-effort deterministic sampling (model/runtime dependent).
Applies a penalty to repeated tokens to discourage repetition.
Target perplexity for Mirostat sampling (if supported).
Learning rate for Mirostat sampling (if supported).
Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events (SSE) as they become available, with the stream terminated by a data: [DONE] message.
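Consuming the data-only SSE stream described above amounts to reading "data: ..." lines and stopping at the [DONE] sentinel; a sketch with a fabricated event sequence (delta shapes follow the OpenAI-compatible streaming format):

```python
import json

# Parse data-only server-sent events: each event line is "data: <json>",
# and "data: [DONE]" terminates the stream.
def parse_sse(lines):
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank/keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        yield json.loads(data)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(ev["choices"][0]["delta"]["content"] for ev in parse_sse(sample))
# text == "Hello"
```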
Streaming options. Only valid when stream=true.
Allows forcing the model to produce a specific output format.
Supported values:
{ "type": "json_object" } (JSON mode)
{ "type": "text" }
Note: extra keys inside response_format are rejected by this inference surface.
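Because extra keys inside response_format are rejected, a client-side pre-check can fail fast before the request is sent; a sketch (the `check_response_format` helper is illustrative):

```python
# Only the "type" key is accepted inside response_format on this surface;
# reject anything else locally rather than waiting for a server error.
ALLOWED_KEYS = {"type"}

def check_response_format(rf):
    extra = set(rf) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"unsupported response_format keys: {sorted(extra)}")
    return rf

fmt = check_response_format({"type": "json_object"})  # JSON mode
```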
A list of tools the model may call. Currently, only functions are supported as a tool.
Controls which (if any) tool is called by the model.
none: the model will not call any tool and instead generates a message.
auto: the model can pick between generating a message or calling tools.
required: the model is instructed to call one or more tools (model/runtime dependent).
Allowed values: auto, none, required
Enable parallel tool/function calling (if supported).
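A function tool (the only supported tool type) plus a tool_choice can be attached to the request like this; the `get_weather` function is a made-up example:

```python
# A function tool definition with a JSON Schema for its arguments,
# paired with tool_choice="auto" so the model decides whether to call it.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
request_fragment = {"tools": [tool], "tool_choice": "auto"}
```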
Penalizes new tokens based on whether they appear in the text so far.
Penalizes new tokens based on their existing frequency in the text so far.
Include per-token log probabilities in the response (when supported by the model/runtime).
Number of most likely tokens to return at each position (requires logprobs=true).
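The dependency noted above (top_logprobs requires logprobs=true) can also be checked locally; a sketch (the `logprob_params` helper is illustrative):

```python
# Build the logprobs-related request fields, enforcing that top_logprobs
# is only sent when logprobs is enabled.
def logprob_params(logprobs, top_logprobs=None):
    if top_logprobs is not None and not logprobs:
        raise ValueError("top_logprobs requires logprobs=true")
    params = {"logprobs": logprobs}
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs
    return params
```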
Modify the likelihood of specified tokens appearing in the completion.
Maps token id (string) to bias (number).
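The mapping looks like this in practice; the token ids below are arbitrary placeholders, not real vocabulary entries:

```python
# logit_bias maps token ids (as strings) to a number added to that
# token's logit before sampling.
logit_bias = {
    "1234": -100,  # large negative bias: effectively ban this token
    "5678": 5,     # positive bias: make this token more likely
}
request_fragment = {"logit_bias": logit_bias}
```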
A unique identifier representing your end-user.
Additional metadata to store with the request for tracing.
Truncate chat prompts (in tokens) to this length by evicting older messages first.
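The eviction order described above (oldest messages dropped first) can be approximated client-side; a sketch using a whitespace-split word count as a stand-in for a real tokenizer (the helpers are illustrative):

```python
# Rough stand-in for a tokenizer: count whitespace-separated words.
def estimate_tokens(msg):
    return len(msg["content"].split())

# Drop the oldest messages until the estimated total fits the budget,
# always keeping at least the most recent message.
def truncate_messages(messages, budget):
    msgs = list(messages)
    while len(msgs) > 1 and sum(estimate_tokens(m) for m in msgs) > budget:
        msgs.pop(0)  # evict the oldest message first
    return msgs

history = [
    {"role": "user", "content": "first question with several words here"},
    {"role": "assistant", "content": "an earlier answer"},
    {"role": "user", "content": "latest question"},
]
trimmed = truncate_messages(history, budget=6)
```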
What to do when prompt plus max tokens exceeds context window.
Allowed values: truncate, error
Return token IDs alongside text (populates choices[].token_ids).
Isolation key for prompt caching to separate cache entries (if supported).
Return raw output from the model.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Whether to include performance metrics in the response body.
Note: this inference surface may accept the field but not include any extra metrics in the response body.
Echo back the prompt in addition to the completion.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Echo back the last N tokens of the prompt (if supported).
Whether the model should ignore the EOS token.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Speculative decoding prompt or token IDs (if supported).
OpenAI-compatible predicted output for speculative decoding (if supported).
Controls reasoning behavior for supported models (model/runtime dependent).
Allowed values: low, medium, high, none
Controls how historical assistant reasoning content is included in the prompt (if supported).
Allowed values: disabled, interleaved, preserved
Alternative Anthropic-compatible config for reasoning (if supported).
Deprecated (OpenAI). Use tools instead.
Deprecated (OpenAI). Use tool_choice instead.
Allowed values: auto, none
Successful Response
A unique identifier of the response.
The Unix time in seconds when the response was generated.
The model used for the chat completion.
The list of chat completion choices.
The object type, which is always "chat.completion".
Optional performance metrics (if enabled/supported).
Optional prompt token ids (when enabled/supported).
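Putting the response fields above together, a non-streaming body can be read like this; the example body is fabricated for illustration, with field names taken from the descriptions above:

```python
import json

# A fabricated example response body with the documented fields.
raw = json.dumps({
    "id": "chatcmpl-abc123",
    "created": 1700000000,
    "model": "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"},
    ],
})

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
# answer == "Hello!"; resp["object"] is always "chat.completion"
```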