Near-compatible chat completions endpoint on the routed Tensormesh On-Demand host.
Host: external.nebius.tensormesh.ai
Authorization: Bearer <API_KEY>
Bearer authentication using your On-Demand API key. Format: Bearer <API_KEY>
X-User-Id: <uuid>
Tensormesh user id used for attribution and routing.
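The two request headers above can be assembled as follows; a minimal sketch in Python (the `build_headers` helper and the placeholder values are illustrative, not part of any SDK):

```python
# Build the auth headers for the On-Demand host described above.
# Values passed in below are placeholders, not real credentials.
def build_headers(api_key, user_id):
    return {
        "Authorization": f"Bearer {api_key}",  # Bearer auth with your On-Demand API key
        "X-User-Id": user_id,                  # Tensormesh user id for attribution/routing
        "Content-Type": "application/json",
    }

headers = build_headers("MY_KEY", "123e4567-e89b-12d3-a456-426614174000")
```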
The On-Demand served model name to use.
Tip: discover served model names from the Control Plane model inventory,
for example via tm models list or the published Control Plane API
reference in these docs.
"openai-gpt-oss-120b-gpu-type-h200x1_8nic16"
A list of messages comprising the conversation so far.
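A minimal request body combines the served model name with the message list; a sketch (the model name is the example from this page, and the message contents are made up):

```python
import json

# Minimal chat completions payload: served model name plus conversation so far.
payload = {
    "model": "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}
body = json.dumps(payload)  # serialized JSON request body
```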
Sampling temperature. Higher values make the output more random.
Note: temperature=0 is greedy sampling.
Nucleus sampling. We generally recommend altering this or temperature but not both.
The maximum number of tokens to generate in the completion.
If set too low, the model may hit finish_reason="length" before producing useful message.content.
Alternative name for max_tokens (this inference surface accepts both; if both are set, behavior is runtime-dependent).
How many choices to generate.
Note: when using greedy sampling (temperature=0), n must be 1 (otherwise a 400 error).
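The greedy-sampling constraint above can be pre-checked on the client before sending a request; a sketch (the `validate_sampling` helper is illustrative, not part of any SDK):

```python
# Mirror the server rule client-side: with greedy sampling (temperature=0)
# only a single choice is allowed; otherwise the API returns a 400.
def validate_sampling(temperature, n):
    if temperature == 0 and n != 1:
        raise ValueError("temperature=0 (greedy) requires n=1")

validate_sampling(0.7, 3)  # fine: non-greedy sampling with multiple choices
```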
Stop sequence(s) where the API will stop generating further tokens.
Top-k sampling. Filters candidates to the K most likely tokens at each step.
Minimum probability threshold for token selection (alternative to top_p / top_k).
Typical-p sampling parameter.
Random seed for best-effort deterministic sampling (model/runtime dependent).
Applies a penalty to repeated tokens to discourage repetition.
Target perplexity for Mirostat sampling (if supported).
Learning rate for Mirostat sampling (if supported).
Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events (SSE) as they become available, with the stream terminated by a data: [DONE] message.
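Consuming the data-only SSE stream described above amounts to reading "data: ..." lines and stopping at the [DONE] sentinel; a sketch with a fabricated event sequence (delta shapes follow the OpenAI-compatible streaming format):

```python
import json

# Parse data-only server-sent events: each event line is "data: <json>",
# and "data: [DONE]" terminates the stream.
def parse_sse(lines):
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank/keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        yield json.loads(data)

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(ev["choices"][0]["delta"]["content"] for ev in parse_sse(sample))
# text == "Hello"
```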
Streaming options. Only valid when stream=true.
Allows forcing the model to produce a specific output format.
Supported values:
{ "type": "json_object" } (JSON mode)
{ "type": "text" }
Note: extra keys inside response_format are rejected by this inference surface.
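Because extra keys inside response_format are rejected, a client-side pre-check can fail fast before the request is sent; a sketch (the `check_response_format` helper is illustrative):

```python
# Only the "type" key is accepted inside response_format on this surface;
# reject anything else locally rather than waiting for a server error.
ALLOWED_KEYS = {"type"}

def check_response_format(rf):
    extra = set(rf) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"unsupported response_format keys: {sorted(extra)}")
    return rf

fmt = check_response_format({"type": "json_object"})  # JSON mode
```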
A list of tools the model may call. Currently, only functions are supported as a tool.
Controls which (if any) tool is called by the model.
none: the model will not call any tool and instead generates a message.
auto: the model can pick between generating a message or calling tools.
required: the model is instructed to call one or more tools (model/runtime dependent).
Allowed values: auto, none, required
Enable parallel tool/function calling (if supported).
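A function tool (the only supported tool type) plus a tool_choice can be attached to the request like this; the `get_weather` function is a made-up example:

```python
# A function tool definition with a JSON Schema for its arguments,
# paired with tool_choice="auto" so the model decides whether to call it.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
request_fragment = {"tools": [tool], "tool_choice": "auto"}
```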
Penalizes new tokens based on whether they appear in the text so far.
Penalizes new tokens based on their existing frequency in the text so far.
Include per-token log probabilities in the response (when supported by the model/runtime).
Number of most likely tokens to return at each position (requires logprobs=true).
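The dependency noted above (top_logprobs requires logprobs=true) can also be checked locally; a sketch (the `logprob_params` helper is illustrative):

```python
# Build the logprobs-related request fields, enforcing that top_logprobs
# is only sent when logprobs is enabled.
def logprob_params(logprobs, top_logprobs=None):
    if top_logprobs is not None and not logprobs:
        raise ValueError("top_logprobs requires logprobs=true")
    params = {"logprobs": logprobs}
    if top_logprobs is not None:
        params["top_logprobs"] = top_logprobs
    return params
```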
Modify the likelihood of specified tokens appearing in the completion.
Maps token id (string) to bias (number).
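The mapping looks like this in practice; the token ids below are arbitrary placeholders, not real vocabulary entries:

```python
# logit_bias maps token ids (as strings) to a number added to that
# token's logit before sampling.
logit_bias = {
    "1234": -100,  # large negative bias: effectively ban this token
    "5678": 5,     # positive bias: make this token more likely
}
request_fragment = {"logit_bias": logit_bias}
```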
A unique identifier representing your end-user.
Additional metadata to store with the request for tracing.
Truncate chat prompts (in tokens) to this length by evicting older messages first.
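The eviction order described above (oldest messages dropped first) can be approximated client-side; a sketch using a whitespace-split word count as a stand-in for a real tokenizer (the helpers are illustrative):

```python
# Rough stand-in for a tokenizer: count whitespace-separated words.
def estimate_tokens(msg):
    return len(msg["content"].split())

# Drop the oldest messages until the estimated total fits the budget,
# always keeping at least the most recent message.
def truncate_messages(messages, budget):
    msgs = list(messages)
    while len(msgs) > 1 and sum(estimate_tokens(m) for m in msgs) > budget:
        msgs.pop(0)  # evict the oldest message first
    return msgs

history = [
    {"role": "user", "content": "first question with several words here"},
    {"role": "assistant", "content": "an earlier answer"},
    {"role": "user", "content": "latest question"},
]
trimmed = truncate_messages(history, budget=6)
```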
What to do when prompt plus max tokens exceeds context window.
Allowed values: truncate, error
Return token IDs alongside text (populates choices[].token_ids).
Isolation key for prompt caching to separate cache entries (if supported).
Return raw output from the model.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Whether to include performance metrics in the response body.
Note: this inference surface may accept the field but not include any extra metrics in the response body.
Echo back the prompt in addition to the completion.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Echo back the last N tokens of the prompt (if supported).
Whether the model should ignore the EOS token.
Note: support is model/runtime dependent. Some deployments may return an error when enabled.
Speculative decoding prompt or token IDs (if supported).
OpenAI-compatible predicted output for speculative decoding (if supported).
Controls reasoning behavior for supported models (model/runtime dependent).
Allowed values: low, medium, high, none
Controls how historical assistant reasoning content is included in the prompt (if supported).
Allowed values: disabled, interleaved, preserved
Alternative Anthropic-compatible config for reasoning (if supported).
Deprecated (OpenAI). Use tools instead.
Deprecated (OpenAI). Use tool_choice instead.
Allowed values: auto, none
Successful Response
A unique identifier of the response.
The Unix time in seconds when the response was generated.
The model used for the chat completion.
The list of chat completion choices.
The object type, which is always "chat.completion".
Optional performance metrics (if enabled/supported).
Optional prompt token ids (when enabled/supported).
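Putting the response fields above together, a non-streaming body can be read like this; the example body is fabricated for illustration, with field names taken from the descriptions above:

```python
import json

# A fabricated example response body with the documented fields.
raw = json.dumps({
    "id": "chatcmpl-abc123",
    "created": 1700000000,
    "model": "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"},
    ],
})

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
# answer == "Hello!"; resp["object"] is always "chat.completion"
```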