POST /v1/chat/completions
Create Chat Completion
curl --request POST \
  --url https://external.nebius.tensormesh.ai/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'X-User-Id: <x-user-id>' \
  --data '
{
  "model": "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Write a haiku about cloud compute."
    }
  ]
}
'
{
  "id": "<string>",
  "created": 123,
  "model": "<string>",
  "choices": [
    {
      "index": 123,
      "message": {
        "role": "<string>",
        "content": "<string>",
        "refusal": "<string>",
        "annotations": [
          {}
        ],
        "audio": {},
        "function_call": {},
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "<string>",
              "arguments": "<string>"
            },
            "id": "<string>"
          }
        ],
        "reasoning": "<string>"
      },
      "finish_reason": "<string>",
      "logprobs": {},
      "raw_output": {},
      "stop_reason": "<string>",
      "token_ids": [
        123
      ]
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 123,
    "total_tokens": 123,
    "completion_tokens": 123,
    "prompt_tokens_details": {}
  },
  "perf_metrics": {},
  "prompt_token_ids": [
    123
  ]
}
Use this page when you want to call a specific routed On-Demand deployment over HTTP.
  • Auth: Authorization: Bearer <API_KEY>
  • Routing: required X-User-Id: <uuid>
  • Compatibility: near-compatible with the OpenAI chat completions API; differences are noted per field below
  • Host selection: choose the external Tensormesh host for your provider, such as external.nebius.tensormesh.ai
  • Best for: requests against a served gateway model that already exists in your Tensormesh environment
Other On-Demand reference pages: For raw request setup and operator flow context, see API Quickstart. If you need the public serverless host instead, use Serverless Chat Completions.
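As a sketch, the curl request above can be built from the Python standard library. The host, model name, and placeholder credentials mirror the example; substitute your own values. Sending the request is left out here.

```python
import json
import urllib.request

# Illustrative host, mirroring the curl example above.
BASE_URL = "https://external.nebius.tensormesh.ai"

def build_chat_request(api_key, user_id, model, messages, **params):
    """Build an urllib Request for POST /v1/chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-User-Id": user_id,
        },
        method="POST",
    )

req = build_chat_request(
    "<token>", "<x-user-id>",
    "openai-gpt-oss-120b-gpu-type-h200x1_8nic16",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about cloud compute."},
    ],
)
# urllib.request.urlopen(req) would perform the actual call.
```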

Authorizations

Authorization
string
header
required

Bearer authentication using your On-Demand API key. Format: Bearer <API_KEY>

Headers

X-User-Id
string<uuid>
required

Tensormesh user id used for attribution and routing.

Body

application/json
model
string
required

The On-Demand served model name to use.

Tip: discover served model names from Control Plane model inventory, for example via tm models list or the published Control Plane API reference in these docs.

Example:

"openai-gpt-oss-120b-gpu-type-h200x1_8nic16"

messages
ChatMessage · object[]
required

A list of messages comprising the conversation so far.

temperature
number | null

Sampling temperature. Higher values make the output more random.

Note: temperature=0 is greedy sampling.

top_p
number | null

Nucleus sampling. We generally recommend altering this or temperature but not both.

max_tokens
integer | null

The maximum number of tokens to generate in the completion.

If set too low, the model may hit finish_reason="length" before producing a useful message.content.

max_completion_tokens
integer | null

Alternative name for max_tokens (this inference surface accepts both; if both are set, behavior is runtime-dependent).

n
integer | null
default:1

How many choices to generate.

Note: when using greedy sampling (temperature=0), n must be 1 (otherwise a 400 error).

stop

Stop sequence(s) where the API will stop generating further tokens.

top_k
integer | null

Top-k sampling. Filters candidates to the K most likely tokens at each step.

min_p
number | null

Minimum probability threshold for token selection (alternative to top_p / top_k).

typical_p
number | null

Typical-p sampling parameter.

seed
integer | null

Random seed for best-effort deterministic sampling (model/runtime dependent).

repetition_penalty
number | null

Applies a penalty to repeated tokens to discourage repetition.

mirostat_target
number | null

Target perplexity for Mirostat sampling (if supported).

mirostat_lr
number | null

Learning rate for Mirostat sampling (if supported).
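A minimal sketch of assembling the sampling fields above into a request body. Per the notes, prefer tuning temperature or top_p (not both), and with temperature=0 (greedy) n must stay 1; unset fields are dropped so the runtime applies its own defaults.

```python
def sampling_params(temperature=None, top_p=None, top_k=None,
                    seed=None, repetition_penalty=None, n=1):
    """Collect sampling fields, validating the greedy-sampling constraint."""
    if temperature == 0 and n != 1:
        raise ValueError("greedy sampling (temperature=0) requires n=1")
    params = {
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "seed": seed,
        "repetition_penalty": repetition_penalty,
        "n": n,
    }
    # Drop unset fields so the runtime applies its own defaults.
    return {k: v for k, v in params.items() if v is not None}

body = {"model": "<model>", "messages": [], **sampling_params(top_k=40, seed=7)}
```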

stream
boolean | null
default:false

Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events (SSE) as they become available, with the stream terminated by a data: [DONE] message.
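When stream=true, the body arrives as data-only SSE events terminated by a data: [DONE] message. A minimal parser sketch, assuming each event carries a JSON chunk in the OpenAI-style streaming delta shape (the sample chunks below are illustrative):

```python
import json

def iter_sse_chunks(lines):
    """Yield parsed JSON chunks from data-only SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# Illustrative stream; real chunks follow the chat.completion.chunk shape.
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    'data: [DONE]',
]
text = "".join(
    c["choices"][0]["delta"].get("content", "") for c in iter_sse_chunks(sample)
)
# text == "Hello world"
```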

stream_options
StreamOptions · object

Streaming options. Only valid when stream=true.

response_format
ResponseFormat · object

Allows forcing the model to produce a specific output format.

Supported values:

  • { "type": "json_object" } (JSON mode)
  • { "type": "text" }

Note: extra keys inside response_format are rejected by this inference surface.
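Since extra keys inside response_format are rejected, a defensive sketch that only attaches one of the two supported values:

```python
# The two response_format values this surface documents as supported.
ALLOWED = ({"type": "json_object"}, {"type": "text"})

def with_response_format(body, fmt):
    """Attach a response_format, refusing shapes this surface would reject."""
    if fmt not in ALLOWED:
        raise ValueError(f"unsupported response_format: {fmt}")
    return {**body, "response_format": fmt}

body = with_response_format(
    {"model": "<model>",
     "messages": [{"role": "user", "content": "Reply as a JSON object."}]},
    {"type": "json_object"},
)
# In JSON mode, message.content should itself parse with json.loads().
```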

tools
ChatCompletionTool · object[] | null

A list of tools the model may call. Currently, only functions are supported as a tool.

tool_choice
default:auto

Controls which (if any) tool is called by the model.

  • none: the model will not call any tool and instead generates a message.
  • auto: the model can pick between generating a message or calling tools.
  • required: the model is instructed to call one or more tools (model/runtime dependent).
Available options:
auto,
none,
required
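A sketch of a single function tool plus handling of a tool_calls response. The tool name and schema are illustrative; per the response example above, tool_calls[].function.arguments arrives as a JSON string that must be parsed.

```python
import json

# Illustrative function tool definition.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
request_body = {"model": "<model>", "messages": [],  # messages omitted here
                "tools": tools, "tool_choice": "auto"}

def extract_tool_calls(message):
    """Return (name, parsed_arguments, id) tuples from an assistant message."""
    return [
        (tc["function"]["name"], json.loads(tc["function"]["arguments"]), tc["id"])
        for tc in message.get("tool_calls") or []
    ]

# Illustrative assistant message as it appears in choices[].message.
msg = {"role": "assistant", "content": None, "tool_calls": [
    {"type": "function", "id": "call_1",
     "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}},
]}
calls = extract_tool_calls(msg)
```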
parallel_tool_calls
boolean | null

Enable parallel tool/function calling (if supported).

presence_penalty
number | null

Penalizes new tokens based on whether they appear in the text so far.

frequency_penalty
number | null

Penalizes new tokens based on their existing frequency in the text so far.

logprobs
boolean | null

Include per-token log probabilities in the response (when supported by the model/runtime).

top_logprobs
integer | null

Number of most likely tokens to return at each position (requires logprobs=true).

logit_bias
Logit Bias · object

Modify the likelihood of specified tokens appearing in the completion.

Maps token id (string) to bias (number).
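A small sketch of building this map. Note the keys are token id strings, not integers; the ids below are illustrative and depend on the served model's tokenizer.

```python
def make_logit_bias(biases):
    """Coerce integer token ids to the string keys this field expects."""
    return {str(tid): float(b) for tid, b in biases.items()}

# Hypothetical token ids; look up real ids with the model's tokenizer.
body_fragment = {"logit_bias": make_logit_bias({50256: -100, 198: 2.5})}
```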

user
string | null

A unique identifier representing your end-user.

metadata
Metadata · object

Additional metadata to store with the request for tracing.

prompt_truncate_len
integer | null

Truncate chat prompts (in tokens) to this length by evicting older messages first.

context_length_exceeded_behavior
enum<string> | null
default:truncate

What to do when prompt plus max tokens exceeds context window.

Available options:
truncate,
error
return_token_ids
boolean | null
default:false

Return token IDs alongside text (populates choices[].token_ids).

prompt_cache_isolation_key
string | null

Isolation key for prompt caching to separate cache entries (if supported).

raw_output
boolean | null
default:false

Return raw output from the model.

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

perf_metrics_in_response
boolean | null
default:false

Whether to include performance metrics in the response body.

Note: this inference surface may accept the field but not include any extra metrics in the response body.

echo
boolean | null
default:false

Echo back the prompt in addition to the completion.

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

echo_last
integer | null

Echo back the last N tokens of the prompt (if supported).

ignore_eos
boolean | null
default:false

Whether the model should ignore the EOS token (model/runtime dependent).

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

speculation

Speculative decoding prompt or token IDs (if supported).

prediction

OpenAI-compatible predicted output for speculative decoding (if supported).

reasoning_effort

Controls reasoning behavior for supported models (model/runtime dependent).

Available options:
low,
medium,
high,
none
reasoning_history
enum<string> | null

Controls how historical assistant reasoning content is included in the prompt (if supported).

Available options:
disabled,
interleaved,
preserved
thinking
ThinkingConfigEnabled · object

Alternative Anthropic-compatible config for reasoning (if supported).

functions
ChatCompletionFunction · object[] | null

Deprecated (OpenAI). Use tools instead.

function_call

Deprecated (OpenAI). Use tool_choice instead.

Available options:
auto,
none

Response

Successful Response

id
string
required

A unique identifier of the response.

created
integer
required

The Unix time in seconds when the response was generated.

model
string
required

The model used for the chat completion.

choices
ChatCompletionResponseChoice · object[]
required

The list of chat completion choices.

object
string
default:chat.completion

The object type, which is always "chat.completion".

usage
UsageInfo · object
perf_metrics
Perf Metrics · object

Optional performance metrics (if enabled/supported).

prompt_token_ids
integer[] | null

Optional prompt token ids (when enabled/supported).
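A sketch of pulling the generated text and token usage out of a response that follows the schema above. The sample response values are illustrative.

```python
def first_message(resp):
    """Return (content, finish_reason) from the first choice of a response."""
    choice = resp["choices"][0]
    return choice["message"].get("content"), choice.get("finish_reason")

# Illustrative response matching the documented schema.
resp = {
    "id": "cmpl-1", "created": 1700000000, "model": "<model>",
    "object": "chat.completion",
    "choices": [{"index": 0, "finish_reason": "stop",
                 "message": {"role": "assistant",
                             "content": "Servers hum softly"}}],
    "usage": {"prompt_tokens": 21, "completion_tokens": 7, "total_tokens": 28},
}
content, reason = first_message(resp)
total = resp["usage"]["total_tokens"]
```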