POST https://serverless.tensormesh.ai/v1/chat/completions
Create Chat Completion
curl --request POST \
  --url https://serverless.tensormesh.ai/v1/chat/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "MiniMaxAI/MiniMax-M2.5",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ]
}
'
{
  "id": "<string>",
  "created": 123,
  "model": "<string>",
  "choices": [
    {
      "index": 123,
      "message": {
        "role": "<string>",
        "content": "<string>",
        "refusal": "<string>",
        "annotations": [
          {}
        ],
        "audio": {},
        "function_call": {},
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "<string>",
              "arguments": "<string>"
            },
            "id": "<string>"
          }
        ],
        "reasoning": "<string>"
      },
      "finish_reason": "<string>",
      "logprobs": {},
      "raw_output": {},
      "stop_reason": "<string>",
      "token_ids": [
        123
      ]
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 123,
    "total_tokens": 123,
    "completion_tokens": 123,
    "prompt_tokens_details": {}
  },
  "perf_metrics": {},
  "prompt_token_ids": [
    123
  ]
}
Use this page when you want the public serverless inference surface.
  • Auth: Authorization: Bearer <API_KEY>
  • Host: https://serverless.tensormesh.ai
  • Compatibility: OpenAI-compatible chat completions
  • Best for: the closest raw-HTTP match to an OpenAI-style chat request
If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value in the request body. If you are targeting a different serverless host override, or you only have inference credentials, get the model name from your Tensormesh environment before sending the request. Start with Choose A Serverless Model Name if you do not already know the model name you need.
Other verified serverless reference pages:
  • For raw request setup, see API Quickstart.
  • If you need a routed deployment instead of the shared serverless host, use On-Demand Chat Completions.
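The curl request above can also be built directly in Python with the standard library. This is a minimal sketch: the API key placeholder is an assumption you must replace, and the model string is the page's own example.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # assumption: substitute your serverless API key
URL = "https://serverless.tensormesh.ai/v1/chat/completions"

def build_request(model, messages, **params):
    """Build an OpenAI-compatible chat completion request for the serverless host."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "MiniMaxAI/MiniMax-M2.5",
    [{"role": "user", "content": "Hello, how are you?"}],
)
# Send with: urllib.request.urlopen(req)
```

Any extra keyword arguments (temperature, max_tokens, and so on) are merged into the JSON body, matching the body parameters documented below.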

Authorizations

Authorization
string
header
required

Bearer authentication using your serverless API key. Format: Bearer <API_KEY>

Body

application/json
model
string
required

Serverless model name to use.

If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value here. If you are targeting a different serverless host override, or you only have inference credentials, get the model name from your Tensormesh environment before sending requests.

Example:

"MiniMaxAI/MiniMax-M2.5"

messages
ChatMessage · object[]
required

A list of messages comprising the conversation so far.

temperature
number | null

Sampling temperature. Higher values make the output more random.

Note: temperature=0 is greedy sampling.

top_p
number | null

Nucleus sampling. We generally recommend altering this or temperature but not both.

max_tokens
integer | null

The maximum number of tokens to generate in the completion.

If set too low, the completion may stop with finish_reason="length" before message.content contains a complete answer.

max_completion_tokens
integer | null

Alternative name for max_tokens (this inference surface accepts both; if both are set, behavior is runtime-dependent).

n
integer | null
default:1

How many choices to generate.

Note: when using greedy sampling (temperature=0), n must be 1; otherwise the API returns a 400 error.
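A client-side check mirroring this server rule can catch the 400 before the request is sent. This is an illustrative sketch, not part of any SDK:

```python
def validate_sampling(params: dict) -> None:
    """Mirror the server rule: greedy sampling (temperature=0) forbids n > 1."""
    if params.get("temperature") == 0 and params.get("n", 1) != 1:
        raise ValueError("n must be 1 when temperature=0 (greedy sampling)")

validate_sampling({"temperature": 0, "n": 1})    # ok: greedy with a single choice
validate_sampling({"temperature": 0.7, "n": 3})  # ok: sampling allows multiple choices
```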

stop

Stop sequence(s) where the API will stop generating further tokens.

top_k
integer | null

Top-k sampling. Filters candidates to the K most likely tokens at each step.

min_p
number | null

Minimum probability threshold for token selection (alternative to top_p / top_k).

typical_p
number | null

Typical-p sampling parameter.

seed
integer | null

Random seed for best-effort deterministic sampling (model/runtime dependent).

repetition_penalty
number | null

Applies a penalty to repeated tokens to discourage repetition.

mirostat_target
number | null

Target perplexity for Mirostat sampling (if supported).

mirostat_lr
number | null

Learning rate for Mirostat sampling (if supported).

stream
boolean | null
default:false

Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events (SSE) as they become available, with the stream terminated by a data: [DONE] message.
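A streaming response can be consumed by reading the data-only SSE lines until the [DONE] sentinel. The sketch below parses a hand-built stream; the chunk shape follows the OpenAI-style chat.completion.chunk convention and is an assumption, not captured output from this API:

```python
import json

def parse_sse_chunks(lines):
    """Yield parsed chunk dicts from data-only SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # stream terminator sent by the server
        yield json.loads(payload)

# Hypothetical stream, for illustration only:
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(stream))
```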

stream_options
StreamOptions · object

Streaming options. Only valid when stream=true.

response_format
ResponseFormat · object

Allows forcing the model to produce a specific output format.

Supported values:

  • { "type": "json_object" } (JSON mode)
  • { "type": "text" }

Note: extra keys inside response_format are rejected by this inference surface.
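A JSON-mode request body might look like the sketch below; the returned message.content is then a JSON string you parse yourself. The example content string is hypothetical, not real model output:

```python
import json

payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [
        {"role": "user", "content": "Return a JSON object with keys name and age."}
    ],
    # Only {"type": ...} is accepted; extra keys (e.g. a schema) are rejected here.
    "response_format": {"type": "json_object"},
}

content = '{"name": "Ada", "age": 36}'  # hypothetical message.content from the model
data = json.loads(content)
```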

tools
ChatCompletionTool · object[] | null

A list of tools the model may call. Currently, only functions are supported as a tool.

tool_choice
default:auto

Controls which (if any) tool is called by the model.

  • none: the model will not call any tool and instead generates a message.
  • auto: the model can pick between generating a message or calling tools.
  • required: the model is instructed to call one or more tools (model/runtime dependent).
Available options:
auto,
none,
required
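A request combining tools and tool_choice could be assembled as below. The get_weather function and its schema are hypothetical, purely for illustration:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function, for illustration only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide between a message and a tool call
}
```

If the model decides to call the tool, the response's choices[].message.tool_calls entries carry the function name and a JSON-encoded arguments string, as shown in the response example at the top of this page.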
parallel_tool_calls
boolean | null

Enable parallel tool/function calling (if supported).

presence_penalty
number | null

Penalizes new tokens based on whether they appear in the text so far.

frequency_penalty
number | null

Penalizes new tokens based on their existing frequency in the text so far.

logprobs
boolean | null

Include per-token log probabilities in the response (when supported by the model/runtime).

top_logprobs
integer | null

Number of most likely tokens to return at each position (requires logprobs=true).

logit_bias
Logit Bias · object

Modify the likelihood of specified tokens appearing in the completion.

Maps token id (string) to bias (number).

user
string | null

A unique identifier representing your end-user.

metadata
Metadata · object

Additional metadata to store with the request for tracing.

prompt_truncate_len
integer | null

Truncate chat prompts (in tokens) to this length by evicting older messages first.
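The eviction order can be approximated client-side if you want to predict what survives truncation. This sketch uses a toy whitespace tokenizer; real token counts are model-specific:

```python
def truncate_prompt(messages, count_tokens, limit):
    """Approximate the server-side eviction: drop the oldest messages
    until the total token count fits within `limit`."""
    msgs = list(messages)
    while msgs and sum(count_tokens(m["content"]) for m in msgs) > limit:
        msgs.pop(0)  # evict oldest first, keeping the most recent turns
    return msgs

# Toy tokenizer (whitespace words), for illustration only.
kept = truncate_prompt(
    [
        {"role": "user", "content": "one two three"},
        {"role": "assistant", "content": "four five six"},
        {"role": "user", "content": "seven eight nine"},
    ],
    lambda s: len(s.split()),
    limit=5,
)
```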

context_length_exceeded_behavior
enum<string> | null
default:truncate

What to do when prompt plus max tokens exceeds context window.

Available options:
truncate,
error
return_token_ids
boolean | null
default:false

Return token IDs alongside text (populates choices[].token_ids).

prompt_cache_isolation_key
string | null

Isolation key for prompt caching to separate cache entries (if supported).

raw_output
boolean | null
default:false

Return raw output from the model.

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

perf_metrics_in_response
boolean | null
default:false

Whether to include performance metrics in the response body.

Note: this inference surface may accept the field but not include any extra metrics in the response body.

echo
boolean | null
default:false

Echo back the prompt in addition to the completion.

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

echo_last
integer | null

Echo back the last N tokens of the prompt (if supported).

ignore_eos
boolean | null
default:false

Whether the model should ignore the EOS token (model/runtime dependent).

Note: support is model/runtime dependent. Some deployments may return an error when enabled.

speculation

Speculative decoding prompt or token IDs (if supported).

prediction

OpenAI-compatible predicted output for speculative decoding (if supported).

reasoning_effort

Controls reasoning behavior for supported models (model/runtime dependent).

Available options:
low,
medium,
high,
none
reasoning_history
enum<string> | null

Controls how historical assistant reasoning content is included in the prompt (if supported).

Available options:
disabled,
interleaved,
preserved
thinking
ThinkingConfigEnabled · object

Alternative Anthropic-compatible config for reasoning (if supported).

functions
ChatCompletionFunction · object[] | null

Deprecated (OpenAI). Use tools instead.

function_call

Deprecated (OpenAI). Use tool_choice instead.

Available options:
auto,
none

Response

Successful Response

id
string
required

A unique identifier of the response.

created
integer
required

The Unix time in seconds when the response was generated.

model
string
required

The model used for the chat completion.

choices
ChatCompletionResponseChoice · object[]
required

The list of chat completion choices.

object
string
default:chat.completion

The object type, which is always "chat.completion".

usage
UsageInfo · object
perf_metrics
Perf Metrics · object

Optional performance metrics (if enabled/supported).

prompt_token_ids
integer[] | null

Optional prompt token ids (when enabled/supported).
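To illustrate how these response fields fit together, the sketch below reads the usual values out of a hand-built response dict that mirrors the schema above (not live model output):

```python
# Hypothetical response mirroring the schema documented above.
response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "MiniMaxAI/MiniMax-M2.5",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi! How can I help?"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
}

choice = response["choices"][0]
if choice["finish_reason"] == "length":
    # The completion was cut off by max_tokens; consider retrying with a higher limit.
    pass
text = choice["message"]["content"]
total = response["usage"]["total_tokens"]
```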