The SDK exposes `chat.completions`, `models`, `completions`, `responses`, `tokenize`, `detokenize`, `health`, and `version` on both surfaces, as `client.inference.serverless.*` and `client.inference.on_demand.*`.
Model naming depends on the selected surface:
- serverless expects a serverless model name
- on-demand expects the served gateway model name, not the Control Plane `modelId` UUID

`gateway_model_id` remains a local config compatibility key used by the CLI flow; its value is the served gateway model name string you send as `model`. `gateway_api_key` is the stored inference API key, used by the SDK as `inference_api_key`.
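As a concrete sketch of how those two config keys map onto request fields: the key names `gateway_model_id` and `gateway_api_key` come from the description above, but this helper (and the config dict shape) is hypothetical, not part of the SDK.

```python
# Hypothetical helper: map the local CLI config keys described above onto
# the fields a request actually uses. The config dict shape is an assumption;
# the key names gateway_model_id / gateway_api_key are from the docs.
def request_fields_from_config(config: dict) -> dict:
    return {
        # gateway_model_id holds the served gateway model name string,
        # which is what you send as `model` on the on-demand surface.
        "model": config["gateway_model_id"],
        # gateway_api_key is the stored inference API key, used by the
        # SDK as inference_api_key.
        "inference_api_key": config["gateway_api_key"],
    }

fields = request_fields_from_config(
    {"gateway_model_id": "my-served-model", "gateway_api_key": "sk-example"}
)
print(fields["model"])  # -> my-served-model
```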
Choosing A Model Name
- For serverless, choose a serverless model name that is valid for the selected host.
  - If you have Control Plane access for the same Tensormesh environment, discover published serverless models with `tm billing pricing serverless list`.
  - Use the returned `pricing[].model` value in your request.
  - If you only have inference credentials, or you are targeting a different serverless host override, ask your operator or admin for the exact serverless `model` string for that host before sending the request.
- For on-demand, use the served gateway model name.
  - If you are using the local operator flow, `tm init --sync` stores that served name as `gateway_model_id`, and `tm --output json config show` prints it.
  - If you also need a serverless model name, run `tm billing pricing serverless list` for the same Tensormesh environment, or ask your operator or admin for the exact serverless model string first.
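For the local operator flow, a minimal sketch of pulling the served name out of `tm --output json config show` output. The `gateway_model_id` key is from the docs above; the surrounding JSON shape and this helper are assumptions.

```python
import json

def served_model_name(config_json: str) -> str:
    """Extract the served gateway model name from `tm --output json config show`
    output. Only the gateway_model_id key comes from the docs; the rest of the
    JSON shape is assumed."""
    return json.loads(config_json)["gateway_model_id"]

# In a real flow you would capture the CLI output, e.g. via
# subprocess.run(["tm", "--output", "json", "config", "show"], ...).stdout.
# Here we use a canned example instead:
raw = '{"gateway_model_id": "my-served-model", "gateway_api_key": "sk-example"}'
print(served_model_name(raw))  # -> my-served-model
```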
Verified Serverless Endpoint Map
- `client.inference.serverless.chat.completions`: OpenAI-compatible chat completions
- `client.inference.serverless.models`: list models from the verified serverless host
- `client.inference.serverless.completions`: text completions
- `client.inference.serverless.responses`: responses API
- `client.inference.serverless.tokenize`: tokenize text
- `client.inference.serverless.detokenize`: convert token ids back to text
- `client.inference.serverless.health`: health endpoint
- `client.inference.serverless.version`: version endpoint
Serverless Chat Completions
Use a serverless model name here.

Serverless Model Listing
On the default public serverless host, model listing also works without an inference API key.

Serverless Text Completions
Serverless Responses
Tokenize And Detokenize
Health And Version
On the default public serverless host, these routes also work without an inference API key.

On-Demand Chat Completions
Use the served gateway model name here, not the Control Plane `modelId` UUID.
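Because the Control Plane `modelId` is a UUID while the served gateway name is an ordinary string, a quick sanity check can catch the common mistake of sending the UUID. This guard is a hypothetical convenience for your own code, not part of the SDK.

```python
import re

# A Control Plane modelId is a UUID; a served gateway model name is not.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def check_on_demand_model(name: str) -> str:
    """Reject values that look like a Control Plane modelId UUID rather than
    a served gateway model name."""
    if _UUID_RE.match(name):
        raise ValueError(
            "This looks like a Control Plane modelId UUID; the on-demand "
            "surface expects the served gateway model name instead."
        )
    return name

print(check_on_demand_model("my-served-model"))  # -> my-served-model
try:
    check_on_demand_model("123e4567-e89b-12d3-a456-426614174000")
except ValueError as exc:
    print("rejected:", exc)
```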
On-Demand Models, Responses, And Utilities
Use the same routed On-Demand host and `X-User-Id` setup for the other endpoint namespaces.
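One way to avoid repeating that setup is to collect the routed host and `X-User-Id` header once and reuse the bundle for every namespace. The `X-User-Id` header is from the docs above; the kwarg names used to carry these settings (`base_url`, `default_headers`, `inference_api_key`) are assumptions, not a documented constructor signature.

```python
# Hypothetical: gather the routed on-demand connection settings once and
# reuse them for every endpoint namespace (chat, completions, tokenize, ...).
def on_demand_settings(base_url: str, user_id: str, api_key: str) -> dict:
    return {
        "base_url": base_url,                       # routed on-demand host
        "default_headers": {"X-User-Id": user_id},  # per-user routing header
        "inference_api_key": api_key,               # stored inference API key
    }

settings = on_demand_settings("https://gateway.example", "user-123", "sk-example")
print(settings["default_headers"]["X-User-Id"])  # -> user-123
```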
Streaming On Serverless
Tool Calling
- tool calling is documented on the chat-completions surface only
- `text_deltas()` is a text-oriented helper; use `with_streaming_response` if you need raw stream lines for richer event handling
- if your current OpenAI or Fireworks app depends on broader tool-stream semantics, verify the exact wire behavior against your target deployment before migrating
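The text-oriented streaming pattern looks roughly like the sketch below. `text_deltas()` is the helper named above; the fake generator stands in for a real SDK stream so the pattern is runnable on its own, and is not how the SDK actually produces chunks.

```python
from typing import Iterable, Iterator

def collect_text(deltas: Iterable[str]) -> str:
    """Accumulate streamed text chunks into the final assistant message."""
    parts = []
    for chunk in deltas:
        parts.append(chunk)  # in an app, write each chunk to stdout/UI as it arrives
    return "".join(parts)

def fake_text_deltas() -> Iterator[str]:
    # Stand-in for stream.text_deltas() on a real streaming response.
    yield from ["Hel", "lo", ", wor", "ld"]

print(collect_text(fake_text_deltas()))  # -> Hello, world
```

If you need raw stream lines or richer event handling than plain text, the docs point at `with_streaming_response` instead of this helper.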
Structured Output
The currently documented structured-output mode is JSON mode:

- `response_format.type` currently supports only `json_object` and `text`
- JSON Schema-style `response_format={"type": "json_schema", ...}` is not supported on this surface
- unsupported extra keys inside `response_format` are rejected explicitly by the SDK instead of being silently dropped
- use `client.inference.serverless.responses` when you want the verified serverless responses endpoint instead of chat completions
- if an upstream runtime leaks leading `<think>...</think>` blocks into assistant text, the SDK strips them from `message.content`, stores the extracted text in `message.reasoning` when possible, and `text_deltas()` suppresses those leaked blocks in streamed text output
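The leaked-reasoning handling described above can be sketched as plain string processing. This only illustrates the strip-and-store idea; the SDK's actual rules may be more involved.

```python
import re

# Match a single leading <think>...</think> block, including surrounding
# whitespace, in leaked assistant text.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.S)

def split_leaked_reasoning(content):
    """Strip a leading <think>...</think> block from assistant text and return
    (visible_content, reasoning_or_None) -- mirroring how the SDK moves the
    extracted text into message.reasoning while cleaning message.content."""
    m = _THINK_RE.match(content)
    if not m:
        return content, None
    return content[m.end():], m.group(1).strip()

text, reasoning = split_leaked_reasoning(
    "<think>weigh the options first</think>The answer is 42."
)
print(text)       # -> The answer is 42.
print(reasoning)  # -> weigh the options first
```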

