The SDK exposes two inference surfaces with the same general shape:
  • client.inference.serverless.chat.completions
  • client.inference.serverless.models
  • client.inference.serverless.completions
  • client.inference.serverless.responses
  • client.inference.serverless.tokenize
  • client.inference.serverless.detokenize
  • client.inference.serverless.health
  • client.inference.serverless.version
  • client.inference.on_demand.chat.completions
  • client.inference.on_demand.models
  • client.inference.on_demand.completions
  • client.inference.on_demand.responses
  • client.inference.on_demand.tokenize
  • client.inference.on_demand.detokenize
  • client.inference.on_demand.health
  • client.inference.on_demand.version
Both surfaces expose the same endpoint namespaces: chat.completions, models, completions, responses, tokenize, detokenize, health, and version. Model naming depends on the selected surface:
  • serverless expects a serverless model name
  • on-demand expects the served gateway model name, not the Control Plane modelId UUID
  • gateway_model_id remains a local config compatibility key used by the CLI flow; its value is the served gateway model name string you send as model
If you are coming from the CLI-managed flow, gateway_api_key is the stored inference API key used by the SDK as inference_api_key.
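If you want to reuse the CLI-managed configuration programmatically, one option is to map the stored keys onto the SDK's inputs. This is a sketch: the helper name sdk_args_from_cli_config is hypothetical, and it assumes tm --output json config show emits a flat JSON object containing gateway_api_key and gateway_model_id.

```python
import json


def sdk_args_from_cli_config(config: dict) -> tuple[str, str]:
    """Map CLI-managed config keys onto SDK inputs.

    gateway_api_key  -> the inference_api_key constructor argument
    gateway_model_id -> the served gateway model name you pass as `model`
    """
    return config["gateway_api_key"], config["gateway_model_id"]


# Stand-in for json.loads of the `tm --output json config show` output.
cli_config = json.loads('{"gateway_api_key": "KEY", "gateway_model_id": "SERVED_NAME"}')
inference_api_key, served_model = sdk_args_from_cli_config(cli_config)
print(inference_api_key, served_model)  # KEY SERVED_NAME
```

In a real script you would obtain cli_config by running the tm command via subprocess and parsing its stdout.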

Choosing A Model Name

  • For serverless, choose a serverless model name that is valid for the selected host.
      • If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value in your request.
      • If you only have inference credentials, or you are targeting a different serverless host override, ask your operator or admin for the exact serverless model string for that host before sending the request.
  • For on-demand, use the served gateway model name.
      • If you are using the local operator flow, tm init --sync stores that served name as gateway_model_id, and tm --output json config show prints it.

Verified Serverless Endpoint Map

  • client.inference.serverless.chat.completions: OpenAI-compatible chat completions
  • client.inference.serverless.models: list models from the verified serverless host
  • client.inference.serverless.completions: text completions
  • client.inference.serverless.responses: responses API
  • client.inference.serverless.tokenize: tokenize text
  • client.inference.serverless.detokenize: convert token ids back to text
  • client.inference.serverless.health: health endpoint
  • client.inference.serverless.version: version endpoint

Serverless Chat Completions

Use a serverless model name here.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(completion.choices[0].message.content)

Serverless Model Listing

On the default public serverless host, model listing also works without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    models = client.inference.serverless.models.list()

print(models.data[0].id)

Serverless Text Completions

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.completions.create(
        model=serverless_model_name,
        prompt="Reply with ok.",
    )

print(completion.choices[0].text)

Serverless Responses

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    response = client.inference.serverless.responses.create(
        model=serverless_model_name,
        input="Say hello.",
    )

print(response.output[0].content[0].text)

Tokenize And Detokenize

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tokens = client.inference.serverless.tokenize.create(
        model=serverless_model_name,
        prompt="Hello!",
    )
    prompt = client.inference.serverless.detokenize.create(
        model=serverless_model_name,
        tokens=tokens.tokens,
    )

print(tokens.tokens)
print(prompt.prompt)

Health And Version

On the default public serverless host, these routes also work without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    health = client.inference.serverless.health.get()
    version = client.inference.serverless.version.get()

print(health.status)
print(version.version)
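The health endpoint is a natural readiness probe before sending traffic. The polling helper below is a sketch, not part of the SDK: it takes any zero-argument callable so it can be wrapped around health.get() (the exact healthy status value is deployment-specific, so inspect health.status on your host first).

```python
import time


def wait_until_healthy(check, attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll a health check until it succeeds or attempts run out.

    `check` is any zero-argument callable that returns a truthy value when
    the host is ready and raises (or returns falsy) otherwise, e.g.
    lambda: bool(client.inference.serverless.health.get().status).
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # treat transport errors as "not ready yet"
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Demo with a fake check that succeeds on the third call.
calls = {"n": 0}


def flaky() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until_healthy(flaky, attempts=5, delay=0.0))  # True
```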

On-Demand Chat Completions

Use the served gateway model name here, not the Control Plane modelId UUID.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    served_gateway_model_name = "YOUR_SERVED_GATEWAY_MODEL_NAME"
    completion = client.inference.on_demand.chat.completions.create(
        model=served_gateway_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(completion.choices[0].message.content)

On-Demand Models, Responses, And Utilities

Use the same on-demand base URL and X-User-Id setup for the other endpoint namespaces.
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    models = client.inference.on_demand.models.list()
    response = client.inference.on_demand.responses.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        input="Say hello.",
    )
    tokens = client.inference.on_demand.tokenize.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        prompt="Hello!",
    )

print(models.data[0].id if models.data else "no models")
print(response.output[0].content[0].text)
print(tokens.tokens)

Streaming On Serverless

from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Stream a short reply.")],
        stream=True,
    )

    for text in stream.text_deltas():
        print(text, end="")
The text-completions and responses endpoints also support raw SSE access:
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.completions.with_streaming_response.create(
        model=serverless_model_name,
        prompt="Stream a short reply.",
        stream=True,
    )
    try:
        for line in stream.iter_lines(decode_unicode=True):
            print(line)
    finally:
        stream.close()
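When consuming raw SSE lines yourself, you typically need to separate payload lines from keep-alives and the end-of-stream sentinel. The parser below is a sketch that assumes OpenAI-style framing (payload lines start with "data: " and the stream ends with "data: [DONE]"); verify the exact framing against your target deployment.

```python
import json


def parse_sse_line(line: str):
    """Parse one raw SSE line into a JSON chunk, or None for non-data lines.

    Assumes OpenAI-style framing: payload lines start with "data:" and the
    stream ends with a "data: [DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alives, comments, event fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    return json.loads(payload)


print(parse_sse_line('data: {"choices": [{"text": "ok"}]}'))
print(parse_sse_line("data: [DONE]"))
```

Feed each line from iter_lines through this parser and skip the None results.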

Tool Calling

from tensormesh import Tensormesh
from tensormesh.types import ChatCompletionFunction
from tensormesh.types import ChatCompletionTool
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tools = [
        ChatCompletionTool(
            function=ChatCompletionFunction(
                name="lookup_weather",
                description="Look up weather for a city.",
                parameters={
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string"
                        }
                    },
                    "required": ["city"],
                },
            ),
        )
    ]

    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="What is the weather in Hanoi?")],
        tools=tools,
        tool_choice="auto",
    )

choice = completion.choices[0]
print(choice.message.content)
print(choice.message.tool_calls)
Tool-calling caveats on this SDK surface:
  • tool calling is documented on the chat-completions surface only
  • text_deltas() is a text-oriented helper; use with_streaming_response if you need raw stream lines for richer event handling
  • if your current OpenAI or Fireworks app depends on broader tool-stream semantics, verify the exact wire behavior against your target deployment before migrating
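When the model returns tool_calls, your application runs the named function locally and sends the result back in a follow-up turn. The dispatcher below is a sketch: it assumes the OpenAI-compatible wire shape in which each tool call carries a function name and a JSON-encoded arguments string (inspect choice.message.tool_calls on your deployment to confirm), and lookup_weather here is a hypothetical local implementation.

```python
import json


def dispatch_tool_call(name: str, arguments_json: str, registry: dict):
    """Run the local function named by a tool call.

    `arguments_json` is the JSON-encoded argument string an OpenAI-compatible
    tool call carries; `registry` maps tool names to local callables.
    """
    if name not in registry:
        raise KeyError(f"model requested unknown tool: {name}")
    return registry[name](**json.loads(arguments_json))


def lookup_weather(city: str) -> str:
    # Hypothetical local implementation backing the lookup_weather tool.
    return f"Sunny in {city}"


registry = {"lookup_weather": lookup_weather}
print(dispatch_tool_call("lookup_weather", '{"city": "Hanoi"}', registry))
# Sunny in Hanoi
```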

Structured Output

The currently documented structured-output mode is JSON mode:
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage
from tensormesh.types import ResponseFormat

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Respond with valid JSON.")],
        response_format=ResponseFormat(type="json_object"),
    )

print(completion.choices[0].message.content)
Structured-output caveats on this SDK surface:
  • response_format.type currently supports only json_object and text
  • JSON Schema-style response_format={"type": "json_schema", ...} is not supported on this surface
  • unsupported extra keys inside response_format are rejected explicitly by the SDK instead of being silently dropped
  • use client.inference.serverless.responses when you want the verified serverless responses endpoint instead of chat completions
  • if an upstream runtime leaks leading <think>...</think> blocks into assistant text, the SDK strips them from message.content and stores the extracted text in message.reasoning when possible; text_deltas() likewise suppresses those leaked blocks in streamed output
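Even with json_object mode constraining the output, defensive parsing of message.content is worthwhile. The helper below is a sketch (parse_json_reply is a hypothetical name, not an SDK function) that turns malformed output into a clear error:

```python
import json


def parse_json_reply(content: str) -> dict:
    """Parse a JSON-mode assistant reply, failing loudly on bad output."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {content!r}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object from json_object mode")
    return parsed


print(parse_json_reply('{"greeting": "hello"}'))  # {'greeting': 'hello'}
```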

Raw Responses

Use raw responses when you want the unwrapped HTTP payload instead of the parsed SDK model.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    raw_response = client.inference.serverless.chat.completions.with_raw_response.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(raw_response.json())

Async Streaming Response Access

import asyncio

from tensormesh import AsyncTensormesh
from tensormesh.types import ChatMessage


async def main() -> None:
    async with AsyncTensormesh(
        inference_api_key="YOUR_INFERENCE_API_KEY",
    ) as client:
        serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
        raw_stream = await client.inference.serverless.chat.completions.with_streaming_response.create(
            model=serverless_model_name,
            messages=[ChatMessage(role="user", content="Stream a short reply.")],
            stream=True,
        )
        try:
            async for line in raw_stream.iter_lines():
                print(line)
        finally:
            await raw_stream.close()


asyncio.run(main())