The SDK exposes two inference surfaces with the same general shape:
  • client.inference.serverless.chat.completions
  • client.inference.serverless.models
  • client.inference.serverless.completions
  • client.inference.serverless.responses
  • client.inference.serverless.tokenize
  • client.inference.serverless.detokenize
  • client.inference.serverless.health
  • client.inference.serverless.version
  • client.inference.on_demand.chat.completions
  • client.inference.on_demand.models
  • client.inference.on_demand.completions
  • client.inference.on_demand.responses
  • client.inference.on_demand.tokenize
  • client.inference.on_demand.detokenize
  • client.inference.on_demand.health
  • client.inference.on_demand.version
Both surfaces expose the same endpoint namespaces: chat.completions, models, completions, responses, tokenize, detokenize, health, and version. Model naming depends on the selected surface:
  • serverless expects a serverless model name
  • on-demand expects the served gateway model name, not the Control Plane modelId UUID
  • gateway_model_id remains a local config compatibility key used by the CLI flow; its value is the served gateway model name string you send as model
If you are coming from the CLI-managed flow, gateway_api_key is the stored inference API key used by the SDK as inference_api_key.
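If you want to reuse the CLI-managed configuration programmatically, one option is to map the stored keys onto the SDK's inputs. This is a sketch: the helper name sdk_args_from_cli_config is hypothetical, and it assumes tm --output json config show emits a flat JSON object containing gateway_api_key and gateway_model_id.

```python
import json


def sdk_args_from_cli_config(config: dict) -> tuple[str, str]:
    """Map CLI-managed config keys onto SDK inputs.

    gateway_api_key  -> the inference_api_key constructor argument
    gateway_model_id -> the served gateway model name you pass as `model`
    """
    return config["gateway_api_key"], config["gateway_model_id"]


# Stand-in for json.loads of the `tm --output json config show` output.
cli_config = json.loads('{"gateway_api_key": "KEY", "gateway_model_id": "SERVED_NAME"}')
inference_api_key, served_model = sdk_args_from_cli_config(cli_config)
print(inference_api_key, served_model)  # KEY SERVED_NAME
```

In a real script you would obtain cli_config by running the tm command via subprocess and parsing its stdout.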

Choosing A Model Name

  • For serverless, choose a serverless model name that is valid for the selected host.
      • If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list and use the returned pricing[].model value in your request.
      • If you only have inference credentials, or you are targeting a different serverless host override, ask your operator or admin for the exact serverless model string for that host before sending the request.
  • For on-demand, use the served gateway model name.
      • If you are using the local operator flow, tm init --sync stores that served name as gateway_model_id, and tm --output json config show prints it.

Verified Serverless Endpoint Map

  • client.inference.serverless.chat.completions: OpenAI-compatible chat completions
  • client.inference.serverless.models: list models from the verified serverless host
  • client.inference.serverless.completions: text completions
  • client.inference.serverless.responses: responses API
  • client.inference.serverless.tokenize: tokenize text
  • client.inference.serverless.detokenize: convert token ids back to text
  • client.inference.serverless.health: health endpoint
  • client.inference.serverless.version: version endpoint

Serverless Chat Completions

Use a serverless model name here.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(completion.choices[0].message.content)

Serverless Model Listing

On the default public serverless host, model listing also works without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    models = client.inference.serverless.models.list()

print(models.data[0].id)

Serverless Text Completions

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.completions.create(
        model=serverless_model_name,
        prompt="Reply with ok.",
    )

print(completion.choices[0].text)

Serverless Responses

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    response = client.inference.serverless.responses.create(
        model=serverless_model_name,
        input="Say hello.",
    )

print(response.output[0].content[0].text)

Tokenize And Detokenize

from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tokens = client.inference.serverless.tokenize.create(
        model=serverless_model_name,
        prompt="Hello!",
    )
    prompt = client.inference.serverless.detokenize.create(
        model=serverless_model_name,
        tokens=tokens.tokens,
    )

print(tokens.tokens)
print(prompt.prompt)

Health And Version

On the default public serverless host, these routes also work without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    health = client.inference.serverless.health.get()
    version = client.inference.serverless.version.get()

print(health.status)
print(version.version)
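The health endpoint is a natural readiness probe before sending traffic. The polling helper below is a sketch, not part of the SDK: it takes any zero-argument callable so it can be wrapped around health.get() (the exact healthy status value is deployment-specific, so inspect health.status on your host first).

```python
import time


def wait_until_healthy(check, attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll a health check until it succeeds or attempts run out.

    `check` is any zero-argument callable that returns a truthy value when
    the host is ready and raises (or returns falsy) otherwise, e.g.
    lambda: bool(client.inference.serverless.health.get().status).
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # treat transport errors as "not ready yet"
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Demo with a fake check that succeeds on the third call.
calls = {"n": 0}


def flaky() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until_healthy(flaky, attempts=5, delay=0.0))  # True
```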

On-Demand Chat Completions

Use the served gateway model name here, not the Control Plane modelId UUID.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    served_gateway_model_name = "YOUR_SERVED_GATEWAY_MODEL_NAME"
    completion = client.inference.on_demand.chat.completions.create(
        model=served_gateway_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(completion.choices[0].message.content)

On-Demand Models, Responses, And Utilities

Use the same on-demand base URL and X-User-Id setup for the other endpoint namespaces.
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    models = client.inference.on_demand.models.list()
    response = client.inference.on_demand.responses.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        input="Say hello.",
    )
    tokens = client.inference.on_demand.tokenize.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        prompt="Hello!",
    )

print(models.data[0].id if models.data else "no models")
print(response.output[0].content[0].text)
print(tokens.tokens)

Streaming On Serverless

from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Stream a short reply.")],
        stream=True,
    )

    for text in stream.text_deltas():
        print(text, end="")
The text-completions and responses endpoints also support raw SSE access:
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.completions.with_streaming_response.create(
        model=serverless_model_name,
        prompt="Stream a short reply.",
        stream=True,
    )
    try:
        for line in stream.iter_lines(decode_unicode=True):
            print(line)
    finally:
        stream.close()
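When consuming raw SSE lines yourself, you typically need to separate payload lines from keep-alives and the end-of-stream sentinel. The parser below is a sketch that assumes OpenAI-style framing (payload lines start with "data: " and the stream ends with "data: [DONE]"); verify the exact framing against your target deployment.

```python
import json


def parse_sse_line(line: str):
    """Parse one raw SSE line into a JSON chunk, or None for non-data lines.

    Assumes OpenAI-style framing: payload lines start with "data:" and the
    stream ends with a "data: [DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alives, comments, event fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    return json.loads(payload)


print(parse_sse_line('data: {"choices": [{"text": "ok"}]}'))
print(parse_sse_line("data: [DONE]"))
```

Feed each line from iter_lines through this parser and skip the None results.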

Tool Calling

from tensormesh import Tensormesh
from tensormesh.types import ChatCompletionFunction
from tensormesh.types import ChatCompletionTool
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tools = [
        ChatCompletionTool(
            function=ChatCompletionFunction(
                name="lookup_weather",
                description="Look up weather for a city.",
                parameters={
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string"
                        }
                    },
                    "required": ["city"],
                },
            ),
        )
    ]

    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="What is the weather in Hanoi?")],
        tools=tools,
        tool_choice="auto",
    )

choice = completion.choices[0]
print(choice.message.content)
print(choice.message.tool_calls)
Tool-calling caveats on this SDK surface:
  • tool calling is documented on the chat-completions surface only
  • text_deltas() is a text-oriented helper; use with_streaming_response if you need raw stream lines for richer event handling
  • if your current OpenAI or Fireworks app depends on broader tool-stream semantics, verify the exact wire behavior against your target deployment before migrating
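When the model returns tool_calls, your application runs the named function locally and sends the result back in a follow-up turn. The dispatcher below is a sketch: it assumes the OpenAI-compatible wire shape in which each tool call carries a function name and a JSON-encoded arguments string (inspect choice.message.tool_calls on your deployment to confirm), and lookup_weather here is a hypothetical local implementation.

```python
import json


def dispatch_tool_call(name: str, arguments_json: str, registry: dict):
    """Run the local function named by a tool call.

    `arguments_json` is the JSON-encoded argument string an OpenAI-compatible
    tool call carries; `registry` maps tool names to local callables.
    """
    if name not in registry:
        raise KeyError(f"model requested unknown tool: {name}")
    return registry[name](**json.loads(arguments_json))


def lookup_weather(city: str) -> str:
    # Hypothetical local implementation backing the lookup_weather tool.
    return f"Sunny in {city}"


registry = {"lookup_weather": lookup_weather}
print(dispatch_tool_call("lookup_weather", '{"city": "Hanoi"}', registry))
# Sunny in Hanoi
```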

Structured Output

The currently documented structured-output mode is JSON mode:
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage
from tensormesh.types import ResponseFormat

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Respond with valid JSON.")],
        response_format=ResponseFormat(type="json_object"),
    )

print(completion.choices[0].message.content)
Structured-output caveats on this SDK surface:
  • response_format.type currently supports only json_object and text
  • JSON Schema-style response_format={"type": "json_schema", ...} is not supported on this surface
  • unsupported extra keys inside response_format are rejected explicitly by the SDK instead of being silently dropped
  • use client.inference.serverless.responses when you want the verified serverless responses endpoint instead of chat completions
  • if an upstream runtime leaks leading <think>...</think> blocks into assistant text, the SDK strips them from message.content and stores the extracted text in message.reasoning when possible; text_deltas() likewise suppresses those leaked blocks in streamed output
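Even with json_object mode constraining the output, defensive parsing of message.content is worthwhile. The helper below is a sketch (parse_json_reply is a hypothetical name, not an SDK function) that turns malformed output into a clear error:

```python
import json


def parse_json_reply(content: str) -> dict:
    """Parse a JSON-mode assistant reply, failing loudly on bad output."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {content!r}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object from json_object mode")
    return parsed


print(parse_json_reply('{"greeting": "hello"}'))  # {'greeting': 'hello'}
```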

Raw Responses

Use raw responses when you want the unwrapped HTTP payload instead of the parsed SDK model.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    raw_response = client.inference.serverless.chat.completions.with_raw_response.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )

print(raw_response.json())

Async Streaming Response Access

import asyncio

from tensormesh import AsyncTensormesh
from tensormesh.types import ChatMessage


async def main() -> None:
    async with AsyncTensormesh(
        inference_api_key="YOUR_INFERENCE_API_KEY",
    ) as client:
        serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
        raw_stream = await client.inference.serverless.chat.completions.with_streaming_response.create(
            model=serverless_model_name,
            messages=[ChatMessage(role="user", content="Stream a short reply.")],
            stream=True,
        )
        try:
            async for line in raw_stream.iter_lines():
                print(line)
        finally:
            await raw_stream.close()


asyncio.run(main())