The SDK exposes two inference surfaces with the same general shape:
client.inference.serverless.chat.completions
client.inference.serverless.models
client.inference.serverless.completions
client.inference.serverless.responses
client.inference.serverless.tokenize
client.inference.serverless.detokenize
client.inference.serverless.health
client.inference.serverless.version
client.inference.on_demand.chat.completions
client.inference.on_demand.models
client.inference.on_demand.completions
client.inference.on_demand.responses
client.inference.on_demand.tokenize
client.inference.on_demand.detokenize
client.inference.on_demand.health
client.inference.on_demand.version
Both surfaces expose the same eight namespaces: chat.completions, models, completions, responses, tokenize, detokenize, health, and version.
Model naming depends on the selected surface:
- serverless expects a serverless model name
- on-demand expects the served gateway model name, not the Control Plane modelId UUID
- gateway_model_id remains a local config compatibility key used by the CLI flow; its value is the served gateway model name string you send as model
If you are coming from the CLI-managed flow, gateway_api_key is the stored inference API key used by the SDK as inference_api_key.
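The naming rules above can be checked locally before a request is sent. The following is a minimal sketch and not part of the SDK: pick_model_name, looks_like_uuid, and the serverless_model_name config key are hypothetical names introduced for illustration; only gateway_model_id comes from the text above. It assumes that a UUID-shaped value on the on-demand surface means a Control Plane modelId was pasted by mistake.

```python
import uuid


def looks_like_uuid(value: str) -> bool:
    """Return True when value parses as a UUID (for example a Control Plane modelId)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def pick_model_name(surface: str, config: dict) -> str:
    """Hypothetical helper: choose the model string to send for a given surface."""
    if surface == "serverless":
        # "serverless_model_name" is an illustrative key, not a documented one.
        return config["serverless_model_name"]
    if surface == "on_demand":
        # gateway_model_id holds the served gateway model name string.
        name = config["gateway_model_id"]
        if looks_like_uuid(name):
            raise ValueError(
                "on-demand expects the served gateway model name, not a modelId UUID"
            )
        return name
    raise ValueError(f"unknown surface: {surface!r}")


if __name__ == "__main__":
    config = {
        "serverless_model_name": "example/serverless-model",
        "gateway_model_id": "example-served-model",
    }
    print(pick_model_name("serverless", config))
    print(pick_model_name("on_demand", config))
```

A check like this fails fast on the client side instead of waiting for the gateway to reject an unknown model string.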
Choosing A Model Name
- For serverless, choose a serverless model name that is valid for the selected host.
  - If you have Control Plane access for the same Tensormesh environment, discover published serverless models with tm billing pricing serverless list.
  - Use the returned pricing[].model value in your request.
  - If you only have inference credentials, or you are targeting a different serverless host override, ask your operator or admin for the exact serverless model string for that host before sending the request.
- For on-demand, use the served gateway model name.
  - If you are using the local operator flow, tm init --sync stores that served name as gateway_model_id, and tm --output json config show prints it.
If you do not already have a valid serverless model name for your target host, discover it with tm billing pricing serverless list for the same Tensormesh environment, or ask your operator or admin for the exact serverless model string first.
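If you have captured the pricing listing as JSON, the pricing[].model values can be pulled out with a few lines of Python. This is a sketch under assumptions: the top-level "pricing" list is inferred from the pricing[].model path mentioned above, the real CLI output may carry a different envelope, and serverless_model_names and the sample payload are hypothetical.

```python
import json


def serverless_model_names(payload_text: str) -> list:
    """Collect unique pricing[].model strings, preserving their order."""
    data = json.loads(payload_text)
    names = []
    for entry in data.get("pricing", []):
        model = entry.get("model")
        # Skip entries without a usable model string and avoid duplicates.
        if isinstance(model, str) and model and model not in names:
            names.append(model)
    return names


if __name__ == "__main__":
    # Made-up payload; the model names are placeholders, not real models.
    sample = json.dumps(
        {
            "pricing": [
                {"model": "example/model-a"},
                {"model": "example/model-b"},
                {"model": "example/model-a"},
            ]
        }
    )
    print(serverless_model_names(sample))
```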
Verified Serverless Endpoint Map
client.inference.serverless.chat.completions: OpenAI-compatible chat completions
client.inference.serverless.models: list models from the verified serverless host
client.inference.serverless.completions: text completions
client.inference.serverless.responses: responses API
client.inference.serverless.tokenize: tokenize text
client.inference.serverless.detokenize: convert token ids back to text
client.inference.serverless.health: health endpoint
client.inference.serverless.version: version endpoint
Serverless Chat Completions
Use a serverless model name here.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )
    print(completion.choices[0].message.content)
Serverless Model Listing
On the default public serverless host, model listing also works without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    models = client.inference.serverless.models.list()
    print(models.data[0].id)
Serverless Text Completions
from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.completions.create(
        model=serverless_model_name,
        prompt="Reply with ok.",
    )
    print(completion.choices[0].text)
Serverless Responses
from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    response = client.inference.serverless.responses.create(
        model=serverless_model_name,
        input="Say hello.",
    )
    print(response.output[0].content[0].text)
Tokenize And Detokenize
from tensormesh import Tensormesh

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tokens = client.inference.serverless.tokenize.create(
        model=serverless_model_name,
        prompt="Hello!",
    )
    prompt = client.inference.serverless.detokenize.create(
        model=serverless_model_name,
        tokens=tokens.tokens,
    )
    print(tokens.tokens)
    print(prompt.prompt)
Health And Version
On the default public serverless host, these routes also work without an inference API key.
from tensormesh import Tensormesh

with Tensormesh() as client:
    health = client.inference.serverless.health.get()
    version = client.inference.serverless.version.get()
    print(health.status)
    print(version.version)
On-Demand Chat Completions
Use the served gateway model name here, not the Control Plane modelId UUID.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    served_gateway_model_name = "YOUR_SERVED_GATEWAY_MODEL_NAME"
    completion = client.inference.on_demand.chat.completions.create(
        model=served_gateway_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )
    print(completion.choices[0].message.content)
On-Demand Models, Responses, And Utilities
Use the same routed On-Demand host and X-User-Id setup for the other endpoint namespaces.
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
    on_demand_base_url="https://YOUR_ON_DEMAND_BASE_URL",
    on_demand_user_id="00000000-0000-0000-0000-000000000000",
) as client:
    models = client.inference.on_demand.models.list()
    response = client.inference.on_demand.responses.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        input="Say hello.",
    )
    tokens = client.inference.on_demand.tokenize.create(
        model="YOUR_SERVED_GATEWAY_MODEL_NAME",
        prompt="Hello!",
    )
    print(models.data[0].id if models.data else "no models")
    print(response.output[0].content[0].text)
    print(tokens.tokens)
Streaming On Serverless
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Stream a short reply.")],
        stream=True,
    )
    for text in stream.text_deltas():
        print(text, end="")
The text-completions and responses endpoints also support raw SSE access:
from tensormesh import Tensormesh

with Tensormesh(
    inference_api_key="YOUR_INFERENCE_API_KEY",
) as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    stream = client.inference.serverless.completions.with_streaming_response.create(
        model=serverless_model_name,
        prompt="Stream a short reply.",
        stream=True,
    )
    try:
        for line in stream.iter_lines(decode_unicode=True):
            print(line)
    finally:
        stream.close()
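Raw SSE lines usually need a small amount of parsing before they are useful. The sketch below shows one way to turn such lines into text, assuming the stream follows the common OpenAI-compatible convention of data: lines carrying JSON chunks, a choices[].text field on text-completion chunks, and a final data: [DONE] sentinel. Those wire details are assumptions, not something this page confirms, and iter_sse_payloads and completion_text are hypothetical helpers. Verify the exact event shape against your deployment.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from 'data:' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # blank separators and SSE comment lines
        if not line.startswith("data:"):
            continue  # event:, id:, and retry: fields are ignored in this sketch
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def completion_text(lines: Iterable[str]) -> str:
    """Concatenate choices[].text across all chunks (assumed chunk shape)."""
    parts = []
    for payload in iter_sse_payloads(lines):
        for choice in payload.get("choices", []):
            parts.append(choice.get("text") or "")
    return "".join(parts)


if __name__ == "__main__":
    # Hand-written sample lines standing in for stream.iter_lines() output.
    sample_lines = [
        'data: {"choices": [{"text": "Hel"}]}',
        "",
        'data: {"choices": [{"text": "lo"}]}',
        "data: [DONE]",
    ]
    print(completion_text(sample_lines))
```

In the loop above that prints each line, the same generator could wrap stream.iter_lines(decode_unicode=True) instead of the sample list.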
Tool Calling
from tensormesh import Tensormesh
from tensormesh.types import ChatCompletionFunction
from tensormesh.types import ChatCompletionTool
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    tools = [
        ChatCompletionTool(
            function=ChatCompletionFunction(
                name="lookup_weather",
                description="Look up weather for a city.",
                parameters={
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string"
                        }
                    },
                    "required": ["city"],
                },
            ),
        )
    ]
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="What is the weather in Hanoi?")],
        tools=tools,
        tool_choice="auto",
    )
    choice = completion.choices[0]
    print(choice.message.content)
    print(choice.message.tool_calls)
Tool-calling caveats on this SDK surface:
- tool calling is documented on the chat-completions surface only
- text_deltas() is a text-oriented helper; use with_streaming_response if you need raw stream lines for richer event handling
- if your current OpenAI or Fireworks app depends on broader tool-stream semantics, verify the exact wire behavior against your target deployment before migrating
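Once a completion returns tool calls, the application still has to run the matching function itself. The following is a minimal dispatch sketch, assuming each tool call can be viewed as an OpenAI-compatible dict with id, function.name, and a JSON-encoded function.arguments string. The SDK's typed objects may expose these as attributes instead, and this page does not confirm how tool results are sent back, so the returned role="tool" message shape is also an assumption. TOOL_REGISTRY and run_tool_call are hypothetical names.

```python
import json


def lookup_weather(city: str) -> str:
    """Stand-in implementation; a real tool would query a weather service."""
    return f"No live weather data for {city} in this sketch."


# Maps tool names declared in the request to local Python callables.
TOOL_REGISTRY = {"lookup_weather": lookup_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build a tool-result message (assumed shape)."""
    function = tool_call["function"]
    name = function["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested an unknown tool: {name!r}")
    # Arguments arrive as a JSON string in the OpenAI-compatible convention.
    arguments = json.loads(function.get("arguments") or "{}")
    result = TOOL_REGISTRY[name](**arguments)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}


if __name__ == "__main__":
    # Hand-written example call; a real one would come from choice.message.tool_calls.
    example_call = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "lookup_weather", "arguments": '{"city": "Hanoi"}'},
    }
    print(run_tool_call(example_call))
```

Keeping the registry explicit means a model-supplied tool name can never reach an arbitrary function.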
Structured Output
The currently documented structured-output mode is JSON mode:
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage
from tensormesh.types import ResponseFormat

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    completion = client.inference.serverless.chat.completions.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Respond with valid JSON.")],
        response_format=ResponseFormat(type="json_object"),
    )
    print(completion.choices[0].message.content)
Structured-output caveats on this SDK surface:
- response_format.type currently supports only json_object and text
- JSON Schema-style response_format={"type": "json_schema", ...} is not supported on this surface
- unsupported extra keys inside response_format are rejected explicitly by the SDK instead of being silently dropped
- use client.inference.serverless.responses when you want the verified serverless responses endpoint instead of chat completions
- if an upstream runtime leaks leading <think>...</think> blocks into assistant text, the SDK strips them from message.content, stores the extracted text in message.reasoning when possible, and text_deltas() suppresses those leaked blocks in streamed text output
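To make the last caveat concrete, here is an illustrative sketch of separating a leading <think>...</think> block from assistant text. It is not the SDK's implementation. It could be useful when you read raw responses yourself, assuming the stripping described above applies only to parsed SDK models. split_leading_think is a hypothetical helper, and it handles only a single, well-formed leading block.

```python
import re
from typing import Optional, Tuple

# Matches one <think>...</think> block at the very start of the text,
# allowing surrounding whitespace; DOTALL lets the block span lines.
_LEADING_THINK = re.compile(r"\A\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_leading_think(text: str) -> Tuple[str, Optional[str]]:
    """Return (content, reasoning); reasoning is None when no leading block exists."""
    match = _LEADING_THINK.match(text)
    if match is None:
        return text, None
    return text[match.end():], match.group(1).strip()


if __name__ == "__main__":
    content, reasoning = split_leading_think("<think>plan the reply</think>\nHello!")
    print(repr(content))
    print(repr(reasoning))
```

Text with a think block that is not at the start is returned unchanged, which mirrors the "leading" qualifier in the caveat.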
Raw Responses
Use raw responses when you want the unwrapped HTTP payload instead of the parsed SDK model.
from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_INFERENCE_API_KEY") as client:
    serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
    raw_response = client.inference.serverless.chat.completions.with_raw_response.create(
        model=serverless_model_name,
        messages=[ChatMessage(role="user", content="Say hello.")],
    )
    print(raw_response.json())
Async Streaming Response Access
import asyncio

from tensormesh import AsyncTensormesh
from tensormesh.types import ChatMessage


async def main() -> None:
    async with AsyncTensormesh(
        inference_api_key="YOUR_INFERENCE_API_KEY",
    ) as client:
        serverless_model_name = "YOUR_SERVERLESS_MODEL_NAME"
        raw_stream = await client.inference.serverless.chat.completions.with_streaming_response.create(
            model=serverless_model_name,
            messages=[ChatMessage(role="user", content="Stream a short reply.")],
            stream=True,
        )
        try:
            async for line in raw_stream.iter_lines():
                print(line)
        finally:
            await raw_stream.close()


asyncio.run(main())