Serverless Inference - Tensormesh User Documentation

Serverless inference lets you call any supported model through a simple API — no GPU provisioning, no deployments, no setup. Pay per token and start immediately. Navigate to Deploy → Serverless from the sidebar to browse models and get your API key.

How It Works

Browse Models

View all available serverless models in the model catalog. Each card shows the model name, family, context window, and per-token pricing.

Filter & Select

Use the search bar, capability filters (Coding, Reasoning, Agentic, Tool Use, Chat), or use case filters (Production APIs, Automation, Low-Latency, Research, etc.) to find the right model for your workload.

Review Details

Select a model to view its full specifications — parameter count, architecture, context window, capabilities, and detailed pricing breakdown.

Call the API

Use the provided code examples (cURL, Python, SDK, or CLI) with your API key to start sending requests immediately. No deployment step required.

Pricing

Serverless models use pay-per-token pricing, shown on each model card on the Deploy → Serverless page as a rate per 1M tokens. Input and output tokens are charged at per-model rates. Cached tokens cost $0.00: when Tensormesh serves a token from its KV cache, you are not charged for it. For a full breakdown, see Pricing Overview.

Track your token usage and costs at any time under Operations → Serverless Usage. See Serverless Usage for the full per-model breakdown.
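
As a worked example of the per-token pricing described above, the sketch below estimates the cost of a single request. The rates and token counts are made-up placeholders, not Tensormesh prices, and the estimate_cost helper is hypothetical. It also assumes that cached tokens are a subset of the input tokens, which this page does not state explicitly.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int,
    input_rate_per_1m: float,
    output_rate_per_1m: float,
) -> float:
    """Estimate the cost in dollars of one request under pay-per-token pricing.

    Assumption: cached_tokens counts input tokens served from the KV cache,
    which are billed at $0.00, so only the remaining input tokens are charged.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    billable_input = input_tokens - cached_tokens
    input_cost = billable_input * input_rate_per_1m / 1_000_000
    output_cost = output_tokens * output_rate_per_1m / 1_000_000
    return input_cost + output_cost


# Placeholder rates: $0.50 per 1M input tokens, $2.00 per 1M output tokens.
cost = estimate_cost(
    input_tokens=12_000,
    output_tokens=1_500,
    cached_tokens=8_000,
    input_rate_per_1m=0.50,
    output_rate_per_1m=2.00,
)
print(f"${cost:.4f}")  # prints $0.0050
```

Here 4,000 billable input tokens cost $0.002 and 1,500 output tokens cost $0.003; the 8,000 cached tokens add nothing.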

Choosing the Right Model

Reasoning & Agents

Look for large mixture-of-experts (MoE) models with long context windows. Higher parameter counts generally mean stronger reasoning and tool use for complex agentic workflows.

Coding Agents

Coding-specialized models with large context windows are best for multi-file tasks, code review, and agentic workflows. Smaller coding models offer a faster, cheaper alternative for simpler tasks.

Low-Latency & High-Throughput

Compact models are the fastest and most cost-effective, which makes them ideal for real-time assistants, chatbots, and high-volume API services where speed matters more than raw capability.

Long-Context & Document Processing

For workflows that process large documents, codebases, or long conversation histories, prioritize models with the largest context windows available in the catalog.
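
The guidance in this section can be read as a filter followed by a sort. The sketch below applies that idea to a hand-written list. The model names, context sizes, prices, and the pick_model helper are all hypothetical placeholders; real values come from the model catalog on the Deploy → Serverless page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One model card, reduced to the fields used for selection (hypothetical)."""
    name: str
    capabilities: frozenset
    context_window: int          # tokens
    output_rate_per_1m: float    # dollars per 1M output tokens


# Placeholder entries for illustration only; not real Tensormesh models or prices.
CATALOG = [
    CatalogEntry("example-large-moe",
                 frozenset({"Reasoning", "Agentic", "Tool Use"}), 256_000, 3.00),
    CatalogEntry("example-coder",
                 frozenset({"Coding", "Agentic"}), 128_000, 1.20),
    CatalogEntry("example-compact",
                 frozenset({"Chat"}), 32_000, 0.20),
]


def pick_model(catalog, capability, min_context):
    """Return the cheapest entry with the capability and enough context, or None."""
    candidates = [
        m for m in catalog
        if capability in m.capabilities and m.context_window >= min_context
    ]
    return min(candidates, key=lambda m: m.output_rate_per_1m, default=None)


print(pick_model(CATALOG, "Agentic", 100_000).name)  # example-coder
print(pick_model(CATALOG, "Chat", 64_000))           # None
```

Sorting by price is only one possible tie-breaker; a latency-sensitive workload might instead prefer the smallest model that passes the filter.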

Quick Start

The same request is shown three ways below: cURL, Python (requests), and the Tensormesh Python SDK.

curl --request POST \
  --url https://serverless.tensormesh.ai/v1/chat/completions \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "MODEL_NAME",
  "max_tokens": 16384,
  "temperature": 0.6,
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "top_p": 1,
  "top_k": 40,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
'

import requests

url = "https://serverless.tensormesh.ai/v1/chat/completions"

payload = {
    "model": "MODEL_NAME",
    "max_tokens": 16384,
    "temperature": 0.6,
    "messages": [
        {
            "role": "user",
            "content": "Hello, how are you?"
        }
    ],
    "top_p": 1,
    "top_k": 40,
    "presence_penalty": 0,
    "frequency_penalty": 0
}
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)
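
The Python example above prints the raw response body. As a sketch of what to do with that body, the helper below pulls out the reply text and token counts. It assumes the JSON follows the OpenAI chat completion shape (choices[0].message.content plus a usage object with prompt_tokens and completion_tokens), which is inferred from the API being OpenAI-compatible rather than confirmed on this page. The summarize_completion name and the sample body are hypothetical.

```python
import json


def summarize_completion(body: str) -> dict:
    """Extract the reply text and token counts from a chat completion body.

    Assumption: the JSON uses the OpenAI chat completion field names.
    """
    data = json.loads(body)
    usage = data.get("usage") or {}
    return {
        "reply": data["choices"][0]["message"]["content"],
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
    }


# Hand-written sample body for illustration; a real call would pass response.text.
sample_body = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "Doing well, thanks!"}}
    ],
    "usage": {"prompt_tokens": 13, "completion_tokens": 6},
})
print(summarize_completion(sample_body))
```

The token counts are read with .get so that a body without a usage object yields None values instead of raising.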

Install the Tensormesh Python SDK (pip install tensormesh), then:

from tensormesh import Tensormesh
from tensormesh.types import ChatMessage

with Tensormesh(inference_api_key="YOUR_API_KEY") as client:
    completion = client.inference.serverless.chat.completions.create(
        model="MODEL_NAME",
        messages=[
            ChatMessage(role="user", content="Hello, how are you?"),
        ],
    )

print(completion.choices[0].message.content)

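The serverless API is OpenAI-compatible. You can use any OpenAI SDK or client library by pointing its base URL at https://serverless.tensormesh.ai and supplying your Tensormesh API key. Clients that append only the route (such as /chat/completions) to the base URL, as the official OpenAI SDKs do, likely need the /v1 prefix included: https://serverless.tensormesh.ai/v1.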
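
As a minimal sketch of the OpenAI-compatible route, the example below configures the official OpenAI Python package (pip install openai) to talk to the serverless endpoint. The base URL includes /v1 on the assumption that the SDK appends /chat/completions to it, matching the endpoint used in the cURL example. The TENSORMESH_API_KEY environment variable and the client_settings and ask helpers are hypothetical names chosen for illustration.

```python
import os

# Assumption: the OpenAI SDK appends "/chat/completions" to base_url, so the
# "/v1" prefix from the endpoint shown in Quick Start is included here.
TENSORMESH_BASE_URL = "https://serverless.tensormesh.ai/v1"


def client_settings(api_key: str) -> dict:
    """Keyword arguments for openai.OpenAI() that target the serverless API."""
    return {"base_url": TENSORMESH_BASE_URL, "api_key": api_key}


def ask(prompt: str, model: str = "MODEL_NAME") -> str:
    """Send one user message and return the assistant's reply text."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    # TENSORMESH_API_KEY is a hypothetical variable name for your API key.
    client = OpenAI(**client_settings(os.environ["TENSORMESH_API_KEY"]))
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```

With the key exported in the environment and MODEL_NAME replaced by a model from the catalog, print(ask("Hello, how are you?")) sends the same request as the examples above.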
