Use this section when you call Tensormesh inference directly over HTTP rather than through the Python SDK or the tm CLI. Direct inference callers should expect HTTP 429 rate-limit responses on busy surfaces, honor the Retry-After header when present, and avoid automatic retries around non-idempotent writes unless duplicate effects are acceptable.
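The retry guidance above can be sketched in Python using only the standard library. This is a minimal, non-authoritative sketch: the backoff constants are arbitrary choices, and it handles only the numeric (delay-seconds) form of Retry-After, falling back to exponential backoff otherwise.

```python
import time
import urllib.request
import urllib.error

def retry_delay(retry_after, attempt, base=1.0):
    """Seconds to wait before the next attempt: honor a numeric
    Retry-After header when present, else exponential backoff."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After; fall back to backoff
    return base * (2 ** attempt)

def get_with_retries(request, max_attempts=3):
    """Retry only 429 responses. Intended for idempotent GETs such as
    /v1/models; do not wrap non-idempotent writes unless duplicate
    effects are acceptable."""
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(retry_delay(err.headers.get("Retry-After"), attempt))
```

Keeping the delay computation in its own function makes the rate-limit policy easy to unit-test without a live endpoint.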

Choose A Surface

  • Serverless: OpenAI-compatible chat completions plus verified models, completions, responses, tokenize, detokenize, health, and version endpoints on the public serverless host.
  • On-Demand: Tensormesh-hosted routed inference endpoints for a specific deployment. This path requires a provider-specific external host, X-User-Id, and a served gateway model name.
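The header requirements of the two surfaces can be captured in small helpers. A minimal sketch, assuming Python; the function names are illustrative, not part of any Tensormesh SDK.

```python
def serverless_headers(api_key):
    """Serverless POST routes authenticate with a bearer API key."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

def on_demand_headers(api_key, user_id):
    """On-demand routed inference additionally requires X-User-Id."""
    headers = serverless_headers(api_key)
    headers["X-User-Id"] = user_id
    return headers
```

Building the on-demand headers on top of the serverless ones keeps the only difference, the X-User-Id routing header, in one obvious place.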

Quick Comparison

| Surface | Best for | Auth | Host selection | Extra routing |
| --- | --- | --- | --- | --- |
| Serverless | Fastest OpenAI-style requests | Authorization: Bearer <API_KEY> for POST routes; GET /v1/models, /health, and /version also work on the public host without auth | Public serverless host | None |
| On-Demand | Requests against a specific deployed model | Authorization: Bearer <API_KEY> | Provider-specific external host for the deployment | X-User-Id: <uuid> |
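A serverless request per the comparison above can be assembled as follows. This sketch builds the request without sending it; the host in SERVERLESS_HOST is a placeholder, not the real public serverless host, and the model name is whatever the models endpoint reports for your account.

```python
import json
import urllib.request

# Placeholder host; substitute the actual public serverless host.
SERVERLESS_HOST = "https://serverless.example.com"

def chat_completions_request(api_key, model, messages):
    """Build (but do not send) an OpenAI-style chat completions POST."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{SERVERLESS_HOST}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The same shape works for the on-demand surface by swapping in the deployment's provider-specific external host, the served gateway model name, and adding the X-User-Id header.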

Start Here

If you need management APIs for users, models, billing, support, logs, or metrics, use the Control Plane API tab instead of this section.