Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.tensormesh.ai/llms.txt

Use this file to discover all available pages before exploring further.

New to AI inference or just new to Tensormesh? This glossary explains the key concepts in plain language, no background required.

The Basics

TermMeaning
LLM (Large Language Model)The type of AI model Tensormesh runs. LLMs are trained on massive amounts of text and can generate, summarize, translate, explain, and reason about language. Examples include GPT, Qwen, and MiniMax.
PromptEverything you send to the model before it starts responding. Usually a system message that sets the context, followed by one or more user messages.
System PromptAn instruction at the start of your request that tells the model how to behave — its persona, task, tone, or constraints. Keeping it identical across requests is the most effective way to improve your cache hit rate.
CompletionThe model’s response to your prompt. Also called the output or generation. You pay output token rates for completions.
TokenThe unit LLMs use to read and write text. Your bill is calculated in tokens: you pay for tokens in (your prompt) and tokens out (the model’s reply).
Input TokenAny token in the message you send to the model — system prompt, conversation history, user message, or pasted document. Billed at the per-model input rate, unless served from cache (free).
Output TokenAny token the model generates in its reply. Always freshly computed and billed at the per-model output rate.
Context WindowThe maximum number of tokens a model can process in a single request, including both what you send and what it generates. A 131K context window fits roughly 100,000 words.

Caching

TermMeaning
KV Cache (Key-Value Cache)When a model processes tokens, it computes internal data structures called key-value pairs. Tensormesh saves these so that if the same tokens appear at the start of a later request, the model can skip that work entirely, answer faster, and you don’t pay for those tokens.
Cached TokenAn input token served from the KV cache instead of being recomputed. Cached tokens are always $0. The more your requests share the same opening text, the more tokens hit the cache.
Prompt PrefixThe shared opening portion of your prompt — typically the system message and any static context. This is what the KV cache matches on. Even a single character difference at the start creates a cache miss.
Cache Hit RateThe percentage of your input tokens served from cache rather than computed fresh. A 60% cache hit rate means 60% of your input tokens cost nothing.
LMCacheThe open-source KV cache engine powering Tensormesh’s persistent caching layer. When requests share a prefix, LMCache lets later requests reuse the earlier request’s KV tensors instead of recomputing them.
External StorageA persistent cache bucket that keeps your context warm across sessions. Unlike the default in-memory cache (which resets each session), External Storage persists KV cache entries so returning workloads hit the cache more often. Plans are Bronze, Silver, and Gold.

Inference

TermMeaning
Serverless InferenceCalling a model via API with no infrastructure to manage. You send a request, Tensormesh routes it to available GPUs, and you get a response. Pay per token, no provisioning or idle costs.
Reserved DeploymentA dedicated GPU cluster that’s exclusively yours. Useful when you need guaranteed capacity, consistent low latency, or strict data isolation.
TTFT (Time to First Token)How long it takes from when you send a request to when you receive the first token of the model’s reply. Cached prefixes reduce TTFT because the model skips recomputing tokens it’s already seen.
StreamingReceiving the model’s response token by token as it’s generated, rather than waiting for the full reply. Enable it by adding "stream": true to your request.
MoE (Mixture of Experts)A model architecture where only a fraction of the model’s parameters are activated per token. For example, a 397B MoE model may only activate ~17B parameters per token — making it fast and cost-effective despite its large total size.

API & Platform

TermMeaning
OpenAI-Compatible APITensormesh’s API uses the same format as OpenAI’s. Set base_url to https://serverless.tensormesh.ai, swap in your Tensormesh API key, and nothing else changes.
API KeyA secret token included in every request so Tensormesh knows who you are. Generate one under Management → Account → API Keys. Treat it like a password.