Glossary - Tensormesh User Documentation

New to AI inference or just new to Tensormesh? This glossary explains the key concepts in plain language, no background required.

The Basics

Term	Meaning
LLM (Large Language Model)	The type of AI model Tensormesh runs. LLMs are trained on massive amounts of text and can generate, summarize, translate, explain, and reason about language. Examples include GPT, Qwen, and MiniMax.
Prompt	Everything you send to the model before it starts responding. Usually a system message that sets the context, followed by one or more user messages.
System Prompt	An instruction at the start of your request that tells the model how to behave — its persona, task, tone, or constraints. Keeping it identical across requests is the most effective way to improve your cache hit rate.
Completion	The model’s response to your prompt. Also called the output or generation. You pay output token rates for completions.
Token	The unit LLMs use to read and write text. Your bill is calculated in tokens: you pay for tokens in (your prompt) and tokens out (the model’s reply).
Input Token	Any token in the message you send to the model — system prompt, conversation history, user message, or pasted document. Billed at the per-model input rate, unless served from cache (free).
Output Token	Any token the model generates in its reply. Always freshly computed and billed at the per-model output rate.
Context Window	The maximum number of tokens a model can process in a single request, including both what you send and what it generates. A 131K context window fits roughly 100,000 words.

Caching

Term	Meaning
KV Cache (Key-Value Cache)	When a model processes tokens, it computes internal data structures called key-value pairs. Tensormesh saves these so that if the same tokens appear at the start of a later request, the model can skip that work entirely, answer faster, and you don’t pay for those tokens.
Cached Token	An input token served from the KV cache instead of being recomputed. Cached tokens are always $0. The more your requests share the same opening text, the more tokens hit the cache.
Prompt Prefix	The shared opening portion of your prompt — typically the system message and any static context. This is what the KV cache matches on. Even a single character difference at the start creates a cache miss.
Cache Hit Rate	The percentage of your input tokens served from cache rather than computed fresh. A 60% cache hit rate means 60% of your input tokens cost nothing.
LMCache	The open-source KV cache engine powering Tensormesh’s persistent caching layer. When requests share a prefix, LMCache lets later requests reuse the earlier request’s KV tensors instead of recomputing them.
External Storage	A persistent cache bucket that keeps your context warm across sessions. Unlike the default in-memory cache (which resets each session), External Storage persists KV cache entries so returning workloads hit the cache more often. Plans are Bronze, Silver, and Gold.

Inference

Term	Meaning
Serverless Inference	Calling a model via API with no infrastructure to manage. You send a request, Tensormesh routes it to available GPUs, and you get a response. Pay per token, no provisioning or idle costs.
Reserved Deployment	A dedicated GPU cluster that’s exclusively yours. Useful when you need guaranteed capacity, consistent low latency, or strict data isolation.
TTFT (Time to First Token)	How long it takes from when you send a request to when you receive the first token of the model’s reply. Cached prefixes reduce TTFT because the model skips recomputing tokens it’s already seen.
Streaming	Receiving the model’s response token by token as it’s generated, rather than waiting for the full reply. Enable it by adding `"stream": true` to your request.
MoE (Mixture of Experts)	A model architecture where only a fraction of the model’s parameters are activated per token. For example, a 397B MoE model may only activate ~17B parameters per token — making it fast and cost-effective despite its large total size.

API & Platform

Term	Meaning
OpenAI-Compatible API	Tensormesh’s API uses the same format as OpenAI’s. Set `base_url` to `https://serverless.tensormesh.ai`, swap in your Tensormesh API key, and nothing else changes.
API Key	A secret token included in every request so Tensormesh knows who you are. Generate one under Management → Account → API Keys. Treat it like a password.

Additional Resources

⌘I