Accessing Metrics
Navigate to Operations → Dashboard → Select a Deployment → Metrics Tab to view performance data.
Time Range Selection
Metrics can be viewed across different time windows:
- Last Hour — Real-time monitoring for immediate performance insights
- Last Day — Daily trends and patterns
- Last Week — Longer-term performance analysis
Key Performance Metrics
Cache Hit Rate
What it measures
Percentage of requests that retrieve precomputed results from the KV cache rather than recomputing them.
Why it matters
Higher cache hit rates translate directly to reduced computation costs and improved performance. This is Tensormesh’s primary cost-saving mechanism. A cache hit occurs when an input segment (a reused context or prefix) already exists in the KV cache backend and can be retrieved without recomputation.
Interpretation
- High cache hit rate (above 70%): Excellent - you’re maximizing cost efficiency
- Moderate cache hit rate (30-70%): Good - there’s room for optimization through prompt engineering
- Low cache hit rate (below 30%): Consider reviewing request patterns for potential caching opportunities
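The rate and its interpretation bands can be sketched as a small helper. `classify_cache_hit_rate` and its arguments are illustrative names, not part of any Tensormesh API:

```python
def classify_cache_hit_rate(hits: int, total_requests: int) -> str:
    """Compute the cache hit rate (%) and map it to the bands above.

    Illustrative helper: band names and thresholds follow the list in
    this section; the function itself is not a Tensormesh API.
    """
    if total_requests == 0:
        return "no traffic"
    rate = hits / total_requests * 100
    if rate > 70:
        return "high"       # maximizing cost efficiency
    if rate >= 30:
        return "moderate"   # room for prompt-engineering optimization
    return "low"            # review request patterns for caching opportunities
```

For example, 80 hits out of 100 requests lands in the "high" band, while 10 out of 100 is "low".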
Time to First Token (TTFT)
What it measures
The time interval from when a user query arrives until the first output token is generated.
Why it matters
TTFT directly impacts perceived responsiveness in interactive applications like chatbots and document analysis tools. Lower TTFT creates a better user experience.
Typical ranges
- Sub-second: Cached queries with high cache hit rates
- 1-3 seconds: New queries depending on model size and input length
- Above 3 seconds: Large models with long input contexts
Optimization tips
- Leverage caching for frequently used contexts
- Consider smaller models for latency-sensitive applications
- Use streaming to display partial results early
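TTFT can also be measured client-side against any streaming iterator of output tokens. `measure_ttft` below is a generic sketch, not a specific client API:

```python
import time

def measure_ttft(token_stream):
    """Seconds from request start until the first token arrives.

    `token_stream` is any iterator yielding output tokens (a hypothetical
    stand-in for your client's streaming response). Returns None if the
    stream produces no tokens at all.
    """
    start = time.monotonic()  # monotonic clock: immune to wall-clock jumps
    first_token = next(token_stream, None)
    if first_token is None:
        return None
    return time.monotonic() - start
```

Start the clock before issuing the request in a real client, so queuing time is included in the measurement.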
Inter-Token Latency (ITL)
What it measures
Average delay between consecutive output tokens during generation.
Why it matters
Lower ITL means responses stream to users more quickly, creating smoother, more natural interactions.
Focus area
ITL reflects the decoding phase efficiency of LLM inference. Consistent, low ITL values ensure smooth streaming output.
What to watch for
- Spikes in ITL: May indicate GPU memory pressure or competing workloads
- Consistent low ITL: Indicates healthy decoding performance
- Gradual increases: Could signal need to scale replicas
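Given per-token arrival timestamps, average ITL is just the mean gap between consecutive tokens. This helper is illustrative, not a dashboard API:

```python
def mean_inter_token_latency(arrival_times):
    """Average gap (seconds) between consecutive token arrival timestamps.

    `arrival_times` is a list of monotonically increasing timestamps, one
    per generated token. At least two tokens are needed to form a gap;
    returns None otherwise.
    """
    if len(arrival_times) < 2:
        return None
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return sum(gaps) / len(gaps)
```

Tracking the maximum gap alongside the mean is a cheap way to surface the ITL spikes mentioned above, which an average can mask.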
Input Throughput
What it measures
Rate at which the system processes incoming requests during the prefill phase, commonly measured in Queries Per Second (QPS).
Why it matters
High input throughput is critical for scaling LLM services to handle many simultaneous users or long input contexts. The prefill computation is typically the primary bottleneck.
Optimization
Tensormesh’s caching significantly improves input throughput by reusing cached prefixes, reducing the computational load during the prefill phase.
Scaling considerations
- Low throughput with high demand: Consider adding more replicas
- Variable throughput: Review request patterns and cache effectiveness
- Consistently high throughput: Your deployment is well-sized for current load
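QPS over a sliding window can be tracked with a simple rolling counter. `QPSCounter` is a sketch for building intuition, not how the dashboard computes the metric:

```python
from collections import deque

class QPSCounter:
    """Rolling queries-per-second counter over a fixed window (illustrative)."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self.arrivals = deque()  # timestamps of requests inside the window

    def record(self, t: float) -> None:
        """Record a request arriving at time t and drop expired entries."""
        self.arrivals.append(t)
        while self.arrivals and self.arrivals[0] <= t - self.window_s:
            self.arrivals.popleft()

    def qps(self) -> float:
        """Requests in the current window, normalized to per-second."""
        return len(self.arrivals) / self.window_s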
Output Throughput
What it measures
Rate at which the system generates tokens, measured in tokens per second.
Why it matters
Higher output throughput means complete responses are generated faster, improving overall system capacity and user satisfaction.
Focus area
Tied to the efficiency of the decoding phase in the inference process.
Performance indicators
- Stable output throughput: Indicates consistent model performance
- Declining throughput: May signal resource constraints or increased load
- High variability: Could indicate batch size or scheduling inefficiencies
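One way to quantify "high variability" is the coefficient of variation over recent throughput samples. The helper name and any alert threshold you attach to it are assumptions:

```python
from statistics import mean, pstdev

def throughput_variability(samples):
    """Coefficient of variation (stddev / mean) of throughput samples.

    Values near 0 mean stable throughput; larger values suggest batch-size
    or scheduling inefficiencies. The exact threshold at which to alert is
    workload-specific and up to you.
    """
    m = mean(samples)
    return pstdev(samples) / m if m else float("inf")
```

Using a relative measure rather than raw standard deviation lets the same threshold work across deployments with very different absolute token rates.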
GPU Compute Utilization
What it measures
Percentage of GPU compute resources being actively used for model inference.
Why it matters
GPU utilization helps you understand resource efficiency and identify potential bottlenecks or over-provisioning.
Optimal ranges
- 60-85%: Healthy utilization with headroom for traffic spikes
- 85-95%: Near-optimal usage, monitor for saturation
- Above 95%: At capacity, consider scaling up
- Below 40%: Potentially over-provisioned, consider scaling down
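These bands map directly onto a small classifier. Note the list above leaves 40-60% unlabeled; calling it "below healthy range" here is our assumption, not an official band:

```python
def classify_gpu_utilization(pct: float) -> str:
    """Map GPU compute utilization (%) to the bands above (sketch).

    The 40-60% range is not named in the documentation's list, so it is
    labeled "below healthy range" here as an assumption.
    """
    if pct > 95:
        return "at capacity"                 # consider scaling up
    if pct >= 85:
        return "near-optimal"                # monitor for saturation
    if pct >= 60:
        return "healthy"                     # headroom for traffic spikes
    if pct >= 40:
        return "below healthy range"         # assumed label for the gap
    return "potentially over-provisioned"    # consider scaling down
```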
KV Cache Usage Ratio
What it measures
KV cache utilization relative to total available cache capacity.
Why it matters
Key-value cache utilization is critical for managing long-context requests and optimizing memory usage.
What to monitor
- Low usage (below 50%): Cache is underutilized, consider caching more aggressively
- Moderate usage (50-80%): Healthy cache utilization
- High usage (above 80%): Approaching capacity limits, monitor eviction rates
- Near capacity (above 95%): May experience cache thrashing or evictions
Optimization strategies
- Analyze request patterns to identify cacheable contexts
- Adjust cache eviction policies if needed
- Scale cache capacity for high-reuse workloads
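The usage bands above can likewise be expressed as a classifier; `classify_kv_cache_usage` and its byte-based arguments are illustrative, since the dashboard reports the ratio directly:

```python
def classify_kv_cache_usage(used_bytes: int, capacity_bytes: int) -> str:
    """Map the KV cache usage ratio (%) to the monitoring bands above.

    Illustrative helper: the inputs could equally be block or token counts,
    whatever unit your cache backend reports.
    """
    ratio = used_bytes / capacity_bytes * 100
    if ratio > 95:
        return "near capacity"  # watch for cache thrashing or evictions
    if ratio > 80:
        return "high"           # monitor eviction rates
    if ratio >= 50:
        return "moderate"       # healthy utilization
    return "low"                # consider caching more aggressively
```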
Understanding Metric Relationships
Performance metrics are interconnected. Understanding these relationships helps you make informed optimization decisions.
Cache Hit Rate Impact
When cache hit rate increases:
- TTFT decreases (faster first token delivery)
- Input throughput increases (more requests processed efficiently)
- GPU utilization decreases (less computation needed)
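The TTFT effect can be approximated with a back-of-envelope blend of cached and uncached latencies. `expected_ttft` is an illustrative model, not a measured relationship:

```python
def expected_ttft(hit_rate: float, ttft_cached_s: float, ttft_uncached_s: float) -> float:
    """Blended TTFT: hits pay the cached latency, misses pay the uncached one.

    A simplification that ignores queuing effects; useful only for rough
    what-if estimates as the hit rate changes.
    """
    return hit_rate * ttft_cached_s + (1 - hit_rate) * ttft_uncached_s
```

For example, with a 0.5 s cached and 2.5 s uncached TTFT, raising the hit rate from 30% to 70% drops expected TTFT from 1.9 s to 1.1 s.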
Throughput and Latency Trade-offs
Consider these performance trade-offs:
- Higher input throughput may increase TTFT during peak load due to queuing
- Increased GPU utilization may lead to higher ITL from resource contention
- Adding replicas improves throughput and reduces individual request latency
Troubleshooting Common Issues
High TTFT (above 5 seconds)
Possible causes:
- Low cache hit rate
- Large input contexts
- GPU resource saturation
- Network latency
Solutions:
- Optimize prompts for better caching
- Add more replicas to distribute load
- Consider smaller models for latency-sensitive use cases
Low Cache Hit Rate (below 20%)
Possible causes:
- Highly variable input patterns
- No shared prompt prefixes
- Cache capacity issues
Solutions:
- Structure prompts with consistent system messages
- Implement prompt templates for common queries
- Review cache eviction policies
Inconsistent ITL
Possible causes:
- GPU memory pressure
- Competing workloads
- Network instability
Solutions:
- Monitor GPU utilization for saturation
- Check for other deployments on shared resources
- Review network connectivity and latency
Low Throughput
Possible causes:
- Insufficient replicas
- Long generation times
- Batch size configuration
Solutions:
- Scale up replica count
- Optimize generation parameters (temperature, max_tokens)
- Review batch processing configuration

