The Metrics tab provides critical insights into your LLM deployment’s efficiency and responsiveness. Understanding these metrics helps you optimize costs, improve user experience, and scale effectively.

Accessing Metrics

Navigate to Operations → Dashboard → Select a Deployment → Metrics Tab to view performance data.

Time Range Selection

Metrics can be viewed across different time windows:
  • Last Hour — Real-time monitoring for immediate performance insights
  • Last Day — Daily trends and patterns
  • Last Week — Longer-term performance analysis

Key Performance Metrics

Cache Hit Rate

What it measures

Percentage of requests that retrieve precomputed results from the KV cache rather than recomputing them.

Why it matters

Higher cache hit rates translate directly to reduced computation costs and improved performance. This is Tensormesh’s primary cost-saving mechanism. A cache hit occurs when an input segment (a reused context or prefix) already exists in the KV cache backend and can be retrieved without recomputation.

Interpretation

  • High cache hit rate (above 70%): Excellent — you’re maximizing cost efficiency
  • Moderate cache hit rate (30–70%): Good — there’s room for optimization through prompt engineering
  • Low cache hit rate (below 30%): Consider reviewing request patterns for potential caching opportunities
Maximize cache hits by structuring prompts with consistent prefixes or system messages that can be reused across requests.
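The prefix-reuse idea can be sketched in Python. The chat-message structure below is a generic format, and the system prompt and few-shot example are hypothetical placeholders; the point is that the prefix stays byte-identical across requests so the cached KV entries for it can be reused:

```python
# Keep the system message and examples byte-identical across requests;
# only the final user turn varies, so prefill for the shared prefix
# can be served from the KV cache.
SHARED_PREFIX = [
    {"role": "system", "content": "You are a support assistant for Acme Corp."},
    {"role": "user", "content": "Example: How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset."},
]

def build_messages(user_query: str) -> list[dict]:
    """Reuse the identical prefix; append only the variable user turn."""
    return SHARED_PREFIX + [{"role": "user", "content": user_query}]

m1 = build_messages("How do I change my email?")
m2 = build_messages("Where can I download invoices?")
# The first len(SHARED_PREFIX) messages match exactly, so the second
# request's prefix is a cache hit candidate.
assert m1[:len(SHARED_PREFIX)] == m2[:len(SHARED_PREFIX)]
```

Avoid injecting per-request data (timestamps, request IDs) into the shared prefix, since any byte difference breaks prefix reuse.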

Time to First Token (TTFT)

What it measures

The time interval from when a user query arrives until the first output token is generated.

Why it matters

TTFT directly impacts perceived responsiveness in interactive applications like chatbots and document analysis tools. Lower TTFT creates a better user experience.

Typical range

  • Sub-second: Cached queries with high cache hit rates
  • 1-3 seconds: New queries depending on model size and input length
  • Above 3 seconds: Large models with long input contexts

Optimization tips

  • Leverage caching for frequently used contexts
  • Consider smaller models for latency-sensitive applications
  • Use streaming to display partial results early
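The streaming tip can be illustrated with a minimal sketch: measure TTFT as the gap between sending the request and receiving the first token, and render tokens as they arrive. The `stream_tokens` generator below is a stand-in for a real streaming client:

```python
import time

def stream_tokens():
    # Stand-in for a streaming LLM response; a real client would yield
    # tokens from a server-sent-events or chunked HTTP stream.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated inter-token delay
        yield tok

start = time.perf_counter()
first_token_at = None
chunks = []
for tok in stream_tokens():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT ends at first token
    chunks.append(tok)  # a real UI would render each chunk immediately

ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.1f} ms, response: {''.join(chunks)}")
```

Even when total generation time is unchanged, streaming makes the perceived latency close to TTFT rather than the full response time.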

Inter-Token Latency (ITL)

What it measures

Average delay between consecutive output tokens during generation.

Why it matters

Lower ITL means responses stream to users more quickly, creating smoother, more natural interactions.

Focus area

ITL reflects the decoding phase efficiency of LLM inference. Consistent, low ITL values ensure smooth streaming output.

What to watch for

  • Spikes in ITL: May indicate GPU memory pressure or competing workloads
  • Consistent low ITL: Indicates healthy decoding performance
  • Gradual increases: Could signal need to scale replicas
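A simple way to watch for ITL spikes is to compute per-gap delays from token arrival timestamps and flag gaps that are far above the mean. This is an illustrative heuristic (the `factor=3.0` threshold is an assumption, not a Tensormesh setting):

```python
def inter_token_latencies(timestamps: list[float]) -> list[float]:
    """Delay (seconds) between each pair of consecutive output tokens."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def itl_spikes(itls: list[float], factor: float = 3.0) -> list[int]:
    """Return indices of gaps much larger than the mean (crude spike check)."""
    if not itls:
        return []
    mean = sum(itls) / len(itls)
    return [i for i, d in enumerate(itls) if d > factor * mean]

ts = [0.00, 0.02, 0.04, 0.30, 0.32]  # one ~260 ms stall mid-stream
itls = inter_token_latencies(ts)
assert itl_spikes(itls) == [2]  # the stall between tokens 3 and 4
```

In production you would feed this from per-token timestamps on the client and correlate flagged spikes with GPU utilization readings.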

Input Throughput

What it measures

Rate at which the system processes incoming requests during the prefill phase, commonly measured as Queries Per Second (QPS).

Why it matters

High input throughput is critical for scaling LLM services to handle many simultaneous users or long input contexts. The prefill computation is typically the primary bottleneck.

Optimization

Tensormesh’s caching significantly improves input throughput by reusing cached prefixes, reducing the computational load during the prefill phase.

Scaling considerations

  • Low throughput with high demand: Consider adding more replicas
  • Variable throughput: Review request patterns and cache effectiveness
  • Consistently high throughput: Your deployment is well-sized for current load
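QPS over a sliding window can be estimated client-side with a few lines of Python; this is a generic sketch, not a Tensormesh API:

```python
from collections import deque

class QPSCounter:
    """Sliding-window queries-per-second estimate over `window` seconds."""

    def __init__(self, window: float = 60.0):
        self.window = window
        self.arrivals: deque[float] = deque()

    def record(self, t: float) -> None:
        """Record a request arrival at timestamp t (seconds)."""
        self.arrivals.append(t)

    def qps(self, now: float) -> float:
        """Drop arrivals older than the window, then average the rest."""
        while self.arrivals and self.arrivals[0] < now - self.window:
            self.arrivals.popleft()
        return len(self.arrivals) / self.window

c = QPSCounter(window=10.0)
for t in range(30):          # one request per second for 30 s
    c.record(float(t))
assert c.qps(now=30.0) == 1.0  # only the last 10 arrivals count
```

Comparing this client-side rate against the dashboard's input throughput helps distinguish server-side slowdowns from changes in offered load.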

Output Throughput

What it measures

Rate at which the system generates tokens, measured in tokens per second.

Why it matters

Higher output throughput means complete responses are generated faster, improving overall system capacity and user satisfaction.

Focus area

Tied to the efficiency of the decoding phase in the inference process.

Performance indicators

  • Stable output throughput: Indicates consistent model performance
  • Declining throughput: May signal resource constraints or increased load
  • High variability: Could indicate batch size or scheduling inefficiencies
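Output throughput and its variability are straightforward to compute from decoding timings. The coefficient-of-variation check below is an illustrative heuristic for spotting the "high variability" case:

```python
import statistics

def output_throughput(token_count: int, gen_start: float, gen_end: float) -> float:
    """Average tokens per second over the decoding phase."""
    elapsed = gen_end - gen_start
    return token_count / elapsed if elapsed > 0 else 0.0

def throughput_cv(samples: list[float]) -> float:
    """Coefficient of variation across samples; high values suggest
    batch-size or scheduling inefficiencies rather than steady load."""
    mean = statistics.mean(samples)
    return statistics.pstdev(samples) / mean if mean else 0.0

assert output_throughput(512, 10.0, 14.0) == 128.0  # 512 tokens in 4 s
assert throughput_cv([120, 118, 122, 121]) < throughput_cv([120, 40, 200, 60])
```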

GPU Utilization

What it measures

Percentage of GPU compute resources being actively used for model inference.

Why it matters

GPU utilization helps you understand resource efficiency and identify potential bottlenecks or over-provisioning.

Optimal ranges

  • 60-85%: Healthy utilization with headroom for traffic spikes
  • 85-95%: Near-optimal usage, monitor for saturation
  • Above 95%: At capacity, consider scaling up
  • Below 40%: Potentially over-provisioned, consider scaling down
Consistently high GPU utilization (above 90%) may lead to increased latency during traffic spikes. Consider adding replicas proactively.
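For spot checks outside the dashboard, per-GPU utilization can be read with `nvidia-smi` (requires an NVIDIA driver); the classifier below simply mirrors the ranges above and is illustrative:

```python
import subprocess

def gpu_utilization() -> list[int]:
    """Per-GPU compute utilization (%) via nvidia-smi, if available."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

def classify(util: int) -> str:
    """Map a utilization reading to the guidance above."""
    if util > 95:
        return "at capacity - scale up"
    if util >= 85:
        return "near-optimal - monitor"
    if util >= 60:
        return "healthy"
    if util < 40:
        return "over-provisioned - consider scaling down"
    return "moderate"

assert classify(70) == "healthy"
```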

KV Cache Usage

What it measures

KV cache utilization relative to total available cache capacity.

Why it matters

KV cache headroom is critical for serving long-context requests and optimizing memory usage; running near capacity forces evictions that reduce the cache hit rate.

What to monitor

  • Low usage (below 50%): Cache is underutilized, consider caching more aggressively
  • Moderate usage (50-80%): Healthy cache utilization
  • High usage (above 80%): Approaching capacity limits, monitor eviction rates
  • Near capacity (above 95%): May experience cache thrashing or evictions

Optimization strategies

  • Analyze request patterns to identify cacheable contexts
  • Adjust cache eviction policies if needed
  • Scale cache capacity for high-reuse workloads
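To make the eviction-policy point concrete, here is a minimal LRU sketch of how a prefix cache under capacity pressure behaves. This is illustrative only; the actual Tensormesh backend policy may differ:

```python
from collections import OrderedDict

class PrefixCache:
    """Minimal LRU sketch of a KV-cache eviction policy (illustrative)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, str] = OrderedDict()  # prefix hash -> KV blob

    def get(self, key: str):
        if key not in self.entries:
            return None  # cache miss: prefill must recompute this prefix
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache(capacity=2)
cache.put("sys-prompt-a", "kv-a")
cache.put("sys-prompt-b", "kv-b")
cache.get("sys-prompt-a")          # refresh A
cache.put("sys-prompt-c", "kv-c")  # evicts B, the least recently used
assert cache.get("sys-prompt-b") is None
assert cache.get("sys-prompt-a") == "kv-a"
```

When usage stays above 95%, hot prefixes start evicting each other (thrashing), which shows up as a falling cache hit rate; scaling capacity breaks that cycle.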

Understanding Metric Relationships

Performance metrics are interconnected. Understanding these relationships helps you make informed optimization decisions.

Cache Hit Rate Impact

When cache hit rate increases:
  • TTFT decreases (faster first token delivery)
  • Input throughput increases (more requests processed efficiently)
  • GPU utilization decreases (less computation needed)
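The TTFT effect can be put in back-of-envelope terms: if cached prefixes skip prefill recomputation, average prefill cost scales roughly with the miss rate. This simple model is illustrative only; real savings depend on how much of each prompt the cached prefixes cover:

```python
def expected_prefill_ms(full_prefill_ms: float, hit_rate: float) -> float:
    """Rough model: average prefill cost scales with the miss rate,
    since cache hits skip recomputation of the cached prefix."""
    return full_prefill_ms * (1.0 - hit_rate)

# Raising the hit rate from 0% to 50% halves the expected prefill time
# (and hence TTFT's prefill component) in this model.
assert expected_prefill_ms(1000.0, 0.0) == 1000.0
assert expected_prefill_ms(1000.0, 0.5) == 500.0
```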

Throughput and Latency Trade-offs

Consider these performance trade-offs:
  • Higher input throughput may increase TTFT during peak load due to queuing
  • Increased GPU utilization may lead to higher ITL from resource contention
  • Adding replicas improves throughput and reduces individual request latency

Troubleshooting Common Issues

High TTFT

Possible causes:
  • Low cache hit rate
  • Large input contexts
  • GPU resource saturation
  • Network latency
Solutions:
  • Optimize prompts for better caching
  • Add more replicas to distribute load
  • Consider smaller models for latency-sensitive use cases
Low Cache Hit Rate

Possible causes:
  • Highly variable input patterns
  • No shared prompt prefixes
  • Cache capacity issues
Solutions:
  • Structure prompts with consistent system messages
  • Implement prompt templates for common queries
  • Review cache eviction policies
ITL Spikes

Possible causes:
  • GPU memory pressure
  • Competing workloads
  • Network instability
Solutions:
  • Monitor GPU utilization for saturation
  • Check for other deployments on shared resources
  • Review network connectivity and latency
Low Throughput

Possible causes:
  • Insufficient replicas
  • Long generation times
  • Batch size configuration
Solutions:
  • Scale up replica count
  • Optimize generation parameters (temperature, max_tokens)
  • Review batch processing configuration