The Metrics tab provides critical insights into your LLM deployment’s efficiency and responsiveness. Understanding these metrics helps you optimize costs, improve user experience, and scale effectively.

Accessing Metrics

Navigate to Operations → Dashboard → Select a Deployment → Metrics Tab to view performance data.

Time Range Selection

Metrics can be viewed across different time windows:
  • Last Hour — Real-time monitoring for immediate performance insights
  • Last Day — Daily trends and patterns
  • Last Week — Longer-term performance analysis

Key Performance Metrics

Cache Hit Rate

What it measures

Percentage of requests that retrieve precomputed results from the KV cache rather than recomputing them.

Why it matters

Higher cache hit rates translate directly to reduced computation costs and improved performance. This is Tensormesh’s primary cost-saving mechanism. A cache hit occurs when an input segment (a reused context or prefix) already exists in the KV cache backend and can be retrieved without recomputation.

Interpretation

  • High cache hit rate (above 70%): Excellent — you’re maximizing cost efficiency
  • Moderate cache hit rate (30–70%): Good — there’s room for optimization through prompt engineering
  • Low cache hit rate (below 30%): Consider reviewing request patterns for potential caching opportunities
Maximize cache hits by structuring prompts with consistent prefixes or system messages that can be reused across requests.
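The prefix-reuse idea can be sketched in Python. The chat-message structure below is a generic format, and the system prompt and few-shot example are hypothetical placeholders; the point is that the prefix stays byte-identical across requests so the cached KV entries for it can be reused:

```python
# Keep the system message and examples byte-identical across requests;
# only the final user turn varies, so prefill for the shared prefix
# can be served from the KV cache.
SHARED_PREFIX = [
    {"role": "system", "content": "You are a support assistant for Acme Corp."},
    {"role": "user", "content": "Example: How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset."},
]

def build_messages(user_query: str) -> list[dict]:
    """Reuse the identical prefix; append only the variable user turn."""
    return SHARED_PREFIX + [{"role": "user", "content": user_query}]

m1 = build_messages("How do I change my email?")
m2 = build_messages("Where can I download invoices?")
# The first len(SHARED_PREFIX) messages match exactly, so the second
# request's prefix is a cache hit candidate.
assert m1[:len(SHARED_PREFIX)] == m2[:len(SHARED_PREFIX)]
```

Avoid injecting per-request data (timestamps, request IDs) into the shared prefix, since any byte difference breaks prefix reuse.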

Time to First Token (TTFT)

What it measures

The time interval from when a user query arrives until the first output token is generated.

Why it matters

TTFT directly impacts perceived responsiveness in interactive applications like chatbots and document analysis tools. Lower TTFT creates a better user experience.

Typical range

  • Sub-second: Cached queries with high cache hit rates
  • 1-3 seconds: New queries depending on model size and input length
  • Above 3 seconds: Large models with long input contexts

Optimization tips

  • Leverage caching for frequently used contexts
  • Consider smaller models for latency-sensitive applications
  • Use streaming to display partial results early
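The streaming tip can be illustrated with a minimal sketch: measure TTFT as the gap between sending the request and receiving the first token, and render tokens as they arrive. The `stream_tokens` generator below is a stand-in for a real streaming client:

```python
import time

def stream_tokens():
    # Stand-in for a streaming LLM response; a real client would yield
    # tokens from a server-sent-events or chunked HTTP stream.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated inter-token delay
        yield tok

start = time.perf_counter()
first_token_at = None
chunks = []
for tok in stream_tokens():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT ends at first token
    chunks.append(tok)  # a real UI would render each chunk immediately

ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.1f} ms, response: {''.join(chunks)}")
```

Even when total generation time is unchanged, streaming makes the perceived latency close to TTFT rather than the full response time.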

Inter-Token Latency (ITL)

What it measures

Average delay between consecutive output tokens during generation.

Why it matters

Lower ITL means responses stream to users more quickly, creating smoother, more natural interactions.

Focus area

ITL reflects the decoding phase efficiency of LLM inference. Consistent, low ITL values ensure smooth streaming output.

What to watch for

  • Spikes in ITL: May indicate GPU memory pressure or competing workloads
  • Consistent low ITL: Indicates healthy decoding performance
  • Gradual increases: Could signal need to scale replicas
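A simple way to watch for ITL spikes is to compute per-gap delays from token arrival timestamps and flag gaps that are far above the mean. This is an illustrative heuristic (the `factor=3.0` threshold is an assumption, not a Tensormesh setting):

```python
def inter_token_latencies(timestamps: list[float]) -> list[float]:
    """Delay (seconds) between each pair of consecutive output tokens."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def itl_spikes(itls: list[float], factor: float = 3.0) -> list[int]:
    """Return indices of gaps much larger than the mean (crude spike check)."""
    if not itls:
        return []
    mean = sum(itls) / len(itls)
    return [i for i, d in enumerate(itls) if d > factor * mean]

ts = [0.00, 0.02, 0.04, 0.30, 0.32]  # one ~260 ms stall mid-stream
itls = inter_token_latencies(ts)
assert itl_spikes(itls) == [2]  # the stall between tokens 3 and 4
```

In production you would feed this from per-token timestamps on the client and correlate flagged spikes with GPU utilization readings.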

Input Throughput

What it measures

Rate at which the system processes incoming requests during the prefill phase, commonly measured as Queries Per Second (QPS).

Why it matters

High input throughput is critical for scaling LLM services to handle many simultaneous users or long input contexts. The prefill computation is typically the primary bottleneck.

Optimization

Tensormesh’s caching significantly improves input throughput by reusing cached prefixes, reducing the computational load during the prefill phase.

Scaling considerations

  • Low throughput with high demand: Consider adding more replicas
  • Variable throughput: Review request patterns and cache effectiveness
  • Consistently high throughput: Your deployment is well-sized for current load
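QPS over a sliding window can be estimated client-side with a few lines of Python; this is a generic sketch, not a Tensormesh API:

```python
from collections import deque

class QPSCounter:
    """Sliding-window queries-per-second estimate over `window` seconds."""

    def __init__(self, window: float = 60.0):
        self.window = window
        self.arrivals: deque[float] = deque()

    def record(self, t: float) -> None:
        """Record a request arrival at timestamp t (seconds)."""
        self.arrivals.append(t)

    def qps(self, now: float) -> float:
        """Drop arrivals older than the window, then average the rest."""
        while self.arrivals and self.arrivals[0] < now - self.window:
            self.arrivals.popleft()
        return len(self.arrivals) / self.window

c = QPSCounter(window=10.0)
for t in range(30):          # one request per second for 30 s
    c.record(float(t))
assert c.qps(now=30.0) == 1.0  # only the last 10 arrivals count
```

Comparing this client-side rate against the dashboard's input throughput helps distinguish server-side slowdowns from changes in offered load.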

Output Throughput

What it measures

Rate at which the system generates tokens, measured in tokens per second.

Why it matters

Higher output throughput means complete responses are generated faster, improving overall system capacity and user satisfaction.

Focus area

Tied to the efficiency of the decoding phase in the inference process.

Performance indicators

  • Stable output throughput: Indicates consistent model performance
  • Declining throughput: May signal resource constraints or increased load
  • High variability: Could indicate batch size or scheduling inefficiencies
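Output throughput and its variability are straightforward to compute from decoding timings. The coefficient-of-variation check below is an illustrative heuristic for spotting the "high variability" case:

```python
import statistics

def output_throughput(token_count: int, gen_start: float, gen_end: float) -> float:
    """Average tokens per second over the decoding phase."""
    elapsed = gen_end - gen_start
    return token_count / elapsed if elapsed > 0 else 0.0

def throughput_cv(samples: list[float]) -> float:
    """Coefficient of variation across samples; high values suggest
    batch-size or scheduling inefficiencies rather than steady load."""
    mean = statistics.mean(samples)
    return statistics.pstdev(samples) / mean if mean else 0.0

assert output_throughput(512, 10.0, 14.0) == 128.0  # 512 tokens in 4 s
assert throughput_cv([120, 118, 122, 121]) < throughput_cv([120, 40, 200, 60])
```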

GPU Utilization

What it measures

Percentage of GPU compute resources being actively used for model inference.

Why it matters

GPU utilization helps you understand resource efficiency and identify potential bottlenecks or over-provisioning.

Optimal ranges

  • 60-85%: Healthy utilization with headroom for traffic spikes
  • 85-95%: Near-optimal usage, monitor for saturation
  • Above 95%: At capacity, consider scaling up
  • Below 40%: Potentially over-provisioned, consider scaling down
Consistently high GPU utilization (above 90%) may lead to increased latency during traffic spikes. Consider adding replicas proactively.
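For spot checks outside the dashboard, per-GPU utilization can be read with `nvidia-smi` (requires an NVIDIA driver); the classifier below simply mirrors the ranges above and is illustrative:

```python
import subprocess

def gpu_utilization() -> list[int]:
    """Per-GPU compute utilization (%) via nvidia-smi, if available."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

def classify(util: int) -> str:
    """Map a utilization reading to the guidance above."""
    if util > 95:
        return "at capacity - scale up"
    if util >= 85:
        return "near-optimal - monitor"
    if util >= 60:
        return "healthy"
    if util < 40:
        return "over-provisioned - consider scaling down"
    return "moderate"

assert classify(70) == "healthy"
```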

KV Cache Usage

What it measures

KV cache utilization relative to total available cache capacity.

Why it matters

KV cache headroom is critical for serving long-context requests and optimizing memory usage; running near capacity forces evictions that reduce the cache hit rate.

What to monitor

  • Low usage (below 50%): Cache is underutilized, consider caching more aggressively
  • Moderate usage (50-80%): Healthy cache utilization
  • High usage (above 80%): Approaching capacity limits, monitor eviction rates
  • Near capacity (above 95%): May experience cache thrashing or evictions

Optimization strategies

  • Analyze request patterns to identify cacheable contexts
  • Adjust cache eviction policies if needed
  • Scale cache capacity for high-reuse workloads
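To make the eviction-policy point concrete, here is a minimal LRU sketch of how a prefix cache under capacity pressure behaves. This is illustrative only; the actual Tensormesh backend policy may differ:

```python
from collections import OrderedDict

class PrefixCache:
    """Minimal LRU sketch of a KV-cache eviction policy (illustrative)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, str] = OrderedDict()  # prefix hash -> KV blob

    def get(self, key: str):
        if key not in self.entries:
            return None  # cache miss: prefill must recompute this prefix
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache(capacity=2)
cache.put("sys-prompt-a", "kv-a")
cache.put("sys-prompt-b", "kv-b")
cache.get("sys-prompt-a")          # refresh A
cache.put("sys-prompt-c", "kv-c")  # evicts B, the least recently used
assert cache.get("sys-prompt-b") is None
assert cache.get("sys-prompt-a") == "kv-a"
```

When usage stays above 95%, hot prefixes start evicting each other (thrashing), which shows up as a falling cache hit rate; scaling capacity breaks that cycle.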

Understanding Metric Relationships

Performance metrics are interconnected. Understanding these relationships helps you make informed optimization decisions.

Cache Hit Rate Impact

When cache hit rate increases:
  • TTFT decreases (faster first token delivery)
  • Input throughput increases (more requests processed efficiently)
  • GPU utilization decreases (less computation needed)
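The TTFT effect can be put in back-of-envelope terms: if cached prefixes skip prefill recomputation, average prefill cost scales roughly with the miss rate. This simple model is illustrative only; real savings depend on how much of each prompt the cached prefixes cover:

```python
def expected_prefill_ms(full_prefill_ms: float, hit_rate: float) -> float:
    """Rough model: average prefill cost scales with the miss rate,
    since cache hits skip recomputation of the cached prefix."""
    return full_prefill_ms * (1.0 - hit_rate)

# Raising the hit rate from 0% to 50% halves the expected prefill time
# (and hence TTFT's prefill component) in this model.
assert expected_prefill_ms(1000.0, 0.0) == 1000.0
assert expected_prefill_ms(1000.0, 0.5) == 500.0
```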

Throughput and Latency Trade-offs

Consider these performance trade-offs:
  • Higher input throughput may increase TTFT during peak load due to queuing
  • Increased GPU utilization may lead to higher ITL from resource contention
  • Adding replicas improves throughput and reduces individual request latency

Troubleshooting Common Issues

High TTFT

Possible causes:
  • Low cache hit rate
  • Large input contexts
  • GPU resource saturation
  • Network latency
Solutions:
  • Optimize prompts for better caching
  • Add more replicas to distribute load
  • Consider smaller models for latency-sensitive use cases
Low Cache Hit Rate

Possible causes:
  • Highly variable input patterns
  • No shared prompt prefixes
  • Cache capacity issues
Solutions:
  • Structure prompts with consistent system messages
  • Implement prompt templates for common queries
  • Review cache eviction policies
ITL Spikes

Possible causes:
  • GPU memory pressure
  • Competing workloads
  • Network instability
Solutions:
  • Monitor GPU utilization for saturation
  • Check for other deployments on shared resources
  • Review network connectivity and latency
Low Throughput

Possible causes:
  • Insufficient replicas
  • Long generation times
  • Batch size configuration
Solutions:
  • Scale up replica count
  • Optimize generation parameters (temperature, max_tokens)
  • Review batch processing configuration