- Cache Hit Rate: Percentage of cache lookups that are hits, tracked over time.
- Time To First Token (TTFT): Time taken to generate the first output token.
- Inter-Token Latency (ITL): Time between consecutive tokens.
- Input Throughput: Input tokens processed per second.
- Output Throughput: Output tokens processed per second.

- Cache Hit Rate
  - What is a Cache Hit? A cache hit occurs when a segment of an LLM input (a reused context or prefix) is already stored in the KV cache backend and can be retrieved without recomputation.
  - Significance: A higher hit rate indicates more efficient use of the cache, reducing computation and improving overall performance, especially for repeated or similar requests; see the accounting sketch below.
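
The rate itself is simple bookkeeping: hits divided by total lookups. Below is a minimal sketch, assuming a block-level prefix cache keyed by a hash of the token prefix; `PrefixCache` and `BLOCK_SIZE` are illustrative names, not any particular library's API.

```python
# Minimal sketch: block-level prefix-cache bookkeeping. A lookup walks
# the prompt in fixed-size blocks; each block whose prefix hash is
# already stored counts as a hit.
class PrefixCache:
    BLOCK_SIZE = 16  # tokens per cached KV block (assumed block size)

    def __init__(self):
        self.blocks = set()  # hashes of prefixes whose KV blocks are cached
        self.hits = 0
        self.lookups = 0

    def lookup(self, token_ids):
        n_full = len(token_ids) // self.BLOCK_SIZE
        for i in range(1, n_full + 1):
            key = hash(tuple(token_ids[: i * self.BLOCK_SIZE]))
            self.lookups += 1
            if key in self.blocks:
                self.hits += 1
            else:
                self.blocks.add(key)  # stand-in for computing and storing the KV block

    def hit_rate(self):
        # Cache Hit Rate = hits / total lookups
        return self.hits / self.lookups if self.lookups else 0.0
```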
- Time To First Token (TTFT)
  - Significance: TTFT directly impacts the interactive user experience in applications like chatbots and document analysis, as a shorter TTFT makes the application feel more responsive. TTFT is measured together with ITL in the sketch after the ITL notes below.
- Inter-Token Latency (ITL)
  - Focus: ITL is primarily a concern during the decoding phase of LLM inference.
  - Significance: This metric reflects the steady-state speed of the model’s output generation. A lower ITL means the output streams to the user more quickly.
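
Both latency metrics fall out of timestamping a token stream: TTFT is the gap from request submission to the first token, and ITL is the gap between consecutive tokens. A minimal sketch, assuming a hypothetical `stream_tokens` generator that yields output tokens one at a time:

```python
# Minimal sketch: timestamping a token stream to derive TTFT and ITL.
# `stream_tokens` is a hypothetical stand-in for any client or engine
# API that streams output tokens.
import time

def measure_latency(stream_tokens, prompt):
    t_start = time.perf_counter()
    timestamps = []
    for _token in stream_tokens(prompt):
        timestamps.append(time.perf_counter())
    if not timestamps:
        return None, None  # no tokens generated

    ttft = timestamps[0] - t_start                 # Time To First Token
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # mean Inter-Token Latency
    return ttft, itl
```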
- Input Throughput
  - Common Measurement: Frequently reported as a query processing rate, i.e., Queries Per Second (QPS), or directly as input tokens per second.
  - Significance: High input throughput is critical because prefill, the compute-intensive processing of the input prompt, is the primary bottleneck when scaling LLM services to many simultaneous users or long inputs. A combined throughput sketch follows the output-throughput notes below.
- Output Throughput
  - Common Measurement: Usually measured in tokens per second, or indirectly reflected in the system’s overall Queries Per Second (QPS).
  - Focus: This metric is intrinsically tied to the efficiency of the decoding phase of the LLM inference process.
  - Significance: A higher output throughput means the model can generate the full response faster; both throughput metrics are sketched below.
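
Both throughputs reduce to token counts over wall-clock time. A minimal sketch, assuming per-request bookkeeping with hypothetical `prompt_tokens` and `output_tokens` fields rather than any specific framework's API:

```python
# Minimal sketch: deriving throughput metrics from completed requests.
# `requests` is a list of dicts with assumed keys; `wall_clock_s` is the
# elapsed benchmark time in seconds.
def throughput(requests, wall_clock_s):
    total_in = sum(r["prompt_tokens"] for r in requests)
    total_out = sum(r["output_tokens"] for r in requests)
    return {
        "input_tok_per_s": total_in / wall_clock_s,    # input throughput
        "output_tok_per_s": total_out / wall_clock_s,  # output throughput
        "qps": len(requests) / wall_clock_s,           # query processing rate
    }

# Example: 10 requests, each with 512 prompt and 128 output tokens,
# completed in 4 s -> 1280 input tok/s, 320 output tok/s, 2.5 QPS.
```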

