This page describes the metrics exposed in Prometheus format by the inference router and related services. These are numeric time series intended for dashboards, SLOs, and alerts.
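Since all of these metrics are served in the Prometheus text exposition format, a consumer only needs to split each sample line into a metric name, a label set, and a value. The sketch below is a minimal, stdlib-only parser for a single sample line; it is illustrative only (it ignores comments, timestamps, and label-value escaping), and the `queue_length` sample shown is a hypothetical example of what the router might expose.

```python
import re

def parse_sample(line: str):
    """Parse one sample line of the Prometheus text exposition format.

    Returns (metric_name, labels_dict, value). Minimal sketch: it does
    not handle comment lines, timestamps, or escaped label values.
    """
    m = re.match(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$', line)
    if not m:
        raise ValueError(f"unparseable sample: {line!r}")
    name, label_body, value = m.groups()
    labels = {}
    if label_body:
        # Label pairs look like key="value", separated by commas.
        for key, val in re.findall(r'(\w+)="([^"]*)"', label_body):
            labels[key] = val
    return name, labels, float(value)

# Hypothetical sample, as the router might expose it per model and QoS.
name, labels, value = parse_sample('queue_length{model="m1",qos="high"} 12')
```

In practice you would use an existing Prometheus client library rather than hand-rolling this, but the line format itself is stable and easy to inspect with `curl` against a metrics endpoint.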

Inference router metrics

Inference router metrics describe queues, scheduling, and request lifecycle in the core inference layer.

Inference router metrics table

| Metric | Category | Prometheus name | Description | Granularity |
| --- | --- | --- | --- | --- |
| Queue length | Queue | `queue_length` | Number of requests currently queued in the router. | Per model, QoS, and/or user |
| Max queue wait time | Queue | `queue_max_wait_seconds` | Maximum age (seconds) of any request currently in the queue. | Per model, QoS |
| Customer queue length | Queue | `customer_queue_length` | Queue length per customer per model. | Per user, model |
| Submitted requests | Traffic | `submitted_total` | Total number of requests submitted to the router. | Per model, QoS, user, status |
| Completed requests | Traffic | `completed_total` | Total number of completed requests, labeled with completion status (success, error, etc.). | Per model, QoS, user, status |
| Response codes | Traffic | `response_code_total` | Count of HTTP responses by status code. | Per HTTP code, route, user |
| Response latency | Latency | `response_duration_ms` | End-to-end response latency in milliseconds (often as a histogram or summary). | Per model, QoS, customer |
| Connection state | Workers | `connection_state_ratio` | Fraction of workers in each state (idle, busy, draining, unhealthy, etc.). | Per worker state, model, pool |
| Active users | Adoption | `active_users` | Number of active users observed by the router. | Global and/or per user |
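Several of these metrics are most useful in aggregated form, e.g. an error ratio derived from `completed_total` broken down by the `status` label. A dashboard would normally compute this with a PromQL expression; the stdlib-only sketch below shows the same aggregation over already-parsed samples, with hypothetical label values (`m1`, `m2`) used purely for illustration.

```python
from collections import defaultdict

# Hypothetical parsed completed_total samples: (labels, counter value).
samples = [
    ({"model": "m1", "status": "success"}, 950.0),
    ({"model": "m1", "status": "error"}, 50.0),
    ({"model": "m2", "status": "success"}, 100.0),
]

def error_ratio_by_model(samples):
    """Aggregate completed_total samples into a per-model error ratio.

    Mirrors a PromQL-style sum-by-label aggregation: total completions
    and error completions are summed per model, then divided.
    """
    totals = defaultdict(float)
    errors = defaultdict(float)
    for labels, value in samples:
        totals[labels["model"]] += value
        if labels["status"] == "error":
            errors[labels["model"]] += value
    return {model: errors[model] / totals[model] for model in totals}

ratios = error_ratio_by_model(samples)
# ratios["m1"] == 0.05 (50 errors out of 1000 completions)
```

Note that `completed_total` is a monotonically increasing counter, so real SLO queries should be computed over rate-adjusted windows (e.g. PromQL `rate(...)`) rather than raw cumulative values as in this simplified sketch.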
Metric names and label sets may evolve over time. Refer to the release notes for changes to the metric schema.