Monitoring & Observability¶
NovaRoute exposes Prometheus metrics, HTTP health endpoints, and a gRPC event stream for observing the routing control plane.
Prometheus Metrics¶
The metrics endpoint is served at :9102/metrics by default. Configure the listen address with the metrics_address field in your config JSON.
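Add the conventional Prometheus scrape annotations to your pod metadata: prometheus.io/scrape: "true", prometheus.io/port: "9102", and prometheus.io/path: "/metrics". These take effect only if your Prometheus scrape configuration is set up to honor them.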
Counters¶
| Metric | Labels | Description |
|---|---|---|
| novaroute_intents_total | owner, type, operation | Intent operations counted by owner, type (peer, prefix, bfd, ospf), and operation (set, remove). |
| novaroute_frr_transactions_total | result | FRR vtysh operations by result (success, failure). |
| novaroute_policy_violations_total | owner, reason | Policy check failures by owner and reason (invalid_token, peer_operation_denied, bfd_operation_denied, ospf_operation_denied, prefix_denied, conflict). |
| novaroute_events_total | type | Events emitted by type (e.g. EVENT_TYPE_PEER_UP, EVENT_TYPE_PEER_DOWN). |
| novaroute_events_dropped_total | | Events dropped because a subscriber could not keep up. |
| novaroute_monitoring_errors_total | protocol | FRR state monitoring errors by protocol (bgp, bfd, ospf). |
Gauges¶
| Metric | Labels | Description |
|---|---|---|
| novaroute_active_peers | owner | Current number of BGP peers per owner. |
| novaroute_active_prefixes | owner, protocol | Currently advertised prefixes per owner and protocol. |
| novaroute_active_bfd_sessions | owner | Active BFD sessions per owner. |
| novaroute_active_ospf_interfaces | owner | OSPF interfaces per owner. |
| novaroute_registered_owners | | Total number of registered owners. |
| novaroute_frr_connected | | FRR connection status. 1 means connected, 0 means disconnected. |
Histograms¶
| Metric | Labels | Description |
|---|---|---|
| novaroute_grpc_request_duration_seconds | method | gRPC request latency per RPC method. |
| novaroute_frr_transaction_duration_seconds | | Latency of individual FRR vtysh operations. |
| novaroute_reconcile_cycle_duration_seconds | | Duration of each reconciliation loop cycle. |
Example PromQL Queries¶
Intent operations per second by type:
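One way to write this, assuming a 5-minute rate window and aggregation over every label except type:

```promql
# Per-second rate of intent operations, summed across owners and operations.
sum by (type) (rate(novaroute_intents_total[5m]))
```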
FRR transaction failure rate (failures as a fraction of all transactions):
sum(rate(novaroute_frr_transactions_total{result="failure"}[5m]))
/ sum(rate(novaroute_frr_transactions_total[5m]))
95th percentile gRPC latency per method:
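A sketch using histogram_quantile over the histogram's _bucket series, assuming a 5-minute window. The le label must be kept in the aggregation for the quantile to be computed:

```promql
# p95 gRPC latency in seconds, one series per RPC method.
histogram_quantile(
  0.95,
  sum by (le, method) (rate(novaroute_grpc_request_duration_seconds_bucket[5m]))
)
```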
Policy violation rate by reason:
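A possible form, assuming a 5-minute window and summing across owners:

```promql
# Per-second policy violations, grouped by rejection reason.
sum by (reason) (rate(novaroute_policy_violations_total[5m]))
```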
Active BGP peers across all owners:
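Because novaroute_active_peers is a gauge labelled by owner, a plain sum should be enough here:

```promql
# Total BGP peers across all owners (and all scraped instances).
sum(novaroute_active_peers)
```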
Reconcile cycle 99th percentile duration:
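A sketch that mirrors the expression used in the NovaRouteReconcileSlow alert, with an added aggregation over instances as an assumption:

```promql
# p99 reconcile cycle duration in seconds across all instances.
histogram_quantile(
  0.99,
  sum by (le) (rate(novaroute_reconcile_cycle_duration_seconds_bucket[5m]))
)
```

Drop the sum by (le) wrapper to get one series per instance instead.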
Suggested Grafana Panels¶
Overview Row¶
- Registered Owners -- Stat panel showing novaroute_registered_owners.
- FRR Connection Status -- Stat panel with value mapping: novaroute_frr_connected where 1 = Connected (green) and 0 = Disconnected (red).
- Active BGP Peers -- Stat panel showing sum(novaroute_active_peers).
BGP / BFD / OSPF Row¶
- Peers by Owner -- Bar gauge of novaroute_active_peers grouped by owner.
- Prefixes by Owner and Protocol -- Table of novaroute_active_prefixes with owner and protocol columns.
- BFD Sessions -- Time series of novaroute_active_bfd_sessions over time.
- OSPF Interfaces -- Time series of novaroute_active_ospf_interfaces over time.
Performance Row¶
- gRPC Latency (p95) -- Time series of histogram_quantile(0.95, rate(novaroute_grpc_request_duration_seconds_bucket[5m])) grouped by method.
- Reconcile Cycle Duration (p99) -- Time series of reconcile cycle histogram quantile.
- FRR Transaction Duration (p95) -- Time series of FRR operation latency histogram quantile.
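The two panels above that do not spell out an expression could be driven by queries along these lines. This is a sketch: the 5-minute window and the aggregation across instances are assumptions, and the _bucket suffix follows the standard Prometheus histogram naming.

```promql
# Reconcile Cycle Duration (p99)
histogram_quantile(
  0.99,
  sum by (le) (rate(novaroute_reconcile_cycle_duration_seconds_bucket[5m]))
)

# FRR Transaction Duration (p95)
histogram_quantile(
  0.95,
  sum by (le) (rate(novaroute_frr_transaction_duration_seconds_bucket[5m]))
)
```

Each query goes into its own panel target; they are shown together only for comparison.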
Error Rate Row¶
- FRR Transaction Failures -- Time series of rate(novaroute_frr_transactions_total{result="failure"}[5m]).
- Policy Violations by Reason -- Stacked time series of rate(novaroute_policy_violations_total[5m]) grouped by reason.
- Monitoring Errors by Protocol -- Time series of rate(novaroute_monitoring_errors_total[5m]) grouped by protocol.
- Dropped Events -- Time series of rate(novaroute_events_dropped_total[5m]).
Alerting Rules¶
groups:
  - name: novaroute
    rules:
      - alert: NovaRouteFRRDisconnected
        expr: novaroute_frr_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NovaRoute lost connection to FRR"
          description: "The FRR sidecar has been unreachable for more than 1 minute on {{ $labels.instance }}."
      - alert: NovaRouteHighPolicyViolations
        expr: rate(novaroute_policy_violations_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of policy violations"
          description: "Owner {{ $labels.owner }} is generating policy violations (reason: {{ $labels.reason }}) at {{ $value }}/s."
      - alert: NovaRouteReconcileFailures
        expr: rate(novaroute_frr_transactions_total{result="failure"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FRR transaction failures detected"
          description: "FRR vtysh transactions are failing at {{ $value }}/s on {{ $labels.instance }}."
      - alert: NovaRouteReconcileSlow
        expr: histogram_quantile(0.99, rate(novaroute_reconcile_cycle_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Reconciliation cycles are slow"
          description: "The p99 reconcile cycle duration exceeds 5 seconds on {{ $labels.instance }}."
      - alert: NovaRoutePeerDown
        expr: delta(novaroute_active_peers[5m]) < -1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "BGP peers lost"
          description: "Owner {{ $labels.owner }} lost peers in the last 5 minutes on {{ $labels.instance }}."
      - alert: NovaRouteEventsDropped
        expr: rate(novaroute_events_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Events are being dropped"
          description: "Slow subscribers are causing event drops at {{ $value }}/s on {{ $labels.instance }}."
Health Endpoints¶
NovaRoute exposes two HTTP health endpoints on the metrics port (:9102 by default):
| Endpoint | Purpose | Healthy | Unhealthy |
|---|---|---|---|
| /healthz | Liveness probe | Always returns 200 OK. | N/A -- always healthy if the process is running. |
| /readyz | Readiness probe | 200 OK when FRR is connected. | 503 Service Unavailable when FRR is disconnected. |
Use these in your Kubernetes pod spec:
livenessProbe:
  httpGet:
    path: /healthz
    port: 9102
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9102
  initialDelaySeconds: 5
  periodSeconds: 5
Event Streaming¶
The StreamEvents gRPC RPC provides a real-time stream of routing state changes. The event types are defined in the EventType proto enum:
- EVENT_TYPE_UNSPECIFIED -- Default/unset value
- EVENT_TYPE_PEER_UP -- BGP peer session reached Established state
- EVENT_TYPE_PEER_DOWN -- BGP peer session dropped
- EVENT_TYPE_PREFIX_ADVERTISED -- Prefix was successfully advertised to FRR
- EVENT_TYPE_PREFIX_WITHDRAWN -- Prefix was withdrawn from FRR
- EVENT_TYPE_BFD_UP -- BFD session reached Up state
- EVENT_TYPE_BFD_DOWN -- BFD session went Down
- EVENT_TYPE_OSPF_NEIGHBOR_UP -- OSPF neighbor adjacency formed
- EVENT_TYPE_OSPF_NEIGHBOR_DOWN -- OSPF neighbor adjacency lost
- EVENT_TYPE_FRR_CONNECTED -- Agent connected to FRR daemon
- EVENT_TYPE_FRR_DISCONNECTED -- Agent lost connection to FRR daemon
- EVENT_TYPE_OWNER_REGISTERED -- An owner registered a session
- EVENT_TYPE_OWNER_DEREGISTERED -- An owner deregistered
- EVENT_TYPE_POLICY_VIOLATION -- A request was rejected by prefix policy
- EVENT_TYPE_BGP_CONFIG_CHANGED -- BGP global configuration (AS/router-id) changed
Filtering¶
You can filter the event stream by owner and/or event type in the StreamEventsRequest:
message StreamEventsRequest {
  string owner_filter = 1;          // optional: only events for this owner
  repeated string event_types = 2;  // optional: only these event types
}
CLI Usage¶
# Stream all events
novaroutectl events
# Stream events for a specific owner
novaroutectl events --owner=my-controller
# Stream only peer events
novaroutectl events --types=PEER_UP,PEER_DOWN
If a subscriber cannot consume events fast enough, events are dropped and counted by the novaroute_events_dropped_total metric.