SambaStack’s monitoring and observability reference architecture uses open-source components to provide metrics, logs, and dashboards for on-premises deployments. This guide describes how to deploy Prometheus, Grafana, the Prometheus Operator, and Node Exporter into a monitoring namespace using the official kube-prometheus-stack Helm chart.
Reference Architecture Note: This setup uses third-party components (Prometheus, Grafana, etc.). Versions, defaults, and command syntax may change over time. Address any issues not specific to SambaStack to the vendor or project that owns that component.

Prerequisites

Before you begin, ensure the following requirements are met:
  • kubectl — Configured with access to your target Kubernetes cluster
  • Helm (latest version) — Verify with helm version
  • jq — For parsing JSON output during verification
  • Monitoring namespace — If it does not exist, it will be created in Step 1
  • Storage class — A valid storage class for Prometheus persistent storage

Optional prerequisites

  • OpenSearch — Required only if you want Grafana to visualize logs. If deploying with OpenSearch integration, complete the OpenSearch deployment first. The opensearch-initial-admin-password secret must exist.
Deployment Order: If using the full monitoring stack, deploy in this order: OpenSearch → Fluent Bit → Prometheus/Grafana. If you only need metrics (no log visualization), you can deploy Prometheus/Grafana independently.

Resource requirements

The following are recommended minimum resources for the monitoring stack:
| Component                | CPU Request | Memory Request | Storage |
|--------------------------|-------------|----------------|---------|
| Prometheus (per replica) | 500m        | 2Gi            | 30Gi    |
| Grafana                  | 250m        | 512Mi          |         |
| Node Exporter (per node) | 100m        | 128Mi          |         |
| Prometheus Operator      | 200m        | 256Mi          |         |
For larger deployments (100+ nodes or high-cardinality metrics), increase Prometheus memory to 4–8Gi and storage to 50–100Gi. Adjust retentionSize accordingly in the values file.
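As a reference, here is a hedged sketch of what those overrides could look like in the values file created in Step 4; the key layout follows kube-prometheus-stack, and the exact numbers depend on your node count and metric cardinality:
# Example overrides for a larger deployment (adjust to your environment)
prometheus:
  prometheusSpec:
    retentionSize: 90GB          # keep this below the PVC size
    resources:
      requests:
        cpu: "1"
        memory: 8Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi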

Architecture overview

| Component           | Purpose                                                                             |
|---------------------|-------------------------------------------------------------------------------------|
| Prometheus          | Collects and stores time-series metrics from SambaStack services and cluster nodes |
| Prometheus Operator | Manages Prometheus configuration via Kubernetes CRDs (ServiceMonitor, etc.)        |
| Node Exporter       | Exposes node/rack-level host metrics for Prometheus to scrape                      |
| Grafana             | Visualizes Prometheus metrics and OpenSearch logs in dashboards                    |

Data flows

Metrics path:
SambaStack services / Node Exporter → /metrics → Prometheus → Grafana dashboards
Logs path (requires OpenSearch):
Pods / system logs → Fluent Bit → OpenSearch → Grafana (log panels)

Deployment steps

Step 1: Create the monitoring namespace

Skip this step if the namespace already exists from an OpenSearch deployment. Create the namespace configuration file at ~/.sambastack-observability/monitoring-namespace.yaml:
# monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
Apply the configuration:
kubectl apply -f monitoring-namespace.yaml
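Confirm the namespace exists before continuing:
kubectl get namespace monitoring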

Step 2: Add Helm repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
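Optionally, confirm that the chart version used in Step 5 is available from the repository:
helm search repo prometheus-community/kube-prometheus-stack --versions | head -n 5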

Step 3: Create Grafana admin credentials secret

Create a secret with base64-encoded username and password at ~/.sambastack-observability/monitoring/grafana-initial-admin-credentials-secret.yaml:
# grafana-initial-admin-credentials-secret.yaml
apiVersion: v1
data:
  admin-user: <base64-encoded-username>
  admin-password: <base64-encoded-password>
kind: Secret
metadata:
  name: grafana-initial-admin-credentials
  namespace: monitoring
type: Opaque
To base64 encode a value: echo -n 'your-value' | base64
Apply the secret:
kubectl -n monitoring apply -f grafana-initial-admin-credentials-secret.yaml
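Alternatively, you can let kubectl handle the base64 encoding; this sketch generates an equivalent secret manifest (replace the placeholder credentials with your own):
# Generate the same Secret without manual base64 encoding
kubectl -n monitoring create secret generic grafana-initial-admin-credentials \
  --from-literal=admin-user='<your-username>' \
  --from-literal=admin-password='<your-password>' \
  --dry-run=client -o yaml > grafana-initial-admin-credentials-secret.yaml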

Step 4: Create values file

Create the following file at ~/.sambastack-observability/monitoring/prometheus-grafana-values.yaml. Choose the appropriate configuration based on whether you are integrating with OpenSearch.
Use the configuration below if you have deployed OpenSearch and want log visualization in Grafana.
This configuration requires the opensearch-initial-admin-password secret to exist in the monitoring namespace. See Log Storage - OpenSearch.
# prometheus-grafana-values.yaml (with OpenSearch)
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
kubernetesServiceMonitors:
  enabled: false
defaultRules:
  create: false
nodeExporter:
  enabled: true
prometheusOperator:
  serviceMonitor:
    selfMonitor: false
  nodeSelector: {}
  tolerations: []
  tls:
    enabled: false
  admissionWebhooks:
    enabled: false
prometheus:
  enabled: true
  serviceMonitor:
    selfMonitor: false
  prometheusSpec:
    replicas: 2
    nodeSelector: {}
    tolerations: []
    serviceMonitorSelector:
      matchLabels: {}
    ruleSelector:
      matchLabels: {}
    scrapeConfigSelector:
      matchLabels: {}
    retentionSize: 25GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: <your-storage-class>
          resources:
            requests:
              storage: 30Gi
grafana:
  enabled: true
  defaultDashboardsEnabled: false
  serviceMonitor:
    enabled: false
  admin:
    existingSecret: grafana-initial-admin-credentials
    userKey: admin-user
    passwordKey: admin-password
  extraEnvs:
  - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
    valueFrom:
      secretKeyRef:
        name: opensearch-initial-admin-password
        key: OPENSEARCH_INITIAL_ADMIN_PASSWORD
  plugins:
  - grafana-opensearch-datasource
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://kube-prometheus-stack-prometheus:9090
          isDefault: true
        - name: OpenSearch-Logs
          type: grafana-opensearch-datasource
          access: proxy
          url: https://opensearch-cluster-master.{{ .Release.Namespace }}.svc.cluster.local:9200
          withCredentials: true
          basicAuth: true
          basicAuthUser: admin
          basicAuthPassword: $__env{OPENSEARCH_INITIAL_ADMIN_PASSWORD}
          editable: true
          readOnly: false
          jsonData:
            timeField: "@timestamp"
            database: logs-7d
            tlsSkipVerify: true
            logLevelField: log_level
            logMessageField: message
            version: 2.12.0
            versionLabel: OpenSearch 2.12.0
            flavor: opensearch
            maxConcurrentShardRequests: 5
            pplEnabled: true
            serverless: false
Replace <your-storage-class> with your cluster’s storage class. To find available storage classes: kubectl get storageclass
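Optionally, check that the values file renders cleanly against the chart before installing; this sketch uses helm template with the repository added in Step 2:
helm template kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --version 60.2.0 -n monitoring \
  -f prometheus-grafana-values.yaml > /dev/null && echo "values file renders cleanly"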

Component configuration summary

| Component           | Status     | Purpose                               |
|---------------------|------------|---------------------------------------|
| Prometheus          | ✓ Enabled  | Metrics storage and querying          |
| Prometheus Operator | ✓ Enabled  | Manages Prometheus config via CRDs    |
| Grafana             | ✓ Enabled  | Visualization for metrics and logs    |
| Node Exporter       | ✓ Enabled  | Node/rack-level host metrics          |
| Alertmanager        | ✗ Disabled | Not enabled in this minimal reference |

Step 5: Install kube-prometheus-stack

Run the Helm install command:
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --version 60.2.0 \
  -n monitoring \
  -f prometheus-grafana-values.yaml
This deploys Prometheus, Prometheus Operator, Node Exporter, and Grafana.
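The pods can take a few minutes to start. To watch them come up before moving on to verification:
kubectl -n monitoring get pods -w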

Step 6: Create ServiceMonitor for inference router

Prometheus uses a ServiceMonitor to discover and scrape metrics from the SambaStack inference router. Create the following file at ~/.sambastack-observability/monitoring/inference-router-sm.yaml:
# inference-router-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-router-sm
  namespace: monitoring
spec:
  endpoints:
    - interval: 30s
      port: inference-router
      path: /v1/metrics
  selector:
    matchLabels:
      sambanova.ai/app: inference-router
Apply the ServiceMonitor:
kubectl apply -f inference-router-sm.yaml
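The ServiceMonitor selects Services labeled sambanova.ai/app: inference-router and scrapes the port named inference-router at /v1/metrics. As a quick sanity check that such a Service exists and exposes a port with that name (the namespace of the inference router depends on your SambaStack installation):
kubectl get svc --all-namespaces -l sambanova.ai/app=inference-router \
  -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PORTS:.spec.ports[*].name'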

Verification

Check pod status

Check Prometheus pods:
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
Expected output:
NAME                                    READY   STATUS    RESTARTS   AGE
prometheus-kube-prometheus-stack-prometheus-0   2/2     Running   0          5m
prometheus-kube-prometheus-stack-prometheus-1   2/2     Running   0          5m
Check Grafana pod:
kubectl -n monitoring get pods -l app.kubernetes.io/name=grafana
Expected output:
NAME                                          READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-grafana-XXXXX-XXXXX     1/1     Running   0          5m
Check Node Exporter pods:
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus-node-exporter

Access the UIs

Port-forward Prometheus:
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
Access at: http://localhost:9090
To stop: pkill -f "port-forward.*9090"
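The Grafana verification steps below assume Grafana is reachable at http://localhost:8080. A minimal port-forward sketch, assuming the chart's default Grafana service name and service port (confirm with kubectl -n monitoring get svc):
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 8080:80 &
Access at: http://localhost:8080
To stop: pkill -f "port-forward.*8080"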

Retrieve Grafana credentials

kubectl -n monitoring get secret grafana-initial-admin-credentials -o json \
  | jq -r '.data | to_entries[] | "\(.key): \(.value | @base64d)"'

Verify Prometheus targets

  1. Access Prometheus UI at http://localhost:9090
  2. Navigate to Status → Targets
  3. Verify node-exporter targets show as UP
  4. Verify inference-router-sm target shows as UP (if SambaStack is running)
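You can also check target health from the command line through the Prometheus HTTP API (this assumes the port-forward from the previous section is still active; jq is listed in the prerequisites):
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'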

Verify Grafana datasources

  1. Log in to Grafana at http://localhost:8080
  2. Navigate to Connections → Data sources
  3. Verify Prometheus datasource shows “Data source is working”
  4. If using OpenSearch integration, verify OpenSearch-Logs datasource shows “Data source is working”

Success criteria

The installation is complete when:
  • Prometheus pods (2 replicas) are running in the monitoring namespace
  • Grafana pod is running and accessible
  • Node Exporter pods are running on all nodes
  • Prometheus shows Node Exporter targets as UP
  • Grafana accepts the configured admin credentials
  • Prometheus datasource in Grafana shows “Data source is working”
  • (If configured) OpenSearch datasource in Grafana shows “Data source is working”

Import Node Exporter dashboard

To visualize node/rack-level metrics from Node Exporter:
  1. Log in to Grafana: Use the credentials retrieved in the verification section.
  2. Navigate to Import: Go to Dashboards → New → Import.
  3. Import the dashboard: Enter dashboard ID 1860 and click Load.
  4. Select the datasource: Select Prometheus as the datasource and click Import.
This dashboard includes: CPU usage, memory usage, disk I/O, network metrics, and node health.
Dashboard ID 1860 is the community “Node Exporter Full” dashboard. This is ideal for per-rack visibility in SambaStack deployments.

Configuration reference

| Parameter | Default | Description |
|-----------|---------|-------------|
| prometheus.prometheusSpec.replicas | 2 | Number of Prometheus replicas for HA |
| prometheus.prometheusSpec.retentionSize | 25GB | Maximum storage before oldest data is deleted |
| prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage | 30Gi | PVC size for Prometheus data |
| grafana.admin.existingSecret | grafana-initial-admin-credentials | Secret containing admin credentials |
| nodeExporter.enabled | true | Deploy Node Exporter DaemonSet |
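Any of these parameters can be changed by editing the values file and re-running the install command from Step 5; for a one-off change, an override can also be passed on the command line, as in this sketch:
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --version 60.2.0 \
  -n monitoring \
  -f prometheus-grafana-values.yaml \
  --set prometheus.prometheusSpec.replicas=3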

Troubleshooting

Prometheus pods stuck in Pending

Symptom: Prometheus pods remain in Pending status.
Cause: PersistentVolumeClaim cannot be fulfilled.
Solution:
# Check PVC status
kubectl -n monitoring get pvc

# Verify storage class exists
kubectl get storageclass

Grafana shows “Data source is not working” for OpenSearch

Symptom: OpenSearch datasource test fails in Grafana.
Possible causes:
  1. OpenSearch not deployed: Deploy OpenSearch first. See Log Storage - OpenSearch.
  2. Secret missing: Verify the secret exists:
    kubectl -n monitoring get secret opensearch-initial-admin-password
    
  3. OpenSearch not ready: Check OpenSearch pod status:
    kubectl -n monitoring get pod opensearch-cluster-master-0
    

ServiceMonitor not scraping targets

Symptom: Custom ServiceMonitor targets don't appear in Prometheus.
Solution: Verify that the ServiceMonitor is in the monitoring namespace and that its selector labels and port name match the target Service:
kubectl -n monitoring get servicemonitor
kubectl -n monitoring describe servicemonitor inference-router-sm
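If the labels and port name look correct, check the Prometheus Operator logs for reconciliation errors; the deployment name below assumes the chart's default naming (confirm with kubectl -n monitoring get deploy):
kubectl -n monitoring logs deploy/kube-prometheus-stack-operator --tail=50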

Node Exporter pods not running on all nodes

Symptom: Fewer Node Exporter pods than cluster nodes.
Cause: Nodes may have taints that prevent scheduling.
Solution: Add tolerations to the Node Exporter configuration or remove the node taints.
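A hedged sketch of the tolerations override in the values file; the prometheus-node-exporter key reflects the subchart name used by kube-prometheus-stack, and the taint key and effect shown are placeholders for your own:
# Example Node Exporter tolerations (replace with your node taints)
prometheus-node-exporter:
  tolerations:
    - key: "<your-taint-key>"
      operator: "Exists"
      effect: "NoSchedule"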

Next steps

After Prometheus and Grafana are running:
  1. Create custom dashboards — Build dashboards for SambaStack-specific metrics like inference latency, QPS, and accelerator utilization.
  2. Add more ServiceMonitors — Create ServiceMonitors for other SambaStack components that expose Prometheus metrics.
  3. Enable alerting — Configure Alertmanager for production monitoring. Update the values file to set alertmanager.enabled: true.
  4. Explore logs — If OpenSearch and Fluent Bit are deployed, use Grafana’s Explore feature to query the logs-7d index.

Cleanup

To remove the monitoring stack from your cluster:
Uninstall the Helm release:
helm uninstall kube-prometheus-stack -n monitoring
Delete the Grafana credentials secret:
kubectl delete secret grafana-initial-admin-credentials -n monitoring
Delete Prometheus PersistentVolumeClaims to free storage:
kubectl -n monitoring delete pvc -l app.kubernetes.io/name=prometheus
Delete the ServiceMonitor:
kubectl delete servicemonitor inference-router-sm -n monitoring
Deleting the PVCs permanently removes all stored metrics data. Ensure you have exported any important metrics before proceeding.
If you’re removing the entire monitoring stack including OpenSearch and Fluent Bit, you can delete the entire namespace: kubectl delete namespace monitoring. This removes all resources but is irreversible.