NativeLink
Deployment examples

Metrics & observability

Collect NativeLink OTLP metrics, scrape them with Prometheus, and dashboard them in Grafana.

NativeLink emits remote execution metrics through OpenTelemetry. Cache operation metrics are available when the store is explicitly wrapped with the opt-in cache_metrics store wrapper.

License note

NativeLink metrics are licensed under the Business Source License. Individual developer cache use does not need a commercial license. Teams using metrics in shared, production, or commercial settings can use NativeLink Cloud, Enterprise, or an intentionally very inexpensive separate license. See the license page.

Overview

NativeLink does not expose a Prometheus scrape endpoint directly. It emits OTLP metrics. To view those metrics in Prometheus, use one of these paths:

  1. NativeLink sends OTLP to an OpenTelemetry Collector, and Prometheus scrapes the Collector's Prometheus exporter endpoint.
  2. NativeLink sends OTLP/HTTP metrics directly to Prometheus with Prometheus' OTLP receiver enabled.

The metrics cover:

  • Cache performance: hit rates and operation latencies when cache_metrics is enabled.
  • Execution pipeline: queue times, stage durations, and success rates.
  • System health: worker utilization, throughput, and error rates.

Quick start

Start the local metrics stack:

git clone https://github.com/TraceMachina/nativelink
cd nativelink/deployment-examples/metrics
docker-compose up -d

Configure NativeLink to send OTLP metrics to the Collector:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev,nativelink.instance_name=main"

Then start NativeLink with your config:

nativelink /path/to/config.json

In this flow, NativeLink sends OTLP to the Collector on :4317. The Collector serves Prometheus-format metrics on its Prometheus exporter endpoint, and Prometheus scrapes that endpoint.

Services:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • OTEL Collector metrics: http://localhost:8888/metrics

Deploy the Collector and Prometheus manifests:

kubectl apply -f deployment-examples/metrics/kubernetes/otel-collector.yaml
kubectl apply -f deployment-examples/metrics/kubernetes/prometheus.yaml

Configure the NativeLink deployment to export OTLP metrics:

env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=prod,k8s.cluster.name=main"

Start Prometheus with the OTLP receiver enabled:

prometheus \
  --web.enable-otlp-receiver \
  --storage.tsdb.out-of-order-time-window=30m \
  --config.file=prometheus.yml

Then point NativeLink at Prometheus' OTLP metrics endpoint:

export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:9090/api/v1/otlp/v1/metrics
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="service.instance.id=$(uuidgen)"
export OTEL_TRACES_EXPORTER=none
export OTEL_LOGS_EXPORTER=none

Enable cache metrics

To emit nativelink_cache_* metrics, wrap the CAS or AC store you want to measure:

{
  "name": "CAS_MAIN_STORE",
  "cache_metrics": {
    "cache_type": "cas",
    "backend": {
      "filesystem": {
        "content_path": "~/.cache/nativelink/content_path-cas",
        "temp_path": "~/.cache/nativelink/tmp_path-cas",
      },
    },
  },
}

If cache_metrics is absent, NativeLink constructs the same store graph as it would without cache metrics. The disabled path does not add a wrapper, timer, attribute allocation, or OpenTelemetry recording call to cache operations.

Metrics catalog

Cache metrics

Cache metrics are opt-in. These series are emitted only for stores wrapped with cache_metrics; configuring OTEL alone does not enable them.

MetricTypeDescriptionLabels
nativelink_cache_operations_totalCounterTotal cache operationscache_type, cache_operation_name, cache_operation_result
nativelink_cache_operation_durationHistogramOperation latency in millisecondscache_type, cache_operation_name
nativelink_cache_io_totalCounterBytes read or writtencache_type, cache_operation_name
nativelink_cache_sizeGaugeCurrent cache size in bytescache_type
nativelink_cache_entriesGaugeNumber of cached entriescache_type
nativelink_cache_item_sizeHistogramSize distribution of cache entriescache_type

Operation names include read, write, delete, and evict. Operation results include hit, miss, expired, success, and error.

Execution metrics

MetricTypeDescriptionLabels
nativelink_execution_stage_durationHistogramTime spent in each execution stageexecution_stage
nativelink_execution_total_durationHistogramTotal execution time from submission to completionexecution_instance
nativelink_execution_queue_timeHistogramTime spent waiting in queueexecution_priority
nativelink_execution_active_countGaugeCurrent actions in each stageexecution_stage
nativelink_execution_completed_count_totalCounterCompleted executionsexecution_result, execution_action_digest
nativelink_execution_stage_transitions_totalCounterStage transition eventsexecution_instance, execution_priority
nativelink_execution_output_sizeHistogramSize of execution outputs-
nativelink_execution_retry_count_totalCounterNumber of retries-

Execution stages are unknown, cache_check, queued, executing, and completed.

Counter metrics are exposed with a _total suffix in Prometheus. The Docker Compose quickstart, recording rules, and included dashboards assume _total counter names.

Example queries

Cache performance

sum(rate(nativelink_cache_operations_total{cache_operation_result="hit"}[5m])) by (cache_type) /
sum(rate(nativelink_cache_operations_total{cache_operation_name="read"}[5m])) by (cache_type)
histogram_quantile(0.95,
  sum(rate(nativelink_cache_operation_duration_bucket[5m])) by (le, cache_type)
)

Execution pipeline

sum(rate(nativelink_execution_completed_count_total{execution_result="success"}[5m])) /
sum(rate(nativelink_execution_completed_count_total[5m]))
sum(nativelink_execution_active_count{execution_stage="queued"})

Grafana dashboard

A reference dashboard lives at deployment-examples/metrics/grafana/dashboards/nativelink-overview.json. Import it via Grafana's UI:

  1. Dashboards -> New -> Import.
  2. Paste the JSON.
  3. Pick your Prometheus data source.

Alerting

Start with these signals and tune thresholds for your latency budget:

AlertThresholdNotes
HighErrorRateerror rate above 5% for 5 minutesCheck worker failures and scheduler logs.
QueueBacklogqueued actions above 100 for 15 minutesAdd workers or inspect worker matching.
CacheEvictionHigheviction rate above threshold for 10 minutesIncrease storage or tune eviction policy.

Troubleshooting

No metrics appear

  1. Verify the OTEL environment:

    env | grep OTEL_
  2. Check Collector health:

    curl http://localhost:13133/health
  3. Verify the Collector is receiving metrics:

    curl http://localhost:8888/metrics | grep otelcol_receiver

Cache metrics are missing

If you see nativelink_execution_* metrics but no nativelink_cache_* metrics, wrap the store you want to measure with cache_metrics. OTEL configuration alone does not enable store-level cache operation metrics.

Out-of-order samples

Increase the Prometheus window:

storage:
  tsdb:
    out_of_order_time_window: 1h

Additional resources