Production configurations
The NativeLink cluster shape that survives a real team — sharded CAS, autoscaling workers, mTLS, observability.
What changes between a basic single-node setup and a cluster you'd trust with your team's working day:
- The CAS is durable and shareable.
- Workers are stateless and replaceable.
- The control plane is HA.
- Every hop is encrypted and authenticated.
- The whole thing emits metrics you can dashboard.
This page is a tour of those concerns. Pick what you need and skip the parts that don't apply (e.g. multi-region if you only have one region).
CAS: sharded S3 + Redis
The recommended shape for any multi-node deployment. S3 (or compatible) is the durable backing store; Redis sits in front as a hot-key cache.
stores: [
  // Redis tier: fast, ephemeral, hot keys.
  {
    name: "CAS_REDIS",
    redis_store: {
      addresses: ["redis://redis.svc.cluster.local:6379"],
      key_prefix: "cas:",
      response_timeout_ms: 200,
    },
  },
  // S3 tier: durable, slower, cold storage.
  {
    name: "CAS_S3",
    experimental_s3_store: {
      region: "us-east-1",
      bucket: "nativelink-prod-cas",
      key_prefix: "cas/",
      retry: { max_retries: 5, delay: 0.1, jitter: 0.5 },
    },
  },
  // Two-tier composite: read fast, fall through to S3.
  {
    name: "CAS_TIERED",
    fast_slow: {
      fast: { ref_store: { name: "CAS_REDIS" } },
      slow: { ref_store: { name: "CAS_S3" } },
    },
  },
  // Shard the tiered store. Now we're production.
  {
    name: "CAS_MAIN_STORE",
    shard: {
      stores: [
        // 16 evenly-weighted shards. Pick a power of 2.
        { store: { ref_store: { name: "CAS_TIERED" } }, weight: 1 },
        // ... 15 more entries, each normally pointing at its own backing
        // store, or generate the list at deploy time ...
      ],
    },
  },
],
The shard layer hashes by blob digest, so reads and writes for the same blob land on the same shard no matter which client issues them. The fast/slow layer serves hot blobs from Redis and falls through to S3 on a miss. Within a cluster network the Redis path is typically sub-millisecond, but treat that as a target to measure rather than a guarantee.
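The config above suggests generating the shard list at deploy time rather than writing 16 entries by hand. A minimal Python sketch of that step follows. The backend names `CAS_TIERED_00` to `CAS_TIERED_15` are hypothetical placeholders for stores you would define yourself. The output is plain JSON, which is also valid JSON5.

```python
import json


def build_shard_store(name, backend_names, weight=1):
    """Return a shard store entry that references each backend by name."""
    if not backend_names:
        raise ValueError("at least one backend store is required")
    return {
        "name": name,
        "shard": {
            "stores": [
                {"store": {"ref_store": {"name": backend}}, "weight": weight}
                for backend in backend_names
            ]
        },
    }


# Hypothetical backend names; each must match a store defined elsewhere
# in the same config.
backends = [f"CAS_TIERED_{i:02d}" for i in range(16)]
shard_store = build_shard_store("CAS_MAIN_STORE", backends)

# Paste (or template) this entry into the `stores` list.
print(json.dumps(shard_store, indent=2))
```

Keeping the weights equal matches the "evenly-weighted" layout described above. Changing the number of shards later changes which shard a digest maps to, so it is worth settling on a count up front.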
Redis sizing
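Aim for Redis to hold the last 24-48 hours of artifacts. For a team doing 1 TB of build traffic a day, that can be as little as ~20-40 GiB of Redis, because most of that traffic is repeat transfers of the same blobs and only the unique bytes need to be cached. Measure your own unique-byte rate before sizing. Anything older falls through to S3.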
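The arithmetic behind a sizing estimate like this can be sketched in a few lines of Python. The `unique_fraction` of 2% is an assumption chosen to illustrate the calculation, not a measured NativeLink figure. Replace it with the share of daily traffic that is new, unique bytes in your own builds.

```python
GIB = 1024 ** 3


def redis_working_set_gib(daily_traffic_bytes, unique_fraction, retention_hours):
    """Estimate Redis capacity needed to hold `retention_hours` of unique blobs."""
    if not 0 <= unique_fraction <= 1:
        raise ValueError("unique_fraction must be between 0 and 1")
    unique_bytes_per_day = daily_traffic_bytes * unique_fraction
    return unique_bytes_per_day * (retention_hours / 24) / GIB


daily_traffic = 1e12  # 1 TB of build traffic per day
unique_fraction = 0.02  # ASSUMPTION: 2% of transferred bytes are new blobs

low = redis_working_set_gib(daily_traffic, unique_fraction, 24)
high = redis_working_set_gib(daily_traffic, unique_fraction, 48)
print(f"24h working set: {low:.1f} GiB, 48h working set: {high:.1f} GiB")
```

Leave headroom above the estimate for Redis's own overhead and for traffic spikes.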
Action Cache: durable but disposable
The AC is small (typically <1% of CAS by bytes), but losing it means re-running every build until it warms back up. Run two replicas in front of a durable store:
{
  name: "AC_MAIN_STORE",
  experimental_s3_store: {
    region: "us-east-1",
    bucket: "nativelink-prod-ac",
    key_prefix: "ac/",
  },
},
If your AC shares an S3 bucket with your CAS, separate them by key prefix (for example ac/ and cas/). S3 LIST cost matters at scale.
Scheduler: HA with sticky workers
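Run two or three schedulers behind a load balancer so that any one of them can restart without an outage. They are not fully stateless: the in-flight action queue lives in the scheduler's memory. The AC stores only completed action results, so it cannot restore that queue. After a restart, clients typically retry their pending actions, and any action that already finished is answered from the AC instead of running again.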
schedulers: {
  MAIN_SCHEDULER: {
    simple: {
      supported_platform_properties: {
        OSFamily: "exact",
        container_image: "exact",
        cpu_count: "minimum",
        gpu: "exact",
      },
      max_job_retries: 3,
      worker_timeout_s: 300,
    },
  },
},
supported_platform_properties is how the scheduler matches actions to workers. An action requesting gpu: "nvidia-a100" will only be dispatched to workers tagged with that exact value.
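To make the "exact" and "minimum" rules concrete, here is a simplified Python model of the matching decision. It is an illustration of the semantics described above, not NativeLink's implementation. The function name `worker_matches` is hypothetical, and rejecting properties the scheduler does not list is an assumption.

```python
def worker_matches(supported, action_props, worker_props):
    """Return True if a worker satisfies every property the action requests.

    supported:    property name -> "exact" or "minimum"
    action_props: property name -> requested value (string)
    worker_props: property name -> list of values the worker advertises
    """
    for key, wanted in action_props.items():
        kind = supported.get(key)
        if kind is None:
            # ASSUMPTION: properties the scheduler does not know are rejected.
            return False
        offered = worker_props.get(key, [])
        if kind == "exact":
            if wanted not in offered:
                return False
        elif kind == "minimum":
            # The worker must advertise at least the requested amount.
            if not any(int(value) >= int(wanted) for value in offered):
                return False
        else:
            raise ValueError(f"unsupported match kind: {kind}")
    return True


supported = {
    "OSFamily": "exact",
    "container_image": "exact",
    "cpu_count": "minimum",
    "gpu": "exact",
}
worker = {"OSFamily": ["linux"], "cpu_count": ["8"], "gpu": ["nvidia-a100"]}

print(worker_matches(supported, {"OSFamily": "linux", "cpu_count": "4"}, worker))
print(worker_matches(supported, {"gpu": "nvidia-h100"}, worker))
```

The first call matches because the worker advertises 8 CPUs against a request for 4. The second does not, because "exact" properties must be equal.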
Workers: autoscaling, stateless
Production workers run on a separate fleet from the control plane. Each worker process advertises its platform properties; the scheduler sends matching actions.
workers: [{
  local: {
    worker_api_endpoint: { uri: "grpc://scheduler.svc.cluster.local:50051" },
    // The field name implies a fast_slow store; check that your NativeLink
    // version accepts the store you reference here.
    cas_fast_slow_store: "CAS_MAIN_STORE",
    upload_action_result: { ac_store: "AC_MAIN_STORE" },
    max_action_timeout: 3600,
    platform_properties: {
      OSFamily: { values: ["linux"] },
      container_image: { query_cmd: "cat /etc/nativelink/image-tag" },
      cpu_count: { values: ["8"] },
    },
  },
}],
Drive worker count from queue depth. Common autoscalers can scale on NativeLink's Prometheus output once the metric is exposed to them: Kubernetes HPA through a custom or external metrics adapter, AWS Auto Scaling through a CloudWatch alarm on an exported metric. The Kubernetes deployment guide has a working HPA config.
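The scaling decision itself is simple arithmetic. The Python sketch below shows one way to turn queue depth into a worker count, using the same ceiling-division shape as an average-value autoscaling target. The function name, the target of 4 queued actions per worker, and the 2 to 50 bounds are illustrative assumptions, not NativeLink defaults.

```python
import math


def desired_workers(queue_depth, actions_per_worker, min_workers, max_workers):
    """Workers needed to keep queued actions per worker at or below the target."""
    if actions_per_worker <= 0:
        raise ValueError("actions_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    wanted = math.ceil(queue_depth / actions_per_worker)
    # Clamp so the fleet never scales to zero or past the budget.
    return max(min_workers, min(max_workers, wanted))


# ASSUMED targets: 4 queued actions per worker, fleet bounded to 2..50.
for depth in (0, 37, 1000):
    print(depth, desired_workers(depth, 4, 2, 50))
```

A non-zero floor keeps some capacity warm for the first build of the day, and the ceiling caps cost when a large change floods the queue.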
TLS and authentication
In production, every hop runs over mTLS:
servers: [{
  listener: {
    http: {
      socket_address: "0.0.0.0:50051",
      tls: {
        cert_file: "/etc/nativelink/tls/server.crt",
        key_file: "/etc/nativelink/tls/server.key",
        client_ca_file: "/etc/nativelink/tls/clients-ca.crt",
      },
    },
  },
  services: { /* ... */ },
}],
Clients present a certificate, and the server accepts it only if it chains to the CA in client_ca_file. That CA acts as the allow-list: control who can connect by controlling which certificates it issues. Worker→scheduler connections use the same mTLS pattern.
For org-wide SSO/SAML, the compliance page covers the provider list.
Observability
NativeLink emits OpenTelemetry traces and Prometheus metrics out of the box. Enable them in the config:
global: {
  metrics: { prometheus: { listen_address: "0.0.0.0:9090" } },
  tracing: {
    otlp: { endpoint: "http://otel-collector:4317" },
    sample_rate: 0.05,
  },
},
What's worth alerting on:
- nativelink_cas_request_duration_seconds_p99: sub-millisecond on a healthy cluster. Anything over 50ms is page-worthy.
- nativelink_scheduler_queue_depth: should drop quickly when workers are healthy.
- nativelink_worker_connected_count: sudden drops indicate scheduler-side issues.
The Metrics deployment guide has a Grafana dashboard JSON file you can import.
A reference production config
A complete, working production config (CAS + AC + scheduler + workers, mTLS, metrics) lives at nativelink-config/examples/production.json5. Start from there; trim what you don't need.