redis_cache

Enable with

redis_cache:github.com/dragoangel/coredns-redis-cache-plugin

redis_cache - shared L2 DNS cache backed by a Redis-compatible key-value store.

Description

redis_cache stores DNS responses in a shared Redis-compatible backend (Redis, Valkey, or any RESP-protocol server) so that multiple CoreDNS instances can amortize upstream lookups across the fleet — for example several pods in a Kubernetes cluster, or a fleet of node-local-dns daemons. It is intended to sit behind the built-in cache plugin, which stays as the L1 (in-process) cache; redis_cache is the L2 (networked) cache.

If the Redis backend is unreachable the plugin becomes a no-op and lookups continue to flow through the rest of the chain. Writes never block the DNS reply: they run in a fire-and-forget goroutine on a detached context. Reads are bounded by the read budget set with timeout read (default 500ms). The GET + TTL pipeline, the pool wait, and any retries all share that single budget, so a stalled Redis adds at most one read timeout to a single DNS reply before the plugin falls through. Read and write errors are surfaced via the get_errors_total and set_errors_total metrics so a broken cache is distinguishable from a cold one.
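
The idea of one budget shared by every attempt can be sketched in Python. This is an illustration of the behavior described above, not the plugin's code (the plugin is written in Go); the names read_with_budget, attempt, and clock are hypothetical, and returning None stands in for "fall through to the next plugin".

```python
import time

def read_with_budget(attempt, budget_s=0.5, retries=1, clock=time.monotonic):
    """Call attempt(remaining_seconds) until it succeeds or the shared budget is spent.

    Returns the attempt's result, or None to signal a fall-through.
    """
    deadline = clock() + budget_s          # one deadline covers every attempt
    for _ in range(retries + 1):
        remaining = deadline - clock()
        if remaining <= 0:                 # budget spent: stop retrying
            break
        try:
            return attempt(remaining)      # each attempt only gets what is left
        except (TimeoutError, ConnectionError):
            continue                       # transient failure: retry if budget remains
    return None
```

Because the deadline is computed once, retries cannot stretch the total wait beyond budget_s, which is the property the paragraph above relies on.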

Each response is cached for the duration of its record TTL, clamped into a configurable range: max(MIN_TTL, min(record_TTL, MAX_TTL)). By default MAX_TTL is 1h for positive responses and 30m for denials, and MIN_TTL is 0 (no floor) for both; change either bound with the success and denial directives.
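
As a worked example of the clamping formula, here is a minimal Python sketch. The function name clamp_ttl and its parameter names are hypothetical; the constants mirror the documented default caps, and all values are in seconds.

```python
def clamp_ttl(record_ttl: int, max_ttl: int, min_ttl: int = 0) -> int:
    """Return the cache duration in seconds: max(min_ttl, min(record_ttl, max_ttl))."""
    return max(min_ttl, min(record_ttl, max_ttl))

SUCCESS_MAX = 3600  # documented default cap for positive responses (1h)
DENIAL_MAX = 1800   # documented default cap for denials (30m)

print(clamp_ttl(86400, SUCCESS_MAX))           # 3600: a one-day TTL is capped at 1h
print(clamp_ttl(5, SUCCESS_MAX, min_ttl=60))   # 60: a 5s TTL is raised to the floor
print(clamp_ttl(300, DENIAL_MAX))              # 300: an in-range TTL passes through
```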

Syntax

redis_cache [ZONES...] {
    success MAX_TTL [MIN_TTL]
    denial MAX_TTL [MIN_TTL]
    endpoint ENDPOINT
    read_endpoint ENDPOINT [ENDPOINT...]
    key_prefix STRING
    db NUMBER
    sentinel MASTER_NAME SENTINEL_ADDR [SENTINEL_ADDR...]
    cluster SEED_ADDR [SEED_ADDR...]
    read_from latency|random|primary
    username USERNAME
    password PASSWORD
    sentinel_username USERNAME
    sentinel_password PASSWORD
    timeout {
        connect DURATION
        read DURATION
        write DURATION
    }
    pool {
        size N
        min_idle N
        max_idle N
        max_active N
        max_idle_time DURATION
        max_lifetime DURATION
        wait_timeout DURATION
    }
    retries {
        max N
        min_backoff DURATION
        max_backoff DURATION
    }
    tcp_keepalive DURATION
    tls
    tls_cert PATH
    tls_key PATH
    tls_ca PATH
    tls_verify_chain BOOL
    tls_verify_hostname BOOL
    resolver ADDRESS
}

Each sub-directive can be omitted; when present, its own arguments are required. Bare redis_cache with no block attempts to connect to 127.0.0.1:6379 with default TTL bounds, which is useful only against a sidecar Redis on localhost; production deployments must specify exactly one of endpoint, sentinel, or cluster. The chosen topology mode determines which other directives are valid; the parser errors at load time on conflicting combinations:

cluster mode rejects endpoint, read_endpoint, sentinel, and any db other than 0 (Redis Cluster only supports DB 0). Seed addresses come from cluster; the rest of the topology is discovered via CLUSTER SLOTS.
sentinel mode rejects endpoint and read_endpoint — the master and replicas are discovered via Sentinel.
Default mode (neither cluster nor sentinel): writes go to endpoint. With no read_endpoint, reads use the same client as writes. With exactly one read_endpoint, all reads go to it. With ≥2, each GET picks a replica at random. Rejects read_from and sentinel_username / sentinel_password.
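
The read routing in default mode can be summarized with a short Python sketch. It only restates the three cases listed for default mode; the function name pick_read_endpoint and the rng parameter are hypothetical and not part of the plugin.

```python
import random

def pick_read_endpoint(endpoint, read_endpoints=(), rng=random):
    """Choose the address a GET is sent to in default (non-cluster, non-sentinel) mode."""
    if not read_endpoints:
        return endpoint                      # no replicas configured: reads share the write endpoint
    if len(read_endpoints) == 1:
        return read_endpoints[0]             # single replica: every read goes there
    return rng.choice(list(read_endpoints))  # two or more: a random pick per GET

# Writes always target `endpoint`; only reads are routed by this function.
print(pick_read_endpoint("10.0.0.1:6379"))
print(pick_read_endpoint("10.0.0.1:6379", ["10.0.0.2:6379"]))
```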
ZONES (positional) — zones to cache for. Defaults to the surrounding server-block zones.
success MAX_TTL [MIN_TTL] — override TTL bounds for positive responses. MAX_TTL caps the cache duration (default 1h). MIN_TTL sets a floor (default 0) — when the upstream record TTL is shorter than this value, the cache duration is raised to this floor. Each value accepts a Go duration (30s, 1h) or a bare integer (seconds); sub-second values like 500ms are rejected.
denial MAX_TTL [MIN_TTL] — same as success but for negative responses (NXDOMAIN/NODATA). Defaults: MAX_TTL 30m, MIN_TTL 0.
endpoint — write endpoint address (default 127.0.0.1:6379). Accepts IPs or hostnames. If a port is omitted, 6379 is assumed.
read_endpoint — one or more read-only replica addresses. GETs route here, SETs go to endpoint. With ≥2 replicas, each GET picks one at random.
key_prefix STRING — namespace prefix for cache keys (default cdrc). Keys are stored as <key_prefix>:<hex>; the : separator is appended automatically. Set to "" to disable the prefix entirely (bare hex keys on a dedicated instance). A trailing : in the configured value is trimmed so key_prefix mycache and key_prefix mycache: are equivalent.
db NUMBER — Redis logical database index for the data plane. Default 0. In cluster mode only 0 is accepted (Redis Cluster supports only DB 0).
sentinel — enable Sentinel mode. MASTER_NAME (the Sentinel master group name) is mandatory and must be followed by one or more Sentinel addresses. The plugin discovers the current master and replicas via Sentinel (single quorum subscription); writes go to the master, reads pick a replica at random per GET.
cluster — enable Cluster mode. Takes one or more seed node addresses; the smart client discovers the full topology via CLUSTER SLOTS.
read_from — replica routing strategy in cluster mode. Only valid when cluster is set.
- latency (default) — pick the replica with the lowest measured RTT.
- random — pick a random replica.
- primary — read only from primaries (no replica reads).
username — ACL username for the data plane (primary, replicas, or cluster nodes). Optional.
password — AUTH password for the data plane. Optional.
sentinel_username — ACL username for the Sentinel API. Optional; only used in sentinel mode.
sentinel_password — AUTH password for the Sentinel API. Optional; only used in sentinel mode.
timeout — Redis connection and operation timeouts:
- connect — TCP dial timeout (default: 1s).
- read — per-command read timeout (default: 500ms).
- write — per-command write timeout (default: 2s).
pool — connection-pool tuning. Values are non-negative integers.
- size N — maximum sockets per client (default 10 × runtime.GOMAXPROCS(0)).
- min_idle N — minimum idle sockets to keep warm (default 0).
- max_idle N — maximum idle sockets (default 0 = unlimited).
- max_active N — hard cap on total open sockets including in-use (default 0 = unlimited).
- max_idle_time DURATION — close a connection that has been idle for this long (default 30m). Set to less than your load balancer / NAT idle drop window.
- max_lifetime DURATION — force-recycle any connection older than this regardless of activity (default 0 = no limit).
- wait_timeout DURATION — how long a query waits for a free pool connection before erroring (default 500ms).
retries — retry behavior for transient network errors:
- max N — number of retries per operation (default 1); 0 disables retries.
- min_backoff DURATION — initial backoff between retries (default 8ms, the go-redis default).
- max_backoff DURATION — cap on backoff between retries (default 512ms, the go-redis default). Constraint: min_backoff must not exceed max_backoff when both are set.
tcp_keepalive DURATION — TCP keepalive probe interval (default: Go’s built-in interval). Set it below your NAT / firewall / mesh idle-drop window so idle connections are not silently dropped.
tls — enable TLS. No args. Verifies the server cert against the OS trust store. Use tls_ca to override the trust store, tls_cert / tls_key for mTLS. Implicitly enabled by any other tls_* directive — bare tls is only needed when no other TLS knob is set. The TLS config applies to every connection the plugin opens (Sentinel API, master, replicas, cluster nodes); bundle CAs if planes use different roots.
tls_cert PATH — PEM client certificate for mTLS. Must be paired with tls_key.
tls_key PATH — PEM private key matching tls_cert.
tls_ca PATH — PEM CA file used to verify the server certificate. Replaces the OS trust store when set; use only when your server’s cert chains to a CA the OS doesn’t ship.
tls_verify_chain BOOL — verify the server certificate chains to a trusted root. Default on. Set to off to disable all server-cert verification (chain and hostname); use only for development or fully-trusted networks. Accepts on/off, true/false, yes/no, 1/0.
tls_verify_hostname BOOL — verify the server cert’s SAN/CN matches the dialed hostname. Default on. Workaround for topologies where the dialed name cannot match the cert SAN (per-pod certs, Cluster MOVED redirects, Sentinel master/replica discovery, VIP fronting); chain verification still runs. Properly-issued certs should not require this. Has no effect when tls_verify_chain is off. See the example below.
resolver ADDRESS — DNS server to use for resolving Redis endpoint hostnames instead of the system resolver. Useful in deployments where CoreDNS itself intercepts the system resolver (e.g. node-local-dns) and resolving the Redis service name through it would create a circular dependency. Set this to an upstream DNS service IP. Port defaults to 53.
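
To make the key_prefix rules from the directive list concrete, here is a small Python sketch of how a stored key could be assembled. The function name cache_key and the sample hex value are hypothetical; the sketch trims a single trailing colon, since the text does not say how repeated colons are handled.

```python
def cache_key(hex_hash: str, key_prefix: str = "cdrc") -> str:
    """Build '<key_prefix>:<hex>', or the bare hex when the prefix is empty."""
    # A trailing ':' in the configured value is trimmed (one colon in this sketch).
    prefix = key_prefix[:-1] if key_prefix.endswith(":") else key_prefix
    if prefix == "":
        return hex_hash                # key_prefix "" disables the namespace
    return f"{prefix}:{hex_hash}"      # the ':' separator is added automatically

print(cache_key("00ff"))               # cdrc:00ff
print(cache_key("00ff", "mycache:"))   # mycache:00ff
print(cache_key("00ff", ""))           # 00ff
```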

Authentication

The data plane (Redis nodes) and the Sentinel API authenticate independently — credentials across the two planes may be the same or different. In each plane the auth mode follows the standard Redis convention:

neither set → unauthenticated.
password only → legacy AUTH <password> (matches requirepass on any version, or authenticates as the default user on ACL-enabled servers).
username + password → full ACL auth (Redis 6+ for the data plane, Sentinel 6.2+ for the Sentinel API).
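
The three auth modes map onto the arguments of a Redis AUTH command, as the following Python sketch shows. The function name auth_command is hypothetical, and rejecting a username without a password is an assumption of this sketch, since that combination is not covered by the list above.

```python
def auth_command(username=None, password=None):
    """Return the AUTH command to send as a list of arguments, or None when unauthenticated."""
    if not username and not password:
        return None                          # neither set: no AUTH is sent
    if password and not username:
        return ["AUTH", password]            # legacy single-argument AUTH
    if username and password:
        return ["AUTH", username, password]  # ACL auth with username and password
    # Assumption: a username without a password is treated as a configuration error here.
    raise ValueError("username without password is not one of the documented modes")

print(auth_command())                        # None
print(auth_command(password="s3cret"))       # ['AUTH', 's3cret']
print(auth_command("dns-cache", "s3cret"))   # ['AUTH', 'dns-cache', 's3cret']
```

The same function would be applied twice in Sentinel mode, once with username / password and once with sentinel_username / sentinel_password, because the two planes authenticate independently.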

Cache key isolation

The cache key is xxhash64(qclass || qtype || DO || CD || lowercase(qname)), namespaced by key_prefix. All five components are mixed into the hash and re-verified after each GET — a mismatch is treated as a miss, self-healed via async eviction, and reported via coredns_redis_cache_collisions_total.

Practical guarantees this gives operators running mixed-client traffic:

CHAOS-class lookups (e.g. version.bind.) never share a slot with normal IN (Internet) class queries for the same qname.
DNSSEC-aware (DO=1) and non-DNSSEC clients keep separate entries — neither receives the other’s response with extra or missing RRSIG / NSEC records.
DNSSEC-validating (CD=0) and validation-bypassing (CD=1) queries are isolated. A CD=1 query for a DNSSEC-bogus name cannot poison the cache against a CD=0 client that would have received SERVFAIL from a validating upstream.
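
The isolation guarantees follow from hashing all five components together. The Python sketch below illustrates that idea under loudly stated assumptions: hashlib.blake2b with an 8-byte digest is only a standard-library stand-in for xxhash64, and the byte layout (2-byte class, 2-byte type, one byte each for DO and CD, then the lowercased name) is hypothetical, not the plugin's actual encoding.

```python
import hashlib

def cache_key_hash(qname: str, qtype: int, qclass: int, do: bool, cd: bool) -> str:
    """Hash qclass || qtype || DO || CD || lowercase(qname) into 16 hex characters."""
    material = (
        qclass.to_bytes(2, "big")        # hypothetical layout: 2-byte class
        + qtype.to_bytes(2, "big")       # 2-byte type
        + bytes([int(do), int(cd)])      # one byte per flag
        + qname.lower().encode()         # case-insensitive name
    )
    # Stand-in for xxhash64: any 64-bit hash demonstrates the same separation.
    return hashlib.blake2b(material, digest_size=8).hexdigest()

plain = cache_key_hash("Example.COM.", qtype=1, qclass=1, do=False, cd=False)
dnssec = cache_key_hash("example.com.", qtype=1, qclass=1, do=True, cd=False)
print(plain != dnssec)   # True: DO=1 and DO=0 clients get separate entries
```

Changing any single component (class, type, DO, CD) produces a different key, while the letter case of the name does not.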

Known Compatibility

The plugin speaks only standard RESP commands (AUTH, GET, SET … EX, TTL, EXPIRE, PING, plus CLUSTER SLOTS in cluster mode and SENTINEL get-master-addr-by-name in Sentinel mode), so it is expected to work with any reasonably complete Redis-protocol implementation.

Metrics

If monitoring is enabled (via the prometheus directive) the following metrics are exported:

coredns_redis_cache_hits_total{server} - The count of cache hits from Redis.
coredns_redis_cache_request_duration_seconds{server} - Histogram of the time (in seconds) each cache lookup took. The _count series is the total number of cache requests; derive misses by subtracting coredns_redis_cache_hits_total from it.
coredns_redis_cache_get_errors_total{server,reason} - The count of errors when reading entries from Redis. See Error reasons below for the reason buckets.
coredns_redis_cache_set_errors_total{server,reason} - The count of errors when adding entries to Redis. Same reason buckets as get_errors_total.
coredns_redis_cache_encode_errors_total{server} - The count of DNS messages that could not be serialized to wire format and so were not cached.
coredns_redis_cache_response_mismatches_total{server} - The count of upstream replies whose question did not match the original request and were therefore refused for caching (the reply itself is still passed to the client). Non-zero suggests a misbehaving forwarder upstream or an attempted cache-poisoning probe.
coredns_redis_cache_collisions_total{server} - The count of cache hits whose stored entry did not match the request (qname/qtype/qclass/DO/CD all re-verified after GET; mismatched entries are treated as a miss and asynchronously evicted). Should be zero in normal operation. The only innocent trigger is a statistical xxhash64 collision, which is ≈2⁻⁶⁴ per pair and effectively never fires at any plausible cache size. A non-zero value therefore points to a bug to investigate — Redis returning the wrong key’s value, in-process mutation of cached bytes, or a coding error in this plugin — rather than something to ignore.

The server label indicates which server handled the request; see the metrics plugin for details.
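
Since there is no dedicated miss counter, misses come from simple arithmetic on the request count and the hit counter. A minimal Python sketch follows; the function names and the sample values are hypothetical, and the hit ratio is an extra convenience not defined by the plugin.

```python
def cache_misses(request_count: float, hits_total: float) -> float:
    """Misses = total cache requests (the histogram's _count series) minus hits."""
    return request_count - hits_total

def hit_ratio(request_count: float, hits_total: float) -> float:
    """Fraction of requests served from Redis; 0.0 when there were no requests."""
    return hits_total / request_count if request_count else 0.0

# Hypothetical scrape: 1000 lookups, 750 of them answered from Redis.
print(cache_misses(1000, 750))   # 250
print(hit_ratio(1000, 750))      # 0.75
```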

Error reasons

get_errors_total and set_errors_total are bucketed by reason:

timeout - context deadline / cancellation, a network timeout, or a connection-pool wait timeout. Look at Redis latency / CPU, pool sizing, and the configured timeout read / pool wait_timeout budgets.
connection - non-timeout network failures: dial refused, connection reset, EOF mid-op. Look at connectivity (DNS, firewall, route), and whether Redis is up and accepting connections.
other - RESP-level errors (NOAUTH, WRONGPASS, parse failures, unhandled MOVED, etc.) or anything that isn’t a network error. Typically a configuration or code issue rather than a transient outage.
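
The three buckets can be pictured as a small classifier. The sketch below uses Python's built-in exception hierarchy as an analogy; the plugin itself is written in Go and inspects Go errors, so the function name classify_error and the exception-to-bucket mapping are illustrative assumptions only.

```python
def classify_error(exc: BaseException) -> str:
    """Map an error onto the reason label used by get_errors_total / set_errors_total."""
    if isinstance(exc, TimeoutError):
        return "timeout"       # deadline exceeded, network timeout, pool wait timeout
    if isinstance(exc, (ConnectionError, EOFError)):
        return "connection"    # dial refused, connection reset, EOF mid-operation
    return "other"             # protocol-level replies such as NOAUTH or WRONGPASS

print(classify_error(TimeoutError()))            # timeout
print(classify_error(ConnectionRefusedError()))  # connection
print(classify_error(ValueError("NOAUTH")))      # other
```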

Examples

Most examples after the first show only the redis_cache { ... } block; wrap it in the same . { cache {...} … forward . … } shape as the first example. They also omit success / denial; reuse the values from the first example or rely on the defaults documented in the directive list.

Local L1 plus a shared Redis L2:

. {
    cache {
        success 9984 30
        denial 9984 5
    }
    redis_cache {
        endpoint redis.cache.svc.cluster.local:6379
        success 1h 1m
        denial 30m 30s
    }
    forward . 8.8.8.8:53
}

Writes to a known master, reads random-balanced across explicit replicas:

redis_cache {
    endpoint 10.0.0.1:6379
    read_endpoint 10.0.0.2:6379 10.0.0.3:6379
    password secretPass
}

Sentinel with separate data-plane and Sentinel-API passwords:

redis_cache {
    sentinel mymaster 10.0.0.1:26379 10.0.0.2:26379 10.0.0.3:26379
    password masterReplicaPass
    sentinel_password sentinelPass
}

Redis 6+ ACL (username + password):

redis_cache {
    endpoint redis.cache.svc.cluster.local:6379
    username dns-cache
    password s3cret
}

Cluster mode for capacity scaling beyond a single node’s RAM:

redis_cache {
    cluster valkey-cluster-0:6379 valkey-cluster-1:6379 valkey-cluster-2:6379
    password secretPass
    read_from latency
}

Kubernetes note: the smart client connects directly to every primary and replica the seeds advertise via CLUSTER SLOTS. If nodes advertise pod IPs (chart default), ensure they’re routable from CoreDNS pods, or set cluster-announce-hostname on each node so the announced addresses match what resolver resolves.

TLS — server-only, OS trust store, no client cert:

redis_cache {
    endpoint redis.example.com:6380
    tls
    password s3cret
}

TLS — server-only, internal CA:

redis_cache {
    endpoint redis.example.com:6380
    tls_ca /etc/ssl/certs/redis-ca.pem
    password s3cret
}

TLS — mTLS:

redis_cache {
    endpoint redis.cache.svc.cluster.local:6379
    username dns-cache
    password s3cret
    tls_cert /etc/redis/tls/client.crt
    tls_key  /etc/redis/tls/client.key
    tls_ca   /etc/redis/tls/ca.pem
}

TLS — Kubernetes Redis Cluster with per-pod certs. Workaround for setups where issuing certs whose SAN matches the dialed name is not practical: a StatefulSet-deployed Redis/Valkey cluster typically presents per-pod certs (SAN = <pod>.<headless-svc>.<ns>.svc.cluster.local), the client dials a service name, and Cluster MOVED redirects further route to peers whose SANs won’t match anything pre-declared. Chain verification still applies to every peer:

redis_cache {
    cluster redis-cluster-0.redis-cluster-headless.cache.svc.cluster.local:6379 \
            redis-cluster-1.redis-cluster-headless.cache.svc.cluster.local:6379 \
            redis-cluster-2.redis-cluster-headless.cache.svc.cluster.local:6379
    tls_ca              /etc/redis/tls/ca.pem
    tls_verify_hostname off
    password            s3cret
}

Same workaround applies to Sentinel-discovered masters/replicas and HA-proxy/VIP fronting a fleet of per-pod certs. Prefer issuing certs whose SAN covers the dialed name where you control the PKI.

Kubernetes node-local-dns. When CoreDNS itself intercepts the cluster DNS VIP, resolving the Redis service name through it would loop. Use resolver to point at the upstream kube-dns; __PILLAR__CLUSTER__DNS__ is substituted by node-local-dns at runtime:

.:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
    }
    redis_cache {
        endpoint k8s-dns-cache-redis-master.k8s-dns-cache.svc.cluster.local:6379
        read_endpoint k8s-dns-cache-redis-replicas.k8s-dns-cache.svc.cluster.local:6379
        password secretPass
        success 1h 1m
        denial 30m 30s
        resolver __PILLAR__CLUSTER__DNS__
    }
    forward . __PILLAR__UPSTREAM__SERVERS__
}

Building

Add this line to CoreDNS’s plugin.cfg. It must appear after the cache:cache line so the in-process cache runs as L1 and redis_cache as L2:

cache:cache
redis_cache:github.com/dragoangel/coredns-redis-cache-plugin

Then run go get github.com/dragoangel/coredns-redis-cache-plugin@latest, followed by go generate coredns.go && go build, in the CoreDNS source tree.