Monitoring BIND DNS with Prometheus and Grafana
A well-monitored DNS infrastructure provides visibility into query patterns, performance bottlenecks, and potential security issues. This final post in the BIND series covers setting up comprehensive monitoring using Prometheus and Grafana, giving you dashboards and alerts for your DNS servers.
BIND Statistics Overview
BIND provides extensive statistics through several channels:
- Statistics channel - XML/JSON HTTP endpoint
- Query logging - Detailed query/response logs
- RNDC statistics - Command-line statistics dump
For Prometheus integration, we'll use the bind_exporter to scrape the statistics channel.
Enabling BIND Statistics Channel
Configure Statistics in BIND
// /etc/bind/named.conf.options
options {
directory "/var/cache/bind";
// Enable statistics channel
statistics-channels {
inet 127.0.0.1 port 8053 allow { localhost; };
inet ::1 port 8053 allow { localhost; };
};
// Enable zone statistics
zone-statistics full;
// Other options...
recursion yes;
dnssec-validation auto;
};
Verify Statistics Channel
# Restart BIND
systemctl restart named
# Test statistics endpoint (XML)
curl http://127.0.0.1:8053/
# Get XML statistics
curl http://127.0.0.1:8053/xml/v3/server
# Get JSON statistics (BIND 9.10+)
curl http://127.0.0.1:8053/json/v1/server
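The JSON endpoint is the easiest to consume from scripts. A minimal Python sketch, using a trimmed, made-up payload in place of a live curl (the field names follow BIND's JSON statistics schema, but a real response contains far more counters):

```python
import json

# Trimmed, made-up sample of what /json/v1/server returns; field
# names follow BIND's JSON statistics schema, but a live server
# emits far more counters.
sample = json.loads("""
{
  "json-stats-version": "1.2",
  "qtypes": {"A": 1200, "AAAA": 340, "MX": 12},
  "rcodes": {"NOERROR": 1400, "NXDOMAIN": 130}
}
""")

# Query counts by record type, largest first
for qtype, count in sorted(sample["qtypes"].items(), key=lambda kv: -kv[1]):
    print(f"{qtype:6} {count}")
```

In practice you would pipe `curl -s http://127.0.0.1:8053/json/v1/server` into a script like this, or simply into `jq`.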
Installing bind_exporter
The bind_exporter converts BIND statistics to Prometheus metrics format.
Download and Install
# Download latest release
wget https://github.com/prometheus-community/bind_exporter/releases/download/v0.7.0/bind_exporter-0.7.0.linux-amd64.tar.gz
# Extract
tar xzf bind_exporter-0.7.0.linux-amd64.tar.gz
mv bind_exporter-0.7.0.linux-amd64/bind_exporter /usr/local/bin/
# Set permissions
chmod +x /usr/local/bin/bind_exporter
Create Systemd Service
# /etc/systemd/system/bind_exporter.service
[Unit]
Description=BIND Exporter for Prometheus
After=network.target named.service
Wants=named.service
[Service]
Type=simple
User=nobody
Group=nogroup
ExecStart=/usr/local/bin/bind_exporter \
--bind.stats-url=http://127.0.0.1:8053/ \
--bind.stats-groups=server,view,tasks \
--web.listen-address=:9119
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
# Reload systemd and start the exporter
systemctl enable bind_exporter
systemctl start bind_exporter
# Test metrics endpoint
curl http://localhost:9119/metrics
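The exporter emits plain Prometheus exposition format, which is handy to parse for quick spot checks. A small Python sketch over illustrative sample lines (the metric values are made up):

```python
import re

# Illustrative sample of the exposition-format lines that
# curl http://localhost:9119/metrics returns (values made up).
metrics_text = """\
bind_incoming_queries_total{type="A"} 152340
bind_incoming_queries_total{type="AAAA"} 48211
bind_up 1
"""

# metric_name{optional,labels} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})? (?P<value>\S+)$'
)

parsed = {}
for line in metrics_text.splitlines():
    m = LINE_RE.match(line)
    if m:
        parsed[(m['name'], m['labels'] or '')] = float(m['value'])

print(parsed[('bind_incoming_queries_total', 'type="A"')])  # 152340.0
```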
Available Statistics Groups
The exporter supports these statistics groups:
| Group | Description |
|---|---|
| server | Overall server statistics |
| view | Per-view statistics |
| tasks | Task manager statistics |
Prometheus Configuration
Add Scrape Target
# /etc/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'bind'
static_configs:
- targets:
- 'dns1.example.com:9119'
- 'dns2.example.com:9119'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+).*'
replacement: '${1}'
# Note: bind_exporter scrapes a single BIND instance per process and
# has no multi-target /probe endpoint, so run one exporter alongside
# each BIND server rather than pointing several targets at one exporter.
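The relabel rule on the bind job only strips the scrape port, so the `instance` label reads as a bare hostname. The same transformation sketched in Python, for illustration:

```python
import re

# The relabel rule's regex '([^:]+).*' with replacement '${1}' keeps
# everything before the first colon, stripping the scrape port.
def relabel_instance(address: str) -> str:
    return re.sub(r'^([^:]+).*$', r'\1', address)

print(relabel_instance('dns1.example.com:9119'))  # dns1.example.com
```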
Reload Prometheus
kill -HUP $(pidof prometheus)
# Or
systemctl reload prometheus
Key Metrics to Monitor
Query Metrics
# Total queries received
bind_incoming_queries_total
# Queries by type (A, AAAA, MX, etc.)
bind_incoming_queries_total{type="A"}
# Query rate (queries per second)
rate(bind_incoming_queries_total[5m])
# Queries by result code
bind_responses_total{result="NOERROR"}
bind_responses_total{result="NXDOMAIN"}
bind_responses_total{result="SERVFAIL"}
Resolver Metrics
# Cache hit rate
bind_resolver_cache_hits_total /
(bind_resolver_cache_hits_total + bind_resolver_cache_misses_total)
# Recursive queries sent
rate(bind_resolver_queries_total[5m])
# Response time distribution
bind_resolver_response_time_seconds_bucket
Server Health Metrics
# Current number of recursive clients
bind_recursive_clients
# TCP connections
bind_tcp_connections_total
# Zone transfers
bind_zone_transfer_success_total
bind_zone_transfer_failure_total
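The `rate()` and ratio expressions above reduce to simple arithmetic on counter samples. A Python sketch with hypothetical sample values:

```python
# Sketch of what the PromQL above computes, using hypothetical
# counter samples taken 300 seconds apart.

def per_second_rate(prev: float, curr: float, interval_s: float) -> float:
    """rate(): per-second increase of a monotonic counter."""
    return (curr - prev) / interval_s

def cache_hit_ratio(hits: float, misses: float) -> float:
    """Cache hit rate: hits / (hits + misses)."""
    return hits / (hits + misses)

print(per_second_rate(100_000, 130_000, 300))   # 100.0 queries/sec
print(round(cache_hit_ratio(8_500, 1_500), 2))  # 0.85
```

(Real `rate()` also handles counter resets, which this sketch ignores.)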
Grafana Dashboard
Import Pre-built Dashboard
The Grafana community provides pre-built dashboards for bind_exporter:
- In Grafana, go to Dashboards -> Import
- Enter dashboard ID: 12309 (BIND Exporter)
- Select your Prometheus data source
- Click Import
Custom Dashboard Panels
Query Rate Panel
{
"title": "DNS Query Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(bind_incoming_queries_total[5m])) by (instance)",
"legendFormat": "{{instance}}"
}
],
"yAxes": [
{
"label": "Queries/sec",
"format": "short"
}
]
}
Query Type Distribution
{
"title": "Query Types",
"type": "piechart",
"targets": [
{
"expr": "sum(increase(bind_incoming_queries_total[1h])) by (type)",
"legendFormat": "{{type}}"
}
]
}
Response Codes
{
"title": "Response Codes",
"type": "stat",
"targets": [
{
"expr": "sum(rate(bind_responses_total{result=\"NOERROR\"}[5m]))",
"legendFormat": "NOERROR"
},
{
"expr": "sum(rate(bind_responses_total{result=\"NXDOMAIN\"}[5m]))",
"legendFormat": "NXDOMAIN"
},
{
"expr": "sum(rate(bind_responses_total{result=\"SERVFAIL\"}[5m]))",
"legendFormat": "SERVFAIL"
}
]
}
Cache Hit Rate
{
"title": "Cache Hit Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(bind_resolver_cache_hits_total) / (sum(bind_resolver_cache_hits_total) + sum(bind_resolver_cache_misses_total)) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"value": 0, "color": "red"},
{"value": 70, "color": "yellow"},
{"value": 90, "color": "green"}
]
}
}
}
}
Alerting Rules
Prometheus Alert Rules
# /etc/prometheus/rules/bind.yml
groups:
- name: bind
rules:
# High query rate
- alert: BindHighQueryRate
expr: sum(rate(bind_incoming_queries_total[5m])) by (instance) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High DNS query rate on {{ $labels.instance }}"
description: "Query rate is {{ $value }} queries/sec"
# SERVFAIL rate too high
- alert: BindHighServfailRate
expr: |
sum(rate(bind_responses_total{result="SERVFAIL"}[5m])) by (instance) /
sum(rate(bind_responses_total[5m])) by (instance) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High SERVFAIL rate on {{ $labels.instance }}"
description: "SERVFAIL rate is {{ $value | humanizePercentage }}"
# Low cache hit rate
- alert: BindLowCacheHitRate
expr: |
sum(bind_resolver_cache_hits_total) by (instance) /
(sum(bind_resolver_cache_hits_total) by (instance) +
sum(bind_resolver_cache_misses_total) by (instance)) < 0.7
for: 15m
labels:
severity: warning
annotations:
summary: "Low cache hit rate on {{ $labels.instance }}"
description: "Cache hit rate is {{ $value | humanizePercentage }}"
# Recursive clients exhausted
- alert: BindRecursiveClientsHigh
expr: bind_recursive_clients > 900
for: 2m
labels:
severity: critical
annotations:
summary: "High recursive clients on {{ $labels.instance }}"
description: "Recursive clients: {{ $value }} (default max: 1000)"
# Zone transfer failures
- alert: BindZoneTransferFailure
expr: increase(bind_zone_transfer_failure_total[1h]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Zone transfer failure on {{ $labels.instance }}"
description: "Zone transfers have failed in the last hour"
# Exporter down
- alert: BindExporterDown
expr: up{job="bind"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "BIND exporter down on {{ $labels.instance }}"
Query Logging for Analysis
For deeper analysis, such as identifying top queried domains or per-client patterns, enable query logging and parse the resulting logs.
Enable Query Logging
// /etc/bind/named.conf
logging {
channel query_log {
file "/var/log/named/queries.log" versions 10 size 100m;
severity info;
print-time yes;
print-category yes;
print-severity yes;
};
channel security_log {
file "/var/log/named/security.log" versions 5 size 50m;
severity info;
print-time yes;
print-category yes;
};
category queries { query_log; };
category security { security_log; };
category query-errors { query_log; };
};
Parse Logs with Promtail/Loki
# /etc/promtail/promtail.yml
server:
http_listen_port: 9080
positions:
filename: /var/lib/promtail/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: bind_queries
static_configs:
- targets:
- localhost
labels:
job: bind_queries
__path__: /var/log/named/queries.log
pipeline_stages:
- regex:
expression: '^(?P<timestamp>\d+-\w+-\d+ \d+:\d+:\d+.\d+) queries: info: client @\S+ (?P<client_ip>[\d.]+)#\d+ \((?P<query>[^)]+)\): query: (?P<domain>\S+) IN (?P<type>\w+)'
- labels:
client_ip:
query:
domain:
type:
- timestamp:
source: timestamp
format: "02-Jan-2006 15:04:05.000"
- job_name: bind_security
static_configs:
- targets:
- localhost
labels:
job: bind_security
__path__: /var/log/named/security.log
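The Promtail regex can be sanity-checked offline before shipping logs: Python uses the same `(?P<name>...)` named-group syntax. The log line below is a made-up example in BIND's default query-log format:

```python
import re

# Same pattern as the Promtail pipeline stage (with the timestamp
# dot escaped), using identical named-group syntax.
PATTERN = re.compile(
    r'^(?P<timestamp>\d+-\w+-\d+ \d+:\d+:\d+\.\d+) queries: info: '
    r'client @\S+ (?P<client_ip>[\d.]+)#\d+ \((?P<query>[^)]+)\): '
    r'query: (?P<domain>\S+) IN (?P<type>\w+)'
)

# Made-up sample line in BIND's default query-log format
line = ('19-Feb-2024 10:15:23.123 queries: info: client @0x7f4b8c012345 '
        '192.168.1.50#53214 (example.com): query: example.com IN A +E(0)K')

m = PATTERN.match(line)
print(m['client_ip'], m['domain'], m['type'])  # 192.168.1.50 example.com A
```

Note that labeling high-cardinality fields like `client_ip` and `domain` can strain Loki; consider keeping only `type` as a label and filtering the rest at query time.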
Complete Monitoring Stack with Docker Compose
Deploy a complete monitoring stack:
# docker-compose.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
volumes:
- ./prometheus:/etc/prometheus
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.enable-lifecycle'
ports:
- "9090:9090"
networks:
- monitoring
grafana:
image: grafana/grafana:latest
container_name: grafana
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- "3000:3000"
networks:
- monitoring
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
volumes:
- ./alertmanager:/etc/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
ports:
- "9093:9093"
networks:
- monitoring
bind_exporter:
image: prometheuscommunity/bind-exporter:latest
container_name: bind_exporter
# NOTE: the statistics channel configured earlier listens on 127.0.0.1,
# which this container cannot reach via host.docker.internal; also bind
# the channel to an address reachable from the Docker network.
command:
- '--bind.stats-url=http://host.docker.internal:8053/'
- '--bind.stats-groups=server,view,tasks'
ports:
- "9119:9119"
networks:
- monitoring
extra_hosts:
- "host.docker.internal:host-gateway"
volumes:
prometheus_data:
grafana_data:
networks:
monitoring:
driver: bridge
Security Monitoring
Detecting DNS Attacks
Create alerts for potential security issues:
# Security-focused alerts
groups:
- name: bind_security
rules:
# Possible DNS amplification attack
- alert: BindPossibleAmplificationAttack
expr: |
sum(rate(bind_incoming_queries_total{type="ANY"}[5m])) by (instance) > 100
for: 2m
labels:
severity: critical
annotations:
summary: "Possible DNS amplification attack on {{ $labels.instance }}"
description: "High rate of ANY queries detected"
# Unusual NXDOMAIN rate (possible enumeration)
- alert: BindHighNxdomainRate
expr: |
sum(rate(bind_responses_total{result="NXDOMAIN"}[5m])) by (instance) /
sum(rate(bind_responses_total[5m])) by (instance) > 0.3
for: 10m
labels:
severity: warning
annotations:
summary: "High NXDOMAIN rate on {{ $labels.instance }}"
description: "NXDOMAIN rate is {{ $value | humanizePercentage }}"
# Query flood detection
- alert: BindQueryFlood
expr: sum(rate(bind_incoming_queries_total[1m])) by (instance) > 5000
for: 1m
labels:
severity: critical
annotations:
summary: "DNS query flood on {{ $labels.instance }}"
description: "Query rate: {{ $value }} q/s"
Useful PromQL Queries
Performance Analysis
# Top queried domains (requires log parsing)
topk(10, sum by (domain) (increase(bind_queries_by_domain[1h])))
# Average query latency
histogram_quantile(0.95, sum(rate(bind_resolver_response_time_seconds_bucket[5m])) by (le))
# Queries by protocol (UDP vs TCP)
sum(rate(bind_incoming_queries_total[5m])) by (protocol)
# DNSSEC validation failures
increase(bind_dnssec_validation_failures_total[1h])
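The `histogram_quantile` query above estimates a latency percentile by interpolating over cumulative bucket counts. A rough Python sketch of that calculation, with hypothetical bucket bounds and counts:

```python
# Rough sketch of what histogram_quantile() computes: linear
# interpolation over cumulative bucket counts. Bucket bounds and
# counts here are hypothetical.

def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs,
    the last entry being the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float('inf'):
                return lower_bound  # quantile falls in the +Inf bucket
            # Interpolate linearly within this bucket
            return lower_bound + (upper_bound - lower_bound) * \
                (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound

buckets = [(0.01, 500), (0.1, 900), (0.5, 990), (float('inf'), 1000)]
print(f"p95 latency: {histogram_quantile(0.95, buckets):.3f}s")  # p95 latency: 0.322s
```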
Capacity Planning
# Peak query rate (last 7 days)
max_over_time(sum(rate(bind_incoming_queries_total[5m]))[7d:5m])
# Average queries per day
sum(increase(bind_incoming_queries_total[24h]))
# Cache size growth (use deriv() rather than rate(), since this is a gauge)
deriv(bind_resolver_cache_size_bytes[1h])
Best Practices
- Monitor all DNS servers - Deploy bind_exporter on every BIND instance
- Set appropriate scrape intervals - 15-30 seconds is usually sufficient
- Retain metrics appropriately - Keep high-resolution data for 15 days, aggregated data longer
- Alert on trends, not spikes - Use `for` durations to avoid alert fatigue
- Correlate with other metrics - Link DNS performance to application metrics
- Document your dashboards - Add descriptions to panels explaining what they show
- Test your alerts - Periodically verify alerts fire correctly
Conclusion
Monitoring BIND with Prometheus and Grafana provides comprehensive visibility into your DNS infrastructure. The bind_exporter makes it easy to collect metrics, while Prometheus alerting helps you respond to issues before they impact users.
This completes our BIND DNS series covering:
- Resolver configuration
- Authoritative server setup
- RNDC and remote management
- Zone security and ACLs
- Dynamic DNS updates
- External-DNS integration
- Response Policy Zones
- DNSSEC validation
- DNSSEC zone signing
- DNS over TLS
- DNS over HTTPS
- Prometheus monitoring
With these building blocks, you can build a robust, secure, and observable DNS infrastructure for any environment - from homelab to enterprise.