Monitoring BIND DNS with Prometheus and Grafana

A well-monitored DNS infrastructure provides visibility into query patterns, performance bottlenecks, and potential security issues. This final post in the BIND series covers setting up comprehensive monitoring using Prometheus and Grafana, giving you dashboards and alerts for your DNS servers.

BIND Statistics Overview

BIND provides extensive statistics through several channels:

  • Statistics channel - XML/JSON HTTP endpoint
  • Query logging - Detailed query/response logs
  • RNDC statistics - Command-line statistics dump

For Prometheus integration, we'll use the bind_exporter to scrape the statistics channel.

Enabling BIND Statistics Channel

Configure Statistics in BIND

// /etc/bind/named.conf.options

options {
    directory "/var/cache/bind";
    
    // Enable statistics channel
    statistics-channels {
        inet 127.0.0.1 port 8053 allow { localhost; };
        inet ::1 port 8053 allow { localhost; };
    };
    
    // Enable zone statistics
    zone-statistics full;
    
    // Other options...
    recursion yes;
    dnssec-validation auto;
};

Verify Statistics Channel

# Restart BIND
systemctl restart named

# Test statistics endpoint (XML)
curl http://127.0.0.1:8053/

# Get XML statistics
curl http://127.0.0.1:8053/xml/v3/server

# Get JSON statistics (BIND 9.10+)
curl http://127.0.0.1:8053/json/v1/server
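
If you want to inspect the JSON output from a script rather than by eye, a small helper can rank the counters in one section. This is a minimal sketch, not part of BIND or the exporter: the URL matches the statistics-channels block above, while the section names ("qtypes", "rcodes") are an assumption about the JSON layout and may differ between BIND versions.

```python
import json
from urllib.request import urlopen

# Assumed endpoint, matching the statistics-channels configuration above.
STATS_URL = "http://127.0.0.1:8053/json/v1/server"


def fetch_stats(url: str = STATS_URL, timeout: float = 5.0) -> dict:
    """Download and decode the JSON statistics document."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def top_counters(stats: dict, section: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return the largest counters in one section, highest first.

    `section` is a top-level key such as "qtypes" or "rcodes". These key
    names are assumptions; print stats.keys() to see what your server emits.
    """
    counters = stats.get(section, {})
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

Calling top_counters(fetch_stats(), "qtypes") on the DNS server itself would list the most frequent query types, assuming that section is present.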

Installing bind_exporter

The bind_exporter converts BIND statistics to Prometheus metrics format.

Download and Install

# Download latest release
wget https://github.com/prometheus-community/bind_exporter/releases/download/v0.7.0/bind_exporter-0.7.0.linux-amd64.tar.gz

# Extract
tar xzf bind_exporter-0.7.0.linux-amd64.tar.gz
mv bind_exporter-0.7.0.linux-amd64/bind_exporter /usr/local/bin/

# Set permissions
chmod +x /usr/local/bin/bind_exporter

Create Systemd Service

# /etc/systemd/system/bind_exporter.service

[Unit]
Description=BIND Exporter for Prometheus
After=network.target named.service
Wants=named.service

[Service]
Type=simple
User=nobody
Group=nogroup
ExecStart=/usr/local/bin/bind_exporter \
    --bind.stats-url=http://127.0.0.1:8053/ \
    --bind.stats-groups=server,view,tasks \
    --web.listen-address=:9119
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

# Reload systemd, then enable and start the exporter
systemctl daemon-reload
systemctl enable bind_exporter
systemctl start bind_exporter

# Test metrics endpoint
curl http://localhost:9119/metrics
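
To check the exporter output automatically, you can scan the text returned by the metrics endpoint for the metric names used later in this post. The sketch below is a simplified reader for the Prometheus text format (it only extracts metric names), and the EXPECTED set is an illustrative choice rather than a complete list.

```python
import re

# Metric names referenced later in this post; adjust for your exporter version.
EXPECTED = {
    "bind_incoming_queries_total",
    "bind_responses_total",
    "bind_recursive_clients",
}

# A sample line starts with the metric name, optionally followed by {labels}.
_NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition: str, expected=EXPECTED) -> list[str]:
    """Return the expected metric names that do not appear in the output."""
    return sorted(set(expected) - metric_names(exposition))
```

Feeding it the body of the curl command above should return an empty list once BIND has served a few queries; anything it reports is worth checking against the exporter's documentation.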

Available Statistics Groups

The exporter supports these statistics groups:

Group    Description
server   Overall server statistics
view     Per-view statistics
tasks    Task manager statistics

Prometheus Configuration

Add Scrape Target

# /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'bind'
    static_configs:
      - targets:
          - 'dns1.example.com:9119'
          - 'dns2.example.com:9119'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+).*'
        replacement: '${1}'

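  # Multi-target pattern: one exporter probing several BIND servers.
  # Caution: this only works if the exporter honors a ?target= URL parameter.
  # bind_exporter as started above reads a single --bind.stats-url, so verify
  # that your exporter version supports this before relying on the job.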
  - job_name: 'bind-multi'
    static_configs:
      - targets:
          - 'dns1.example.com'
          - 'dns2.example.com'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'prometheus-server:9119'
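
The relabel rule in the 'bind' job is easier to reason about once you remember that Prometheus anchors relabel regexes at both ends. The following sketch imitates the replace action in Python for simple patterns like the one above; it is an approximation for illustration (Go and Python regex syntax differ in places), not Prometheus code.

```python
import re
from typing import Optional


def relabel_replace(source_value: str, regex: str, replacement: str) -> Optional[str]:
    """Imitate a Prometheus `replace` relabel action on a single value.

    Returns the new label value, or None when the regex does not match,
    in which case Prometheus would leave the target label unchanged.
    """
    # Prometheus matches the whole value, hence fullmatch rather than search.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None
    # Translate ${1}-style references into Python's \g<1> form.
    template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    return match.expand(template)


# The rule from the 'bind' job strips the port from the scrape address.
print(relabel_replace("dns1.example.com:9119", r"([^:]+).*", "${1}"))
```

With the rule from the config, the instance label becomes the bare hostname, which keeps dashboards readable if the exporter port ever changes.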

Reload Prometheus

kill -HUP $(pidof prometheus)
# Or
systemctl reload prometheus

Key Metrics to Monitor

Query Metrics

# Total queries received
bind_incoming_queries_total

# Queries by type (A, AAAA, MX, etc.)
bind_incoming_queries_total{type="A"}

# Query rate (queries per second)
rate(bind_incoming_queries_total[5m])

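# Responses by result (bind_exporter reports successful answers as "Success";
# check the label values in your /metrics output)
bind_responses_total{result="Success"}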
bind_responses_total{result="NXDOMAIN"}
bind_responses_total{result="SERVFAIL"}
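
All of these metrics are counters, which is why the queries wrap them in rate(): the raw value only ever grows, and it drops back to zero when named restarts. As a rough sketch of the idea (Prometheus additionally extrapolates to the window edges, which this version leaves out), a per-second rate can be computed from raw samples like this:

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate per-second rate from (timestamp, value) counter samples.

    A drop in value is treated as a counter reset (for example a named
    restart), so the value seen after the reset counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Normal case: add the difference. After a reset: add the new value.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed
```

Three scrapes 15 seconds apart with values 100, 250 and 400 work out to 10 queries per second, and a restart between scrapes does not produce a negative rate.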

Resolver Metrics

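# Cache hit ratio over the last 5 minutes (lifetime totals barely move once
# the counters are large, so compare rates instead)
rate(bind_resolver_cache_hits_total[5m]) /
(rate(bind_resolver_cache_hits_total[5m]) + rate(bind_resolver_cache_misses_total[5m]))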

# Recursive queries sent
rate(bind_resolver_queries_total[5m])

# Resolver query round-trip time distribution (histogram buckets)
bind_resolver_query_duration_seconds_bucket
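
The cache hit ratio above is most informative when it is computed over a recent window instead of over the counters' lifetime totals. The arithmetic is sketched below with made-up numbers; the function and parameter names are illustrative only.

```python
from typing import Optional


def hit_ratio(hits_start: float, hits_end: float,
              misses_start: float, misses_end: float) -> Optional[float]:
    """Cache hit ratio for one window, from counter values at its edges.

    Returns None when no lookups happened in the window, to avoid
    dividing by zero.
    """
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    return hits / total if total > 0 else None


# Lifetime totals at the end of the window: 1900 hits, 600 misses -> 0.76.
# Within the window itself: 900 hits, 100 misses -> 0.9.
print(hit_ratio(1000, 1900, 500, 600))
```

In this example the lifetime ratio still reflects an old cold-cache period, while the windowed ratio shows how the cache is behaving now.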

Server Health Metrics

# Current number of recursive clients
bind_recursive_clients

# TCP connections
bind_tcp_connections_total

# Zone transfers
bind_zone_transfer_success_total
bind_zone_transfer_failure_total

Grafana Dashboard

Import Pre-built Dashboard

Pre-built dashboards for bind_exporter are published by the community:

  1. In Grafana, go to Dashboards -> Import
  2. Enter dashboard ID: 12309 (BIND Exporter)
  3. Select your Prometheus data source
  4. Click Import

Custom Dashboard Panels

Query Rate Panel

{
  "title": "DNS Query Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(bind_incoming_queries_total[5m])) by (instance)",
      "legendFormat": "{{instance}}"
    }
  ],
  "yAxes": [
    {
      "label": "Queries/sec",
      "format": "short"
    }
  ]
}

Query Type Distribution

{
  "title": "Query Types",
  "type": "piechart",
  "targets": [
    {
      "expr": "sum(increase(bind_incoming_queries_total[1h])) by (type)",
      "legendFormat": "{{type}}"
    }
  ]
}

Response Codes

{
  "title": "Response Codes",
  "type": "stat",
  "targets": [
    {
      "expr": "sum(rate(bind_responses_total{result=\"Success\"}[5m]))",
      "legendFormat": "Success"
    },
    {
      "expr": "sum(rate(bind_responses_total{result=\"NXDOMAIN\"}[5m]))",
      "legendFormat": "NXDOMAIN"
    },
    {
      "expr": "sum(rate(bind_responses_total{result=\"SERVFAIL\"}[5m]))",
      "legendFormat": "SERVFAIL"
    }
  ]
}

Cache Hit Rate

{
  "title": "Cache Hit Rate",
  "type": "gauge",
  "targets": [
    {
      "expr": "sum(rate(bind_resolver_cache_hits_total[5m])) / (sum(rate(bind_resolver_cache_hits_total[5m])) + sum(rate(bind_resolver_cache_misses_total[5m]))) * 100"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "min": 0,
      "max": 100,
      "thresholds": {
        "steps": [
          {"value": 0, "color": "red"},
          {"value": 70, "color": "yellow"},
          {"value": 90, "color": "green"}
        ]
      }
    }
  }
}
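
The threshold steps in the gauge panel can be read as "each color applies from its value upward until the next step takes over". The sketch below spells that rule out, assuming this is how Grafana evaluates absolute thresholds; the helper itself is hypothetical and only mirrors the panel JSON above.

```python
# Steps copied from the Cache Hit Rate panel above.
STEPS = [
    {"value": 0, "color": "red"},
    {"value": 70, "color": "yellow"},
    {"value": 90, "color": "green"},
]


def threshold_color(value: float, steps=STEPS):
    """Return the color of the highest step whose value is <= the reading."""
    color = None
    for step in sorted(steps, key=lambda s: s["value"]):
        if value >= step["value"]:
            color = step["color"]
    return color


print(threshold_color(85))  # falls in the 70-90 band
```

A hit rate of 85% therefore shows as yellow, and the gauge only turns green at 90% or above.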

Alerting Rules

Prometheus Alert Rules

# /etc/prometheus/rules/bind.yml
# (list this file under rule_files in prometheus.yml so Prometheus loads it)

groups:
  - name: bind
    rules:
      # High query rate
      - alert: BindHighQueryRate
        expr: sum(rate(bind_incoming_queries_total[5m])) by (instance) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High DNS query rate on {{ $labels.instance }}"
          description: "Query rate is {{ $value }} queries/sec"

      # SERVFAIL rate too high
      - alert: BindHighServfailRate
        expr: |
          sum(rate(bind_responses_total{result="SERVFAIL"}[5m])) by (instance) /
          sum(rate(bind_responses_total[5m])) by (instance) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High SERVFAIL rate on {{ $labels.instance }}"
          description: "SERVFAIL rate is {{ $value | humanizePercentage }}"

      # Low cache hit rate (measured over a recent window, not since startup)
      - alert: BindLowCacheHitRate
        expr: |
          sum(rate(bind_resolver_cache_hits_total[15m])) by (instance) /
          (sum(rate(bind_resolver_cache_hits_total[15m])) by (instance) +
           sum(rate(bind_resolver_cache_misses_total[15m])) by (instance)) < 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate on {{ $labels.instance }}"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      # Recursive clients approaching the configured limit
      - alert: BindRecursiveClientsHigh
        expr: bind_recursive_clients > 900
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High recursive clients on {{ $labels.instance }}"
          description: "Recursive clients: {{ $value }} (default max: 1000)"

      # Zone transfer failures
      - alert: BindZoneTransferFailure
        expr: increase(bind_zone_transfer_failure_total[1h]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Zone transfer failure on {{ $labels.instance }}"
          description: "Zone transfers have failed in the last hour"

      # Exporter down
      - alert: BindExporterDown
        expr: up{job="bind"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BIND exporter down on {{ $labels.instance }}"

Query Logging for Analysis

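For deeper analysis, enable query logging and parse the logs. Query logging adds noticeable I/O on busy servers, so enable it deliberately and keep the size limits shown below.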

Enable Query Logging

// /etc/bind/named.conf

logging {
    channel query_log {
        file "/var/log/named/queries.log" versions 10 size 100m;
        severity info;
        print-time yes;
        print-category yes;
        print-severity yes;
    };
    
    channel security_log {
        file "/var/log/named/security.log" versions 5 size 50m;
        severity info;
        print-time yes;
        print-category yes;
    };
    
    category queries { query_log; };
    category security { security_log; };
    category query-errors { query_log; };
};

Parse Logs with Promtail/Loki

# /etc/promtail/promtail.yml

server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: bind_queries
    static_configs:
      - targets:
          - localhost
        labels:
          job: bind_queries
          __path__: /var/log/named/queries.log
    pipeline_stages:
      - regex:
          expression: '^(?P<timestamp>\d+-\w+-\d+ \d+:\d+:\d+\.\d+) queries: info: client @\S+ (?P<client_ip>[\d.]+)#\d+ \((?P<query>[^)]+)\): query: (?P<domain>\S+) IN (?P<type>\w+)'
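      - labels:
          # Keep only low-cardinality fields as labels. client_ip, query and
          # domain are unbounded and are better filtered at query time.
          type: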
      - timestamp:
          source: timestamp
          format: "02-Jan-2006 15:04:05.000"

  - job_name: bind_security
    static_configs:
      - targets:
          - localhost
        labels:
          job: bind_security
          __path__: /var/log/named/security.log
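
Before deploying the pipeline, it helps to test the regex against a real line from queries.log. The sketch below uses a pattern equivalent to the expression above (with the dot before the fractional seconds escaped) and a made-up sample line; the exact log layout depends on your BIND version and print-* settings, so treat the sample as an assumption.

```python
import re
from datetime import datetime

QUERY_RE = re.compile(
    r"^(?P<timestamp>\d+-\w+-\d+ \d+:\d+:\d+\.\d+) queries: info: "
    r"client @\S+ (?P<client_ip>[\d.]+)#\d+ \((?P<query>[^)]+)\): "
    r"query: (?P<domain>\S+) IN (?P<type>\w+)"
)

# Illustrative line only; copy a real one from /var/log/named/queries.log.
SAMPLE = (
    "15-Mar-2024 10:23:45.123 queries: info: client @0x7f8b4c0a1b20 "
    "192.0.2.10#53211 (example.com): query: example.com IN A +E(0)K (192.0.2.53)"
)


def parse_query_line(line: str):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = QUERY_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Python equivalent of the Go layout "02-Jan-2006 15:04:05.000".
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%d-%b-%Y %H:%M:%S.%f")
    return fields


print(parse_query_line(SAMPLE))
```

If your own lines come back as None, adjust the expression first; Promtail silently skips lines the regex stage cannot match. Note that the pattern only captures IPv4 client addresses.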

Complete Monitoring Stack with Docker Compose

Deploy a complete monitoring stack:

# docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change this before exposing Grafana
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    networks:
      - monitoring

  bind_exporter:
    image: prometheuscommunity/bind-exporter:latest
    container_name: bind_exporter
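    command:
      # The statistics channel configured earlier listens on 127.0.0.1 only.
      # For the container to reach it, BIND must also listen on an address the
      # container can see (for example the Docker bridge gateway) and allow
      # the container subnet in the statistics-channels ACL.
      - '--bind.stats-url=http://host.docker.internal:8053/'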
      - '--bind.stats-groups=server,view,tasks'
    ports:
      - "9119:9119"
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Security Monitoring

Detecting DNS Attacks

Create alerts for potential security issues:

# Security-focused alerts
groups:
  - name: bind_security
    rules:
      # Possible DNS amplification attack
      - alert: BindPossibleAmplificationAttack
        expr: |
          sum(rate(bind_incoming_queries_total{type="ANY"}[5m])) by (instance) > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Possible DNS amplification attack on {{ $labels.instance }}"
          description: "High rate of ANY queries detected"

      # Unusual NXDOMAIN rate (possible enumeration)
      - alert: BindHighNxdomainRate
        expr: |
          sum(rate(bind_responses_total{result="NXDOMAIN"}[5m])) by (instance) /
          sum(rate(bind_responses_total[5m])) by (instance) > 0.3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High NXDOMAIN rate on {{ $labels.instance }}"
          description: "NXDOMAIN rate is {{ $value | humanizePercentage }}"

      # Query flood detection
      - alert: BindQueryFlood
        expr: sum(rate(bind_incoming_queries_total[1m])) by (instance) > 5000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "DNS query flood on {{ $labels.instance }}"
          description: "Query rate: {{ $value }} q/s"

Useful PromQL Queries

Performance Analysis

# Top queried domains (requires log parsing)
topk(10, sum by (domain) (increase(bind_queries_by_domain[1h])))

# 95th percentile resolver query latency
histogram_quantile(0.95, sum(rate(bind_resolver_query_duration_seconds_bucket[5m])) by (le))

# Queries by protocol (UDP vs TCP)
sum(rate(bind_incoming_queries_total[5m])) by (protocol)

# DNSSEC validation failures
increase(bind_resolver_dnssec_validation_errors_total[1h])
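
The histogram_quantile() call above does not see individual query times; it estimates the percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. Here is a simplified sketch of that estimate, with bucket bounds chosen purely for illustration:

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.25, 50), (0.5, 80), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))
```

The practical consequence is that the result can never be more precise than the bucket boundaries the exporter provides, so treat small differences between percentiles with caution.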

Capacity Planning

# Peak query rate (last 7 days)
max_over_time(sum(rate(bind_incoming_queries_total[5m]))[7d:5m])

# Total queries over the last 24 hours
sum(increase(bind_incoming_queries_total[24h]))

# Cache size growth in bytes per second (deriv, because this is a gauge)
deriv(bind_resolver_cache_size_bytes[1h])

Best Practices

  1. Monitor all DNS servers - Deploy bind_exporter on every BIND instance
  2. Set appropriate scrape intervals - 15-30 seconds is usually sufficient
  3. Retain metrics appropriately - Keep high-resolution data for 15 days, aggregated data longer
  4. Alert on trends, not spikes - Set a for: duration on each rule to avoid alert fatigue
  5. Correlate with other metrics - Link DNS performance to application metrics
  6. Document your dashboards - Add descriptions to panels explaining what they show
  7. Test your alerts - Periodically verify alerts fire correctly

Conclusion

Monitoring BIND with Prometheus and Grafana provides comprehensive visibility into your DNS infrastructure. The bind_exporter makes it easy to collect metrics, while Prometheus alerting helps you respond to issues before they impact users.

This completes our BIND DNS series covering:

  1. Resolver configuration
  2. Authoritative server setup
  3. RNDC and remote management
  4. Zone security and ACLs
  5. Dynamic DNS updates
  6. External-DNS integration
  7. Response Policy Zones
  8. DNSSEC validation
  9. DNSSEC zone signing
  10. DNS over TLS
  11. DNS over HTTPS
  12. Prometheus monitoring

With these building blocks, you can build a robust, secure, and observable DNS infrastructure for any environment - from homelab to enterprise.

By Patrick de Ruiter