How to Configure Metabase for High Availability

High availability for Metabase means the application continues serving requests during infrastructure failures — including individual component failures, maintenance windows, and software upgrades — without user-visible downtime. Because Metabase is a stateful application that does not support active-active horizontal scaling, high availability is achieved through resilient infrastructure components (load balancers, managed databases, container orchestration) rather than running multiple active Metabase instances simultaneously.

---

What High Availability Means for Metabase

For most web applications, HA means running multiple application replicas behind a load balancer. Metabase's architecture makes this approach problematic: multiple simultaneous Metabase instances sharing a single application database will conflict on scheduled tasks, cache warming, and other singleton operations.

Instead, Metabase HA focuses on:

  • Zero-downtime deployments — new versions are deployed without taking down the running instance
  • Fast automated recovery — if the container crashes, it's restarted automatically within seconds
  • Database-level HA — the application database (PostgreSQL) runs with replication and automatic failover
  • Infrastructure redundancy — load balancers, NAT gateways, and networking are redundant across availability zones

The expected SLA for a properly configured Metabase deployment is 99.9%+ uptime, which allows for about 8.76 hours of downtime per year, largely from planned deployments and brief recovery windows after failures.
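
The downtime figure follows directly from the availability target. As a rough sketch, assuming a 365-day year (8,760 hours), the budget can be computed with a small shell helper; `downtime_hours_per_year` is a hypothetical name used only for illustration:

```shell
# Hours of downtime per year allowed by an availability target (in percent).
# Assumes a 365-day year, i.e. 8,760 hours.
downtime_hours_per_year() {
  awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 }'
}

downtime_hours_per_year 99.9
downtime_hours_per_year 99.95
```

For 99.9% this prints 8.76, and for 99.95% it prints 4.38, so each extra "half nine" roughly halves the room left for failovers and restarts.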

    ---

    Component-by-Component HA Configuration

    1. Container Orchestration (ECS / Kubernetes)

    Container orchestration handles automatic container restart on failure — this is the most important HA mechanism for Metabase.

    ECS Fargate:

    hcl
    

    resource "aws_ecs_service" "metabase" {
      desired_count = 1

      # Ensure a new healthy task starts before the old one stops
      deployment_minimum_healthy_percent = 100
      deployment_maximum_percent         = 200

      # Restart on task failure:
      # ECS automatically restarts tasks that exit unexpectedly
    }

    Kubernetes:

    yaml
    

    spec:
      replicas: 1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0 # never stop old pod before new one is ready
          maxSurge: 1

      template:
        spec:
          restartPolicy: Always # default: always restart on failure

    With these settings:

  • If the Metabase container crashes, the orchestrator restarts it automatically (typically within 30 seconds)
  • Deployments roll forward without taking down the existing instance first
  • If the new version fails health checks, the old version continues running

    2. Application Database HA (PostgreSQL)

    The application database is the single most critical component for Metabase availability. If the database is unavailable, Metabase cannot serve requests.

    AWS RDS Multi-AZ:

    hcl
    

    resource "aws_db_instance" "metabase" {
      multi_az                = true # enables synchronous replication to a standby
      backup_retention_period = 7
      deletion_protection     = true
    }

    With Multi-AZ enabled, RDS maintains a synchronous standby replica in a different availability zone. Failover to the standby is automatic and typically completes in 60–120 seconds. During failover, Metabase will be briefly unavailable while reconnecting.

    Connection resilience in Metabase:

    Metabase automatically reconnects to the database after a failover. The JVM-level PostgreSQL driver handles connection retries. No additional configuration is required — Metabase will recover on its own after an RDS failover.

    Google Cloud SQL (High Availability):

    yaml
    

    # In Cloud SQL instance settings
    availability_type: REGIONAL # multi-zone HA
    backup_configuration:
      enabled: true
      point_in_time_recovery_enabled: true

    3. Load Balancer Redundancy

    AWS Application Load Balancers and GCP Load Balancers are inherently highly available — they run across multiple availability zones and have no single point of failure. No additional configuration is needed; using a managed load balancer rather than a self-managed nginx instance is itself an HA choice.

    If you're running nginx as a reverse proxy in front of Metabase (common in Kubernetes), configure it with multiple replicas:

    yaml
    

    # nginx-ingress or standalone nginx deployment
    replicas: 2

    4. Multi-AZ Compute

    For ECS, deploy the task in subnets across multiple availability zones. If one AZ fails, ECS can restart the task in another AZ:

    hcl
    

    resource "aws_ecs_service" "metabase" {
      network_configuration {
        subnets = [
          aws_subnet.private_a.id, # us-east-1a
          aws_subnet.private_b.id, # us-east-1b
        ]
      }
    }

    For Kubernetes, ensure your node pool spans multiple AZs and set a pod topology spread constraint:

    yaml
    

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: metabase

    ---

    Health Checks

    Health checks are how the orchestrator and load balancer know Metabase is ready to serve traffic. Metabase exposes /api/health which returns HTTP 200 when the application is fully initialized.
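
The same check can be run by hand to see what the orchestrator sees. The sketch below assumes `curl` is installed and that Metabase is reachable at the base URL passed in; `metabase_is_ready` is a hypothetical helper name, not part of Metabase.

```shell
# Return success only if <base_url>/api/health answers with HTTP 200.
# Any other status, or a connection failure, counts as "not ready".
metabase_is_ready() {
  local base_url="$1" status
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code,
  # --max-time 5: give up after 5 seconds instead of hanging on a stuck instance
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base_url/api/health") || true
  [ "$status" = "200" ]
}
```

For example, `metabase_is_ready http://localhost:3000 && echo ready` prints `ready` only once the endpoint returns 200.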

    Correct health check configuration:

    yaml
    

    # Kubernetes startupProbe: allows time for first start
    startupProbe:
      httpGet:
        path: /api/health
        port: 3000
      failureThreshold: 30
      periodSeconds: 20 # up to 10 minutes for first startup

    # readinessProbe: controls traffic routing
    readinessProbe:
      httpGet:
        path: /api/health
        port: 3000
      periodSeconds: 10
      failureThreshold: 3

    # livenessProbe: triggers container restart
    livenessProbe:
      httpGet:
        path: /api/health
        port: 3000
      periodSeconds: 30
      failureThreshold: 5
      initialDelaySeconds: 120

    Critical: Do not use the same timeout for startup and liveness probes. Metabase's first startup takes 60–120 seconds (JVM initialization plus database migrations). A liveness probe that fires too early will kill the container before it finishes starting, causing an infinite restart loop.
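
The time a probe tolerates before Kubernetes acts is roughly failureThreshold × periodSeconds. A small sketch of that arithmetic, using the values from the probe configuration in this section (`probe_window_seconds` is a hypothetical helper name):

```shell
# Approximate time a probe may keep failing before Kubernetes acts:
# failureThreshold consecutive failures, one check every periodSeconds.
probe_window_seconds() {
  local failure_threshold="$1" period_seconds="$2"
  echo $(( failure_threshold * period_seconds ))
}

probe_window_seconds 30 20   # startupProbe
probe_window_seconds 5 30    # livenessProbe
```

This prints 600 and 150. The startup probe allows up to 600 seconds (10 minutes), well above the 60–120 second first start, while the liveness probe restarts a hung container after about 150 seconds of consecutive failures.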

    ---

    Graceful Shutdown

    When Kubernetes or ECS stops a Metabase container (for a deployment or scaling event), it sends SIGTERM and waits before sending SIGKILL. Configure a generous termination grace period to allow in-flight queries to complete:

    yaml
    

    # Kubernetes
    spec:
      terminationGracePeriodSeconds: 60

    hcl
    

    # ECS task definition (set per container, inside container_definitions)
    stopTimeout = 60 # seconds before SIGKILL

    Without this, active user queries are terminated mid-execution when a deployment happens.
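
The stop sequence can be imitated locally to get a feel for it. The following is only an illustration of the SIGTERM, grace period, SIGKILL order applied to an arbitrary process ID, not how Kubernetes or ECS implement it; `stop_gracefully` is a hypothetical helper.

```shell
# Send SIGTERM, wait up to <grace_seconds> for the process to exit,
# then fall back to SIGKILL.
# Returns 0 if the process exited on its own, 1 if SIGKILL was needed.
stop_gracefully() {
  local pid="$1" grace_seconds="$2" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only tests for existence
    if [ "$waited" -ge "$grace_seconds" ]; then
      kill -KILL "$pid" 2>/dev/null || true
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}
```

A process that finishes its work within the grace period exits cleanly; one that does not is cut off, which is what happens to a long-running query when the grace period is too short.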

    ---

    Automatic Recovery Scenarios

    | Failure | Recovery Mechanism | Expected Downtime |
    |---|---|---|
    | Container crash | ECS/Kubernetes automatic restart | 30–60 seconds |
    | Application OOM | Container restart (OOMKilled) | 30–60 seconds |
    | RDS failover (Multi-AZ) | Automatic DNS failover + reconnect | 60–120 seconds |
    | AZ failure (ECS multi-AZ) | Task rescheduled to healthy AZ | 90–180 seconds |
    | Metabase upgrade | Rolling deployment | 0 (new task up before old stops) |
    | Host node failure (K8s) | Pod rescheduled to healthy node | 60–180 seconds |

    ---

    Metabase Cloud HA

    If you're using Metabase Cloud, Metabase manages all HA infrastructure. Metabase Cloud:

  • Runs across multiple availability zones
  • Uses managed PostgreSQL with automatic failover
  • Handles upgrades with zero-downtime deployments
  • Provides a 99.9% uptime SLA

    For teams where HA infrastructure management is a burden, Metabase Cloud is the simpler path to high availability.

    ---

    Monitoring for Availability

    Configure alerting for HA-relevant events:

    Critical alerts (page immediately):

  • /api/health returning non-200 for > 2 minutes
  • RDS database unavailable
  • ECS service desired count != running count
  • Load balancer reporting 0 healthy targets

    Warning alerts (notify, investigate within business hours):

  • Container restart count > 2 in 1 hour (crash loop early warning)
  • RDS failover occurred
  • Memory utilization > 85% (approaching OOM)
  • CPU utilization > 80% sustained for 15 minutes

    hcl

    # Terraform: CloudWatch alarm for zero healthy targets
    resource "aws_cloudwatch_metric_alarm" "no_healthy_hosts" {
      alarm_name          = "metabase-no-healthy-hosts"
      metric_name         = "HealthyHostCount"
      namespace           = "AWS/ApplicationELB"
      statistic           = "Minimum"
      period              = 60
      evaluation_periods  = 2
      threshold           = 1
      comparison_operator = "LessThanThreshold"
      alarm_actions       = [aws_sns_topic.pagerduty.arn]

      dimensions = {
        LoadBalancer = aws_lb.metabase.arn_suffix
        TargetGroup  = aws_lb_target_group.metabase.arn_suffix
      }
    }

    ---

    Runbook: Responding to Metabase Outages

    Container Not Starting

    bash
    

    # Check container logs
    kubectl logs -n metabase deployment/metabase --previous
    # or
    aws logs tail /ecs/metabase --since 30m

    # Common causes:
    # - Can't reach application DB (check security groups / network policy)
    # - Wrong DB credentials (check Secrets Manager)
    # - Insufficient memory (check OOMKilled events)

    Metabase Responding Slowly

    bash
    

    # Check if it's a query issue (slow database) vs application issue
    # Look for slow query indicators in Metabase logs

    # Check memory pressure (container-level usage; requires metrics-server)
    kubectl top pod -n metabase

    # Consider enabling query caching for frequently-accessed dashboards
    # Admin → Performance → Cache

    Application Database Failover

    bash
    

    # After RDS Multi-AZ failover, Metabase reconnects automatically
    # Monitor for reconnection: watch logs for "Connected to database"

    # If Metabase doesn't reconnect within 5 minutes, restart the container:
    kubectl rollout restart deployment/metabase -n metabase
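
The wait-then-restart step can be scripted. The sketch below is one way to do it, assuming that `/api/health` returning successfully is an acceptable signal that Metabase has reconnected; `wait_until` is a hypothetical helper and the URL in the usage comment is a placeholder.

```shell
# Run a command repeatedly until it succeeds or roughly <timeout_seconds> have
# passed (the time the command itself takes is not counted).
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
  local timeout_seconds="$1" interval_seconds="$2" elapsed=0
  shift 2
  while [ "$elapsed" -lt "$timeout_seconds" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval_seconds"
    elapsed=$(( elapsed + interval_seconds ))
  done
  return 1
}

# Usage (placeholder URL): allow 5 minutes for recovery, then restart.
# wait_until 300 10 curl -fsS -o /dev/null http://metabase.example.internal:3000/api/health \
#   || kubectl rollout restart deployment/metabase -n metabase
```

Keeping the retry loop separate from the check makes it easy to swap in a different readiness signal, such as a log search, without touching the timeout logic.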

    ---

    Summary

    High availability for Metabase is achieved through infrastructure resilience rather than application-level horizontal scaling. The critical components are: container orchestration with maxUnavailable: 0 for zero-downtime deployments and automatic container restart, Multi-AZ managed PostgreSQL for the application database with automatic failover, a managed load balancer across multiple availability zones, and correctly configured health checks that account for Metabase's 60–120 second startup time. With these in place, a single-replica Metabase deployment can achieve 99.9%+ availability.