How to Configure Metabase for High Availability

High availability for Metabase means the application continues serving requests during infrastructure failures — including individual component failures, maintenance windows, and software upgrades — without user-visible downtime. Because Metabase is a stateful application that does not support active-active horizontal scaling, high availability is achieved through resilient infrastructure components (load balancers, managed databases, container orchestration) rather than running multiple active Metabase instances simultaneously.

---

What High Availability Means for Metabase

For most web applications, HA means running multiple application replicas behind a load balancer. Metabase's architecture makes this approach problematic: multiple simultaneous Metabase instances sharing a single application database will conflict on scheduled tasks, cache warming, and other singleton operations.

Instead, Metabase HA focuses on:

Zero-downtime deployments — new versions are deployed without taking down the running instance

Fast automated recovery — if the container crashes, it's restarted automatically within seconds

Database-level HA — the application database (PostgreSQL) runs with replication and automatic failover

Infrastructure redundancy — load balancers, NAT gateways, and networking are redundant across availability zones

The expected SLA for a properly configured Metabase deployment is 99.9%+ uptime. A 99.9% target allows about 8.76 hours of downtime per year (0.1% of 8,760 hours), largely from planned deployments and brief recovery windows after failures.
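The downtime figure follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_budget_hours` is illustrative, not part of any Metabase tooling):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the yearly downtime (in hours) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets side by side.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% uptime -> {downtime_budget_hours(target):.2f} hours/year")
```

For a 99.9% target this works out to roughly 8.76 hours per year, which is the budget the recovery windows described later have to fit into.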

---

Component-by-Component HA Configuration

1. Container Orchestration (ECS / Kubernetes)

Container orchestration handles automatic container restart on failure — this is the most important HA mechanism for Metabase.

ECS Fargate:

hcl
resource "aws_ecs_service" "metabase" {
  desired_count = 1

  # Ensure a new healthy task starts before the old one stops
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  # Restart on task failure
  # ECS automatically restarts tasks that exit unexpectedly
}

Kubernetes:

yaml
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never stop old pod before new one is ready
      maxSurge: 1

  template:
    spec:
      restartPolicy: Always  # default: always restart on failure

With these settings:

If the Metabase container crashes, the orchestrator restarts it automatically (typically within 30 seconds)

Deployments roll forward without taking down the existing instance first

If the new version fails health checks, the old version continues running

2. Application Database HA (PostgreSQL)

The application database is the single most critical component for Metabase availability. If the database is unavailable, Metabase cannot serve requests.

AWS RDS Multi-AZ:

hcl
resource "aws_db_instance" "metabase" {
  multi_az                = true  # enables synchronous replication to a standby
  backup_retention_period = 7
  deletion_protection     = true
}

With Multi-AZ enabled, RDS maintains a synchronous standby replica in a different availability zone. Failover to the standby is automatic and typically completes in 60–120 seconds. During failover, Metabase will be briefly unavailable while reconnecting.

Connection resilience in Metabase:

Metabase automatically reconnects to the database after a failover. The JVM-level PostgreSQL driver handles connection retries. No additional configuration is required — Metabase will recover on its own after an RDS failover.

Google Cloud SQL (High Availability):

yaml
# In Cloud SQL instance settings
availability_type: REGIONAL  # multi-zone HA
backup_configuration:
  enabled: true
  point_in_time_recovery_enabled: true

3. Load Balancer Redundancy

AWS Application Load Balancers and GCP Load Balancers are inherently highly available — they run across multiple availability zones and have no single point of failure. No additional configuration is needed; using a managed load balancer rather than a self-managed nginx instance is itself an HA choice.

If you're running nginx as a reverse proxy in front of Metabase (common in Kubernetes), configure it with multiple replicas:

yaml
# nginx-ingress or standalone nginx deployment
replicas: 2

4. Multi-AZ Compute

For ECS, deploy the task in subnets across multiple availability zones. If one AZ fails, ECS can restart the task in another AZ:

hcl
resource "aws_ecs_service" "metabase" {
  network_configuration {
    subnets = [
      aws_subnet.private_a.id,  # us-east-1a
      aws_subnet.private_b.id,  # us-east-1b
    ]
  }
}

For Kubernetes, ensure your node pool spans multiple AZs and set a pod topology spread constraint:

yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: metabase

---

Health Checks

Health checks are how the orchestrator and load balancer know Metabase is ready to serve traffic. Metabase exposes /api/health which returns HTTP 200 when the application is fully initialized.

Correct health check configuration:

yaml
# Kubernetes startupProbe: allows time for first start
startupProbe:
  httpGet:
    path: /api/health
    port: 3000
  failureThreshold: 30
  periodSeconds: 20  # up to 10 minutes for first startup

# readinessProbe: controls traffic routing
readinessProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 10
  failureThreshold: 3

# livenessProbe: triggers container restart
livenessProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 30
  failureThreshold: 5
  initialDelaySeconds: 120

Critical: Do not use the same timeout for startup and liveness probes. Metabase's first startup takes 60–120 seconds (JVM initialization plus database migrations). A liveness probe that fires too early will kill the container before it finishes starting, causing an infinite restart loop.
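The restart-loop risk comes down to simple arithmetic: a probe tolerates roughly `periodSeconds × failureThreshold` seconds of consecutive failures before the kubelet acts. The sketch below checks that window against the startup time. It is an approximation that ignores `timeoutSeconds` and scheduling jitter, and the `Probe` class and the "aggressive" probe values are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Timing fields of a Kubernetes probe (timeoutSeconds is ignored here)."""

    period_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate seconds of consecutive failures tolerated before action."""
        return self.period_seconds * self.failure_threshold


def covers_startup(probe: Probe, startup_seconds: int) -> bool:
    """True if the probe tolerates failures for longer than the app needs to start."""
    return probe.failure_window_seconds() > startup_seconds


# Values taken from the probe configuration above.
startup_probe = Probe(period_seconds=20, failure_threshold=30)
liveness_probe = Probe(period_seconds=30, failure_threshold=5)

# Hypothetical aggressive liveness probe used WITHOUT a startupProbe.
aggressive_liveness = Probe(period_seconds=10, failure_threshold=3)

SLOW_START_SECONDS = 120  # upper end of the 60-120 second startup range

print(startup_probe.failure_window_seconds())                   # 600
print(covers_startup(startup_probe, SLOW_START_SECONDS))        # True
print(covers_startup(aggressive_liveness, SLOW_START_SECONDS))  # False
```

The startup probe's 600-second window comfortably covers a slow first start, while a 30-second liveness window on its own would restart the container before Metabase finished initializing.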

---

Graceful Shutdown

When Kubernetes or ECS stops a Metabase container (for a deployment or scaling event), it sends SIGTERM and waits before sending SIGKILL. Configure a generous termination grace period to allow in-flight queries to complete:

yaml
# Kubernetes
spec:
  terminationGracePeriodSeconds: 60

hcl
# ECS task definition (set inside the container definition)
stopTimeout = 60  # seconds before SIGKILL; Fargate allows at most 120

Without this, active user queries are terminated mid-execution when a deployment happens.

---

Automatic Recovery Scenarios

Failure	Recovery Mechanism	Expected Downtime
Container crash	ECS/Kubernetes automatic restart	30–60 seconds
Application OOM	Container restart (OOMKilled)	30–60 seconds
RDS failover (Multi-AZ)	Automatic DNS failover + reconnect	60–120 seconds
AZ failure (ECS multi-AZ)	Task rescheduled to healthy AZ	90–180 seconds
Metabase upgrade	Rolling deployment	0 (new task up before old stops)
Host node failure (K8s)	Pod rescheduled to healthy node	60–180 seconds
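To see how these recovery windows add up against a 99.9% target, the sketch below multiplies the upper-bound downtime from each row by a yearly event count. The event counts are invented purely for illustration; substitute figures from your own incident history.

```python
# Upper-bound downtime per event, in seconds, copied from the table above.
WORST_CASE_SECONDS = {
    "container_crash": 60,
    "application_oom": 60,
    "rds_failover": 120,
    "az_failure": 180,
    "upgrade": 0,
    "node_failure": 180,
}

# 99.9% availability leaves 0.1% of a 365-day year: 31,536 seconds.
BUDGET_SECONDS = 365 * 24 * 3600 // 1000


def estimated_downtime_seconds(events_per_year: dict) -> int:
    """Sum worst-case downtime for the given number of events of each kind."""
    return sum(WORST_CASE_SECONDS[name] * count for name, count in events_per_year.items())


# Hypothetical year of incidents (assumed numbers, not measurements).
hypothetical_year = {
    "container_crash": 12,
    "application_oom": 4,
    "rds_failover": 2,
    "az_failure": 1,
    "upgrade": 12,
    "node_failure": 3,
}

used = estimated_downtime_seconds(hypothetical_year)
print(f"Estimated downtime: {used / 60:.1f} minutes")
print(f"Share of 99.9% budget used: {used / BUDGET_SECONDS:.1%}")
```

Under these assumed counts the automated recoveries consume only a small fraction of the yearly budget, which is why a single-replica deployment can still meet the target.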

---

Metabase Cloud HA

If you're using Metabase Cloud, the Metabase team manages all HA infrastructure. Metabase Cloud:

Runs across multiple availability zones

Uses managed PostgreSQL with automatic failover

Handles upgrades with zero-downtime deployments

Provides a 99.9% uptime SLA

For teams where HA infrastructure management is a burden, Metabase Cloud is the simpler path to high availability.

---

Monitoring for Availability

Configure alerting for HA-relevant events:

Critical alerts (page immediately):

/api/health returning non-200 for > 2 minutes

RDS database unavailable

ECS service desired count != running count

Load balancer reporting 0 healthy targets

Warning alerts (notify, investigate within business hours):

Container restart count > 2 in 1 hour (crash loop early warning)

RDS failover occurred

Memory utilization > 85% (approaching OOM)

CPU utilization > 80% sustained for 15 minutes

hcl
# Terraform: CloudWatch alarm for zero healthy targets
resource "aws_cloudwatch_metric_alarm" "no_healthy_hosts" {
  alarm_name          = "metabase-no-healthy-hosts"
  metric_name         = "HealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 1
  comparison_operator = "LessThanThreshold"
  alarm_actions       = [aws_sns_topic.pagerduty.arn]

  dimensions = {
    LoadBalancer = aws_lb.metabase.arn_suffix
    TargetGroup  = aws_lb_target_group.metabase.arn_suffix
  }
}
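The crash-loop warning ("restart count > 2 in 1 hour") is a sliding-window rule. Most monitoring systems can express it natively; the sketch below only illustrates the logic, and the `CrashLoopDetector` class is a hypothetical helper rather than part of any monitoring product.

```python
from collections import deque
from datetime import datetime, timedelta


class CrashLoopDetector:
    """Flags when more than max_restarts restarts fall inside a sliding window."""

    def __init__(self, max_restarts: int = 2, window: timedelta = timedelta(hours=1)):
        self.max_restarts = max_restarts
        self.window = window
        self._restarts = deque()  # restart timestamps, oldest first

    def record_restart(self, at: datetime) -> bool:
        """Record a restart; return True when the warning threshold is exceeded."""
        self._restarts.append(at)
        cutoff = at - self.window
        # Drop restarts that have aged out of the window.
        while self._restarts and self._restarts[0] <= cutoff:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts


detector = CrashLoopDetector()
t0 = datetime(2024, 1, 1, 12, 0)
first = detector.record_restart(t0)
second = detector.record_restart(t0 + timedelta(minutes=10))
third = detector.record_restart(t0 + timedelta(minutes=20))
print(first, second, third)  # False False True
```

Keeping timestamps in a deque means old restarts expire on their own, so an isolated crash hours later does not trigger the warning again.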

---

Runbook: Responding to Metabase Outages

Container Not Starting

bash
# Check container logs
kubectl logs -n metabase deployment/metabase --previous
# or
aws logs tail /ecs/metabase --since 30m

# Common causes:
# - Can't reach application DB (check security groups / network policy)
# - Wrong DB credentials (check Secrets Manager)
# - Insufficient memory (check OOMKilled events)

Metabase Responding Slowly

bash
# Check if it's a query issue (slow database) vs application issue
# Look for slow query indicators in Metabase logs

# Check memory pressure at the container level (requires metrics-server)
kubectl top pod -n metabase

# Consider enabling query caching for frequently-accessed dashboards
# Admin → Performance → Cache

Application Database Failover

bash
# After RDS Multi-AZ failover, Metabase reconnects automatically
# Monitor for reconnection: watch logs for "Connected to database"
# If Metabase doesn't reconnect within 5 minutes, restart the container:
kubectl rollout restart deployment/metabase -n metabase
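The "wait up to 5 minutes, then restart" step can be automated by polling /api/health. The following is a minimal sketch using only the Python standard library. The helper names and the example URL are assumptions for illustration, and the restart itself is left to the `kubectl` command above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, etc.


def wait_until_healthy(
    check: Callable[[], int],
    deadline_seconds: float = 300.0,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns 200 or the deadline would be exceeded."""
    start = clock()
    while True:
        if check() == 200:
            return True
        # Stop if another wait would run past the deadline.
        if clock() - start + interval_seconds > deadline_seconds:
            return False
        sleep(interval_seconds)


# Example usage (hypothetical hostname):
# recovered = wait_until_healthy(
#     lambda: http_status("http://metabase.internal:3000/api/health")
# )
# if not recovered:
#     print("No recovery within 5 minutes; restart the deployment")
```

Passing `sleep` and `clock` as parameters keeps the polling loop testable without real waiting, at the cost of a slightly longer signature.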

---

Summary

High availability for Metabase is achieved through infrastructure resilience rather than application-level horizontal scaling. The critical components are: container orchestration with maxUnavailable: 0 for zero-downtime deployments and automatic container restart, Multi-AZ managed PostgreSQL for the application database with automatic failover, a managed load balancer across multiple availability zones, and correctly configured health checks that account for Metabase's 60–120 second startup time. With these in place, a single-replica Metabase deployment can achieve 99.9%+ availability.