Scaling Metabase to 1,000+ Users

Scaling Metabase to 1,000 or more concurrent users is primarily a database scaling problem, not a Metabase application scaling problem. A single well-resourced Metabase instance can serve thousands of users — the bottleneck is almost always the underlying analytics database being queried, the application database storing Metabase's configuration, or the network between them. This guide covers what actually limits Metabase at scale, and the specific interventions that extend capacity without unnecessary complexity.

---

What "Scaling" Means for Metabase

Before optimizing, define what you're scaling:

Concurrent dashboard viewers: users loading dashboards simultaneously. Each dashboard load triggers multiple queries against your analytics database. 1,000 concurrent users loading a 10-question dashboard generate up to ~10,000 simultaneous queries.

Concurrent query authors: analysts running ad-hoc queries in the query builder or SQL editor. These are typically longer-running and more resource-intensive per query than dashboard loads.

Embedded analytics users: customers viewing embedded dashboards in your SaaS product. Often the highest-volume user type, but their queries are usually pre-defined and cacheable.

The intervention for each scenario differs. Most "1,000+ user" deployments are dominated by embedded analytics viewers, where caching and database optimization are the primary levers.
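The load arithmetic above can be sketched as a small helper. This is a rough estimate only: the function name and the `cacheHitRate` parameter are illustrative assumptions, and it treats every question on a dashboard as one database query.

```javascript
// Estimate how many analytics-database queries a burst of dashboard loads produces.
// cacheHitRate is the fraction of queries served from cache (0 = none, 1 = all).
function estimateDatabaseQueries(concurrentUsers, questionsPerDashboard, cacheHitRate = 0) {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError("cacheHitRate must be between 0 and 1");
  }
  const totalQueries = concurrentUsers * questionsPerDashboard;
  // Only cache misses reach the database.
  return Math.round(totalQueries * (1 - cacheHitRate));
}

const uncached = estimateDatabaseQueries(1000, 10);      // every query hits the database
const cached = estimateDatabaseQueries(1000, 10, 0.9);   // 90% served from cache
console.log({ uncached, cached });
```

Running the same numbers for each user type (viewers, authors, embedded users) shows quickly which group dominates database load.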

---

Capacity Limits and Bottlenecks

The Metabase Application Process

Metabase is a JVM application with an internal thread pool for query execution. The defaults handle modest concurrency well. At high scale, the relevant limits are:

Connection pool size: Metabase maintains a pool of connections to each connected database. In recent versions the default maximum is 15 connections per database, adjustable with the MB_JDBC_DATA_WAREHOUSE_MAX_CONNECTION_POOL_SIZE environment variable. At 1,000 concurrent users this becomes a bottleneck: queries from 1,000 users compete for 15 connections.

JVM heap: At very high concurrency, result sets held in memory can exhaust the JVM heap. Increase with JAVA_OPTS="-Xmx4g" for large deployments.

Thread pool: Metabase's query execution thread pool limits true parallelism. Most users experience this as query queue latency rather than errors.
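To see why a small pool shows up as queue latency rather than errors, here is a back-of-the-envelope sketch. It assumes queries are processed in full "waves" of `poolSize` at a time and all take the same average duration, which real workloads will not match exactly; the function name is hypothetical.

```javascript
// Rough time to drain a burst of queued queries through a fixed-size connection pool.
// Assumes uniform query duration and that each connection runs one query at a time.
function estimateDrainSeconds(queuedQueries, poolSize, avgQuerySeconds) {
  if (poolSize <= 0) throw new RangeError("poolSize must be positive");
  const waves = Math.ceil(queuedQueries / poolSize); // batches of poolSize queries
  return waves * avgQuerySeconds;
}

// 10,000 queued queries, 15 connections, 0.5 s per query
console.log(estimateDrainSeconds(10000, 15, 0.5), "seconds to clear the queue");
```

Even with fast queries, the last user in the queue waits minutes, which is why caching and pool sizing matter more than raw application CPU.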

The Application Database (PostgreSQL)

The Metabase application database stores sessions, cached results, and configuration. At scale, it receives:

  • A read for every dashboard load (checking permissions, loading questions)
  • A write for every query execution (logging query history)
  • Cache reads and writes for every cacheable question

Configure the application database for the increased load:

ini

    # Set in the RDS parameter group or postgresql.conf
    # Increase max connections for the application DB (PostgreSQL's default is 100)
    max_connections = 200

    # Shared buffers: about 25% of available RAM
    shared_buffers = 4GB

    # Planner's estimate of memory available for caching
    # (shared buffers plus the OS cache)
    effective_cache_size = 12GB

The Analytics Database

This is almost always the primary bottleneck. 1,000 users querying Snowflake, Redshift, or PostgreSQL simultaneously creates significant load. See the query performance and caching guides for database-level optimization.

---

Scaling Strategies

Strategy 1: Aggressive Caching (Highest Leverage)

For embedded analytics with pre-defined dashboards, caching eliminates the majority of database queries. If 90% of dashboard loads can be served from cache:

  • 1,000 concurrent users loading a 10-question dashboard generate ~1,000 database queries (not 10,000)
  • Database load stays within normal operational parameters
  • Dashboard load time drops from seconds to milliseconds

Configuration:

bash

    # Redis for high-throughput caching
    # Note: confirm these variable names against your Metabase version's
    # environment-variable reference before relying on them
    MB_REDIS_HOST=your-redis-host
    MB_REDIS_PORT=6379

    # Reduce minimum query duration to cache more queries
    # Admin → Performance → Minimum query duration: 1 second

    # Set appropriate TTLs per dashboard type
    # Customer portal: 30-60 minutes
    # Internal metrics: 5-15 minutes

Cache warming at scale:

javascript

    // Assumes `db` is a connected node-postgres Pool and `warmTenantCache` is
    // your own helper that loads a tenant's dashboards; neither is defined here.
    const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Warm cache for top N tenants by activity
    async function warmTopTenants(limit = 100) {
      const topTenants = await db.query(
        `SELECT organization_id, COUNT(*) AS dashboard_views
         FROM analytics_events
         WHERE event = 'dashboard_view'
           AND created_at > NOW() - INTERVAL '7 days'
         GROUP BY organization_id
         ORDER BY dashboard_views DESC
         LIMIT $1`,
        [limit]
      );

      for (const tenant of topTenants.rows) {
        await warmTenantCache(tenant.organization_id);
        await delay(100); // avoid overwhelming Metabase
      }
    }
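Warming tenants strictly one at a time can be slow for long tenant lists. One possible variation, sketched below, runs a small fixed number of warmers in parallel so Metabase is kept busy without being flooded. `warmWithConcurrency` and its parameters are hypothetical names; `warmFn` stands in for whatever per-tenant warming helper you use.

```javascript
// Run warmFn over items with at most `concurrency` calls in flight at once.
// Failures are recorded per item instead of aborting the whole run.
async function warmWithConcurrency(items, warmFn, concurrency = 5) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  async function worker() {
    while (next < items.length) {
      const index = next++; // claim an item (safe: JS runs this synchronously)
      try {
        results[index] = { ok: true, value: await warmFn(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A call such as `warmWithConcurrency(tenantIds, warmTenantCache, 5)` would return one result entry per tenant, so failed tenants can be retried later without repeating the successful ones.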

Strategy 2: Vertical Scaling of the Metabase Container

Before adding complexity with horizontal scaling or additional infrastructure, scale the single Metabase instance vertically. The JVM benefits significantly from more memory, and more CPU means less query queuing.

User count            | Recommended Metabase resources
< 100 concurrent      | 2 vCPU, 3GB RAM
100–500 concurrent    | 4 vCPU, 8GB RAM
500–2,000 concurrent  | 8 vCPU, 16GB RAM
2,000+ concurrent     | Evaluate horizontal scaling or Metabase Cloud

yaml

    # ECS Fargate task definition for large deployment
    cpu: 4096      # 4 vCPU
    memory: 16384  # 16 GB

    JAVA_OPTS: "-Xmx12g -Xms2g"  # 12GB heap, headroom for JVM overhead
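The sizing table and heap setting above can be combined into a small lookup. This is a sketch under two assumptions that are not stated in the table itself: tier boundaries are treated as exclusive upper bounds, and the heap is set to 75% of container memory (the same ratio as 12GB of 16GB in the example). The names are illustrative.

```javascript
// Sizing tiers taken from the table above; maxUsers is an exclusive upper bound.
const SIZING_TIERS = [
  { maxUsers: 100, vcpu: 2, memoryGb: 3 },
  { maxUsers: 500, vcpu: 4, memoryGb: 8 },
  { maxUsers: 2000, vcpu: 8, memoryGb: 16 },
];

// Returns a recommendation, or null when the table says to look beyond one instance.
function recommendResources(concurrentUsers) {
  const tier = SIZING_TIERS.find((t) => concurrentUsers < t.maxUsers);
  if (!tier) return null; // 2,000+ users: evaluate horizontal scaling or Metabase Cloud
  const heapGb = Math.floor(tier.memoryGb * 0.75); // leave ~25% for JVM overhead
  return { vcpu: tier.vcpu, memoryGb: tier.memoryGb, javaOpts: `-Xmx${heapGb}g` };
}

console.log(recommendResources(1000));
```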

Strategy 3: Read Replicas for Analytics Databases

Direct Metabase to a read replica to isolate analytics load from application write traffic:

bash

    # Point Metabase at the read replica
    # Admin → Databases → Edit → Host: your-read-replica-endpoint

For RDS, add read replicas as query load increases:

hcl

    resource "aws_db_instance" "analytics_replica" {
      identifier                 = "analytics-replica-1"
      replicate_source_db        = aws_db_instance.analytics_primary.identifier
      instance_class             = "db.r6g.xlarge"
      publicly_accessible        = false
      skip_final_snapshot        = true
      auto_minor_version_upgrade = true
    }

Strategy 4: Connection Pooling (PgBouncer)

At high concurrency, Metabase may exhaust the connection pool to your analytics database. PgBouncer multiplexes many Metabase connections into fewer actual database connections:

yaml

    # docker-compose: add PgBouncer between Metabase and PostgreSQL
    # Environment variable names depend on the PgBouncer image; check its documentation
    services:
      pgbouncer:
        image: pgbouncer/pgbouncer:latest
        environment:
          DATABASE_URL: "postgres://metabase_reader:pass@analytics-db:5432/analytics"
          POOL_MODE: transaction    # pool connections per transaction
          MAX_CLIENT_CONN: 500      # max connections from Metabase
          DEFAULT_POOL_SIZE: 25     # actual connections to PostgreSQL
          RESERVE_POOL_SIZE: 5
          RESERVE_POOL_TIMEOUT: 3

Point Metabase at PgBouncer instead of PostgreSQL directly. With these settings, up to 500 client connections from Metabase share 25 actual connections to PostgreSQL.
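Before settling on pool numbers, it can help to check that they fit within the database's own connection limit. The sketch below is a simple budget check, assuming a single PgBouncer pool (one database and user pair); the function and field names are hypothetical, and `reservedForOthers` stands for connections you want to keep free for other clients.

```javascript
// Check whether PgBouncer's server-side connections fit within the database limit.
function checkPoolBudget({
  maxClientConn,
  defaultPoolSize,
  reservePoolSize = 0,
  serverMaxConnections,
  reservedForOthers = 0,
}) {
  // Worst case for one pool: the default pool plus the reserve pool.
  const serverConnections = defaultPoolSize + reservePoolSize;
  const available = serverMaxConnections - reservedForOthers;
  return {
    multiplexRatio: maxClientConn / defaultPoolSize, // clients per server connection
    serverConnections,
    fitsServerLimit: serverConnections <= available,
  };
}

console.log(
  checkPoolBudget({
    maxClientConn: 500,
    defaultPoolSize: 25,
    reservePoolSize: 5,
    serverMaxConnections: 200,
    reservedForOthers: 20,
  })
);
```

A high multiplex ratio only works when most client connections are idle between transactions, which is typical for dashboard traffic but less so for long-running ad-hoc queries.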

Strategy 5: Horizontal Scaling Considerations

Treat horizontal scaling as the last lever, not the first. Metabase's documentation describes running several instances behind a load balancer against one shared application database (PostgreSQL or MySQL rather than the default H2), but doing so adds operational overhead and does nothing for the usual bottleneck, the analytics database.

What is possible:

  • Read-heavy horizontal scaling: Run a second Metabase instance in read-only mode that serves only embedded dashboard queries (no admin, no writes). This requires careful configuration and is not officially supported.
  • Metabase Cloud: Metabase's hosted service handles horizontal scaling internally. Teams with 2,000+ concurrent users should evaluate Metabase Cloud rather than self-managed horizontal scaling.

The practical ceiling: A single well-configured Metabase instance (8 vCPU, 16GB RAM) with aggressive caching and optimized database connections handles 1,000–2,000 concurrent embedded analytics viewers. Above this, the complexity of self-managed scaling typically makes Metabase Cloud or a managed embedding solution more cost-effective.

---

Monitoring at Scale

At high user counts, proactive monitoring is essential. Track these metrics:

Metabase-Level Metrics

javascript

    // Monitor query activity via the Metabase API.
    // METABASE_URL and API_KEY come from your environment. The response field
    // names below are assumptions; check them against your instance's output.
    async function getQueryStats() {
      const stats = await fetch(`${METABASE_URL}/api/util/stats`, {
        headers: { "x-api-key": API_KEY },
      }).then((r) => r.json());

      return {
        active_queries: stats.running_queries,
        cached_queries: stats.cached_query_count,
        uptime_seconds: stats.uptime,
      };
    }

Database-Level Metrics

sql

    -- PostgreSQL: monitor active connections from Metabase
    -- (ILIKE because the application name may be capitalized)
    SELECT
      count(*) AS total_connections,
      count(*) FILTER (WHERE state = 'active') AS active,
      count(*) FILTER (WHERE state = 'idle') AS idle,
      count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
      count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_lock
    FROM pg_stat_activity
    WHERE application_name ILIKE '%metabase%';

    -- Median, p95, and p99 of per-statement mean execution time
    -- (requires the pg_stat_statements extension)
    SELECT
      percentile_cont(0.5) WITHIN GROUP (ORDER BY mean_exec_time) AS p50_ms,
      percentile_cont(0.95) WITHIN GROUP (ORDER BY mean_exec_time) AS p95_ms,
      percentile_cont(0.99) WITHIN GROUP (ORDER BY mean_exec_time) AS p99_ms
    FROM pg_stat_statements
    WHERE query ILIKE '%from orders%';

Key Alerts at Scale

Metric                         | Warning threshold | Critical threshold
Metabase heap usage            | > 70%             | > 85%
DB connection pool exhaustion  | > 80% utilized    | > 95% utilized
Query p95 latency              | > 5s              | > 15s
Cache hit rate                 | < 60%             | < 40%
Application DB connections     | > 80% of max      | > 95% of max
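The thresholds above can be encoded as a small classifier for use in a monitoring script. This is a minimal sketch: the metric keys are hypothetical names chosen for illustration, and the only subtlety is that cache hit rate alerts when the value falls below its threshold while the others alert when it rises above.

```javascript
// Alert thresholds from the table above. `direction` says which side is bad.
const THRESHOLDS = {
  heapUsagePct: { warning: 70, critical: 85, direction: "above" },
  poolUtilizationPct: { warning: 80, critical: 95, direction: "above" },
  queryP95Seconds: { warning: 5, critical: 15, direction: "above" },
  cacheHitRatePct: { warning: 60, critical: 40, direction: "below" },
  appDbConnectionsPct: { warning: 80, critical: 95, direction: "above" },
};

// Returns "ok", "warning", or "critical" for a metric reading.
function classifyMetric(metric, value) {
  const rule = THRESHOLDS[metric];
  if (!rule) throw new Error(`Unknown metric: ${metric}`);
  const breached = (limit) =>
    rule.direction === "above" ? value > limit : value < limit;
  if (breached(rule.critical)) return "critical";
  if (breached(rule.warning)) return "warning";
  return "ok";
}

console.log(classifyMetric("heapUsagePct", 72), classifyMetric("cacheHitRatePct", 35));
```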

---

Load Testing Before Launch

Before exposing embedded analytics to a large user base, load test with realistic traffic patterns:

javascript

    // k6 load test: simulate embedded dashboard loads
    import http from 'k6/http';
    import { sleep } from 'k6';

    export const options = {
      stages: [
        { duration: '2m', target: 100 },   // ramp to 100 users
        { duration: '5m', target: 500 },   // ramp to 500
        { duration: '10m', target: 1000 }, // ramp to 1,000
        { duration: '2m', target: 0 },     // ramp down
      ],
      thresholds: {
        http_req_duration: ['p(95)<3000'], // 95% of requests under 3s
        http_req_failed: ['rate<0.01'],    // < 1% error rate
      },
    };

    export default function () {
      // Simulate fetching an embed URL from your backend
      const orgId = Math.floor(Math.random() * 1000) + 1;

      const embedResponse = http.get(
        `https://your-api.com/analytics/embed-url?orgId=${orgId}`,
        { headers: { Authorization: `Bearer test-token-${orgId}` } }
      );

      // Simulate the iframe loading the Metabase dashboard
      if (embedResponse.status === 200) {
        const { embedUrl } = embedResponse.json();
        http.get(embedUrl);
      }

      sleep(Math.random() * 5 + 1); // 1-6 second think time between requests
    }

Run load tests against a staging environment with production-representative data volumes. The results will identify whether the bottleneck is Metabase's application tier, the database connection pool, or the analytics database itself.
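When sizing the staging environment, it helps to know roughly what request rate a given number of virtual users produces. The sketch below applies Little's law to the script above: each iteration makes two requests and then sleeps for 1 to 6 seconds (3.5 seconds on average). The function name and the assumed response time are illustrative, not measured values.

```javascript
// Approximate steady-state request rate for a closed-loop load test.
// Each virtual user repeats: N requests (each taking avgResponseSeconds), then a think pause.
function estimateRequestsPerSecond(
  virtualUsers,
  requestsPerIteration,
  avgThinkSeconds,
  avgResponseSeconds
) {
  const iterationSeconds = avgThinkSeconds + requestsPerIteration * avgResponseSeconds;
  return (virtualUsers * requestsPerIteration) / iterationSeconds;
}

// 1,000 users, 2 requests per iteration, 3.5 s mean think time, 0.25 s assumed response time
console.log(estimateRequestsPerSecond(1000, 2, 3.5, 0.25), "requests per second");
```

If responses slow down under load, the achieved rate drops, so comparing the estimate with the rate k6 reports is a quick sign of saturation.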

---

Summary

Scaling Metabase to 1,000+ users is achievable with a single instance by combining aggressive caching (the highest-leverage intervention), vertical scaling of the Metabase container (4–8 vCPU, 8–16GB RAM), read replicas for analytics databases, and connection pooling via PgBouncer. Self-managed horizontal scaling adds operational overhead, so teams exceeding ~2,000 concurrent users should evaluate Metabase Cloud. Monitor heap usage, database connection pool saturation, query p95 latency, and cache hit rate. Load test before launch with realistic traffic patterns, including multi-tenant parameter variation, to identify bottlenecks before they affect real users.