Do you need a data warehouse? A dev's guide to the honest answer
Do you need a data warehouse? Understand when traditional databases stop scaling and when a warehouse becomes essential for analytics.
Do you need a data warehouse? A dev's guide to the honest answer
Quick answers
| Question | Answer |
|---|---|
| What is a data warehouse? | A separate database optimized for analytical queries rather than transactional workloads. Data is copied from your production systems and transformed before being loaded. |
| Do I need one right now? | Probably not yet. Most teams can query their production database directly with Metabase and get useful, current answers. A warehouse becomes necessary when analytical queries are affecting production performance, or when you need to regularly join data from many different sources. |
| What's the difference between a data warehouse and a production database? | Production databases are optimized for fast reads and writes on individual records — transactional workloads. Warehouses are optimized for aggregations across large datasets and complex joins — analytical workloads. Different engines, different query patterns. |
| When does Metabase work without a warehouse? | Always, for the right use cases. Metabase connects directly to Postgres, MySQL, and most production databases. If your analytical queries are fast enough and aren't affecting production performance, you don't need a warehouse before you get value. |
| What are the popular data warehouses? | BigQuery (Google), Snowflake, Redshift (AWS), and Databricks are the main cloud options. All connect directly to Metabase. |
| What's ELT vs ETL? | ETL transforms data before loading it. ELT loads raw data first, then transforms it inside the warehouse — typically with dbt. ELT is the modern default. Metabase's ETL/ELT guide covers the full tradeoffs. |
"Do we need a data warehouse?" is one of the most common questions growing data teams ask — and the answer is almost always "not yet, and maybe never."
Here's how to think through it without cargo-culting the modern data stack.
What a data warehouse actually does
A data warehouse is a separate database built for analytical queries: aggregations across large datasets, joins across multiple sources, historical trend analysis. Unlike a production database — which is optimized for fast reads and writes on individual records — a warehouse is designed around different query patterns entirely.
The classic architecture: data flows from your production systems into a warehouse via ETL or ELT, gets transformed into analytics-friendly structures, and then gets queried by your BI tool.
The question is whether you actually need that middle layer.
When you don't need a warehouse yet
You can skip a warehouse and query your production database directly with Metabase if:
Your analytical queries don't affect production performance. The real concern isn't query speed — it's whether analytical queries are competing with user-facing queries for database resources. If they're not, you're fine.
You're mostly querying one database. A small SaaS with simple data: users table, subscriptions table, usage events in the same DB. You don't need to split services; you can query everything directly.
You're still figuring out what questions to ask. Building a warehouse before you know what metrics matter is premature. The patterns that actually need a warehouse only become clear once you've been querying your data for a while.
Metabase connects directly to your production database — Postgres, MySQL, and 20+ others — and queries it in real time. No warehouse required to get started.
When a warehouse starts making sense
Analytical queries are affecting your production app. Long-running aggregations that scan large tables can compete with user-facing queries for resources. A read replica solves this at first; a warehouse is the right answer when you outgrow that.
You need to join data from multiple systems regularly. Product DB + Stripe + Salesforce + marketing data is a sign. If you're writing one-off scripts to combine data from different sources before you can answer a question, a warehouse gives you one place to load and join them. Metabase's SQL editor queries the warehouse directly once it's in place.
You need historical snapshots. Production databases store the current state of records. If you need to answer "what did our churn look like six months ago?", you need a system that records history — which is a core warehouse pattern.
Query performance is genuinely bottlenecked at scale. Cloud data warehouses separate storage and compute, letting you scale each independently. For very large datasets and complex aggregations, that architecture wins. But this is a scale problem, not a day-one problem.
What a read replica gets you first
Before committing to a full warehouse, a read replica of your production database is often the right intermediate step:
- Analytical queries go to the replica, not production
This buys you significant headroom without the overhead of a full ELT pipeline.
The practical path
For the data transformation layer once you have a warehouse, Metabase's ETL/ELT guide covers the landscape. For the full picture of how the data stack fits together, the data landscape guide is a good foundation.
Key takeaways
| Point | Detail |
|---|---|
| Don't add a warehouse before you need it | Query production directly first; the constraints become clear over time |
| The real signal is production load, not data size | Analytical queries competing with user queries is the trigger, not row count |
| A read replica is the right intermediate step | Same data, no transformation overhead, analytical load goes elsewhere |
| Metabase works with or without a warehouse | Connects to Postgres, MySQL, BigQuery, Redshift, Snowflake, and 20+ others the same way |
| Add a warehouse for cross-source joins, historical snapshots, or scale | Not as a default — as a solution to a specific problem you've hit |