What is a Data Warehouse?
A data warehouse is a centralized repository designed for query and analysis rather than transaction processing. It stores historical data from various sources in a structured format optimized for analytical queries.
Modern cloud data warehouses like Snowflake, BigQuery, and Redshift have revolutionized analytics by separating storage and compute, enabling massive scale and cost efficiency.
Dimensional Modeling
Dimensional modeling organizes data into facts (measurements) and dimensions (context):
- Fact Tables: Contain measurable, quantitative data (sales, clicks, transactions)
- Dimension Tables: Contain descriptive attributes (customers, products, dates)
- Star Schema: Fact table in center, dimension tables around it
- Snowflake Schema: Normalized dimensions with sub-dimensions
Star Schema Design
-- Fact Table: Sales
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
customer_key INT REFERENCES dim_customer(customer_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(12,2),
discount_amount DECIMAL(10,2)
);
-- Dimension Table: Customer
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
segment VARCHAR(50),
city VARCHAR(100),
state VARCHAR(100),
country VARCHAR(100),
-- SCD Type 2 fields
effective_date DATE,
end_date DATE,
is_current BOOLEAN
);
-- Dimension Table: Date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE,
day_of_week INT,
day_name VARCHAR(10),
month INT,
month_name VARCHAR(10),
quarter INT,
year INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
Slowly Changing Dimensions (SCD)
-- SCD Type 1: Overwrite (lose history)
UPDATE dim_customer
SET email = 'new@email.com'
WHERE customer_id = 'C001';
-- SCD Type 2: Add new row (preserve history)
-- First, close the current record
UPDATE dim_customer
SET end_date = CURRENT_DATE - 1,
is_current = FALSE
WHERE customer_id = 'C001'
AND is_current = TRUE;
-- Then insert new record
INSERT INTO dim_customer (
customer_key, customer_id, first_name, last_name,
email, segment, effective_date, end_date, is_current
) VALUES (
NEW_KEY, 'C001', 'John', 'Doe',
'new@email.com', 'Premium',
CURRENT_DATE, '9999-12-31', TRUE
);
-- SCD Type 3: Add column for previous value
ALTER TABLE dim_customer
ADD COLUMN previous_segment VARCHAR(50);
UPDATE dim_customer
SET previous_segment = segment,
segment = 'Premium'
WHERE customer_id = 'C001';
Modern Cloud Warehouses
-- Snowflake Example
-- Create warehouse (compute)
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Create database and schema
CREATE DATABASE analytics;
CREATE SCHEMA analytics.sales;
-- Create table with clustering
CREATE TABLE analytics.sales.fact_orders (
order_id NUMBER,
order_date DATE,
customer_id NUMBER,
total_amount NUMBER(12,2)
)
CLUSTER BY (order_date);
-- Load data from S3
COPY INTO analytics.sales.fact_orders
FROM 's3://bucket/orders/'
FILE_FORMAT = (TYPE = PARQUET);
-- Time travel query
SELECT * FROM analytics.sales.fact_orders
AT(TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP);
BigQuery Example
-- Create dataset
CREATE SCHEMA `project.analytics`;
-- Create partitioned table
CREATE TABLE `project.analytics.fact_events`
(
event_id STRING,
event_timestamp TIMESTAMP,
user_id STRING,
event_type STRING,
properties JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type;
-- Query with partition filter
SELECT
event_type,
COUNT(*) as event_count
FROM `project.analytics.fact_events`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_type;
-- Create materialized view
CREATE MATERIALIZED VIEW `project.analytics.daily_events`
AS
SELECT
DATE(event_timestamp) as event_date,
event_type,
COUNT(*) as event_count
FROM `project.analytics.fact_events`
GROUP BY 1, 2;
ETL vs ELT
- ETL (Extract, Transform, Load): Transform data before loading into warehouse. Used with traditional warehouses where compute is limited.
- ELT (Extract, Load, Transform): Load raw data first, transform using warehouse compute. Preferred with modern cloud warehouses.
-- ELT Pattern with dbt
-- 1. Load raw data (using Fivetran, Airbyte, etc.)
-- 2. Transform in warehouse
-- models/staging/stg_orders.sql
SELECT
id as order_id,
created_at as order_date,
customer_id,
CAST(total_cents AS DECIMAL) / 100 as total_amount
FROM {{ source('raw', 'orders') }}
-- models/marts/fct_orders.sql
SELECT
o.order_id,
o.order_date,
c.customer_name,
o.total_amount,
SUM(o.total_amount) OVER (
PARTITION BY o.customer_id
ORDER BY o.order_date
) as cumulative_spend
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
Performance Optimization
- Partitioning: Divide tables by date or category for faster queries
- Clustering: Co-locate related data for better scan performance
- Materialized Views: Pre-compute expensive aggregations
- Column Statistics: Enable query optimizer with accurate stats
- Result Caching: Cache query results for repeated queries
- Avoid SELECT *: Query only needed columns
Best Practices
- Design for analytics: Denormalize for query performance
- Use surrogate keys: Don't rely on natural keys from source systems
- Track history: Implement SCD Type 2 for important dimensions
- Document everything: Maintain data dictionary and lineage
- Test data quality: Validate data at every stage
- Monitor costs: Track query costs in cloud warehouses
Master Data Warehousing with Expert Mentorship
Our Data Engineering program covers dimensional modeling and modern cloud warehouses. Build analytics platforms with guidance from industry experts.
Explore Data Engineering Program