Apache Iceberg: Data Lake Table Format

What is Apache Iceberg?

Apache Iceberg is an open table format for huge analytic datasets. Originally developed at Netflix, Iceberg provides reliable tables with ACID transactions, schema evolution, and time travel capabilities.

Unlike Hive tables that depend on directories and file listings, Iceberg tracks individual files and their metadata, enabling much faster query planning and reliable transactions.

Iceberg vs Delta Lake vs Hudi

┌─────────────────────────────────────────────────────────────────┐
│              Table Format Comparison                             │
├─────────────────────────────────────────────────────────────────┤
│ Feature           │ Iceberg  │ Delta Lake │ Hudi     │ Hive     │
├───────────────────┼──────────┼────────────┼──────────┼──────────┤
│ ACID Transactions │    ✓     │     ✓      │    ✓     │    ✗     │
│ Time Travel       │    ✓     │     ✓      │    ✓     │    ✗     │
│ Schema Evolution  │    ✓     │     ✓      │    ✓     │  Limited │
│ Hidden Partition  │    ✓     │     ✗      │    ✗     │    ✗     │
│ Engine Agnostic   │    ✓     │   Limited  │    ✓     │    ✓     │
│ Row-Level Deletes │    ✓     │     ✓      │    ✓     │    ✗     │
│ Concurrent Writes │    ✓     │     ✓      │    ✓     │    ✗     │
├───────────────────┴──────────┴────────────┴──────────┴──────────┤
│                                                                  │
│ Iceberg strengths:                                               │
│ • Truly open format (works with Spark, Flink, Trino, Dremio)    │
│ • Hidden partitioning (no partition columns in queries)         │
│ • Partition evolution without rewriting data                     │
│ • Excellent for very large tables (petabyte scale)              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Getting Started with Spark

from pyspark.sql import SparkSession

# Configure Spark with Iceberg
spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.0") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.demo.type", "hadoop") \
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse") \
    .getOrCreate()

# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS demo.analytics")

# Create Iceberg table
spark.sql("""
    CREATE TABLE demo.analytics.events (
        event_id STRING,
        user_id STRING,
        event_type STRING,
        event_time TIMESTAMP,
        properties MAP
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Insert data
df = spark.createDataFrame([
    ("e1", "u1", "click", "2024-12-24 10:00:00", {"page": "home"}),
    ("e2", "u2", "purchase", "2024-12-24 11:00:00", {"amount": "99.99"})
], ["event_id", "user_id", "event_type", "event_time", "properties"])

df.writeTo("demo.analytics.events").append()

Hidden Partitioning

Iceberg's killer feature - partition values are derived from data, not exposed to users.

-- Create table with hidden partitioning
CREATE TABLE demo.analytics.logs (
    log_id STRING,
    message STRING,
    log_time TIMESTAMP,
    level STRING
)
USING iceberg
PARTITIONED BY (
    days(log_time),      -- Partition by day
    bucket(16, log_id)   -- Hash bucket for even distribution
)

-- Query WITHOUT specifying partition columns!
-- Iceberg figures out partitions automatically
SELECT * FROM demo.analytics.logs
WHERE log_time BETWEEN '2024-12-01' AND '2024-12-31'
  AND level = 'ERROR'

-- Benefits:
-- 1. Users don't need to know partition scheme
-- 2. Partition scheme can change without breaking queries
-- 3. No partition columns cluttering results

-- Partition transforms available:
-- years(ts)      → Extract year
-- months(ts)     → Extract year-month
-- days(ts)       → Extract date
-- hours(ts)      → Extract hour
-- bucket(n, col) → Hash into n buckets
-- truncate(w, s) → Truncate string to width w

Schema Evolution

-- Add new column
ALTER TABLE demo.analytics.events
ADD COLUMN browser STRING

-- Rename column
ALTER TABLE demo.analytics.events
RENAME COLUMN properties TO metadata

-- Change column type (safe widening only)
ALTER TABLE demo.analytics.events
ALTER COLUMN user_id TYPE BIGINT

-- Drop column
ALTER TABLE demo.analytics.events
DROP COLUMN browser

-- Reorder columns
ALTER TABLE demo.analytics.events
ALTER COLUMN event_type AFTER user_id

-- Make column required/optional
ALTER TABLE demo.analytics.events
ALTER COLUMN user_id SET NOT NULL

ALTER TABLE demo.analytics.events
ALTER COLUMN user_id DROP NOT NULL

-- Schema evolution is tracked in metadata
-- Old data files are NOT rewritten

Time Travel

# View table history
spark.sql("SELECT * FROM demo.analytics.events.history").show()
# +------------------+-------------------+---------+
# |     made_current |     snapshot_id   | parent_id|
# +------------------+-------------------+---------+
# | 2024-12-24 10:00 | 1234567890123456  |   null  |
# | 2024-12-24 12:00 | 1234567890123457  |  ...    |
# +------------------+-------------------+---------+

# Read at specific snapshot
spark.sql("""
    SELECT * FROM demo.analytics.events
    VERSION AS OF 1234567890123456
""")

# Read at specific timestamp
spark.sql("""
    SELECT * FROM demo.analytics.events
    TIMESTAMP AS OF '2024-12-24 10:00:00'
""")

# Rollback to previous version
spark.sql("""
    CALL demo.system.rollback_to_snapshot('analytics.events', 1234567890123456)
""")

# Cherry-pick changes from a snapshot
spark.sql("""
    CALL demo.system.cherrypick_snapshot('analytics.events', 1234567890123457)
""")

Data Maintenance

# Expire old snapshots (keep last 5 days)
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2024-12-19 00:00:00',
        retain_last => 10
    )
""")

# Remove orphan files (not referenced by any snapshot)
spark.sql("""
    CALL demo.system.remove_orphan_files(
        table => 'analytics.events',
        older_than => TIMESTAMP '2024-12-20 00:00:00'
    )
""")

# Compact small files
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '134217728')  -- 128MB
    )
""")

# Rewrite manifests (improve query planning)
spark.sql("""
    CALL demo.system.rewrite_manifests('analytics.events')
""")

Row-Level Updates & Deletes

-- Delete rows
DELETE FROM demo.analytics.events
WHERE event_type = 'test'
  AND event_time < '2024-01-01'

-- Update rows
UPDATE demo.analytics.events
SET event_type = 'page_view'
WHERE event_type = 'view'

-- Merge (upsert)
MERGE INTO demo.analytics.events t
USING staging.new_events s
ON t.event_id = s.event_id
WHEN MATCHED THEN
    UPDATE SET t.properties = s.properties
WHEN NOT MATCHED THEN
    INSERT *

-- Iceberg uses two deletion strategies:
-- 1. Copy-on-write (COW): Rewrite affected files
-- 2. Merge-on-read (MOR): Write delete files, merge at read time

-- Configure write mode
ALTER TABLE demo.analytics.events
SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read'
)

Partition Evolution

-- Start with monthly partitioning
CREATE TABLE demo.analytics.sales (
    sale_id STRING,
    amount DECIMAL(10,2),
    sale_time TIMESTAMP
) USING iceberg
PARTITIONED BY (months(sale_time))

-- Later, evolve to daily partitioning
-- NO data rewrite needed!
ALTER TABLE demo.analytics.sales
ADD PARTITION FIELD days(sale_time)

-- Remove old partition scheme
ALTER TABLE demo.analytics.sales
DROP PARTITION FIELD months(sale_time)

-- Iceberg handles both partition schemes:
-- - Old files: queried using monthly partition
-- - New files: written with daily partition
-- - Queries work transparently on both

Best Practices

Use hidden partitioning: Let Iceberg manage partitions
Right-size files: Target 128MB-512MB per file
Compact regularly: Run rewrite_data_files to merge small files
Expire snapshots: Clean up old metadata to save storage
Use MOR for frequent updates: Merge-on-read for write-heavy tables
Partition by time: days(ts) or hours(ts) for time-series data

Master Apache Iceberg

Our Data Engineering program covers modern table formats and lakehouse architecture.

Explore Data Engineering Program

Apache Iceberg