Most analysts reach for pd.read_csv() out of habit. Load the file, then query it. DuckDB flips that — you write SQL against the file itself, and DuckDB handles the rest. For large files, this isn't just convenient. It's meaningfully faster, and it uses a fraction of the memory.

DuckDB is an in-process analytical database. There's nothing to install beyond the Python package, no server, no config. It runs inside your Colab session and disappears when the notebook closes.


Step 1 — Install and connect. One install, two imports:

!pip install duckdb

import duckdb
import pandas as pd

# Connect to an in-memory DuckDB instance
con = duckdb.connect()

Step 2 — Create a sample CSV to query. In Colab we write the file to /tmp, then query it directly:

# Build a sample sales CSV and write it to disk
data = {
    'date':    ['2026-01-03','2026-01-07','2026-01-12','2026-01-18','2026-01-22',
                '2026-02-01','2026-02-09','2026-02-14','2026-02-20','2026-03-05',
                '2026-03-11','2026-03-18'],
    'region':  ['East','West','North','East','West','North','East','West','North','East','West','North'],
    'product': ['Pro','Starter','Pro','Starter','Pro','Pro','Starter','Pro','Starter','Pro','Starter','Pro'],
    'revenue': [1200,450,1800,390,1650,1100,420,1750,380,1350,470,1500],
    'units':   [1,2,1,2,1,1,2,1,2,1,2,1]
}

df = pd.DataFrame(data)
df.to_csv('/tmp/sales.csv', index=False)
print(f"Wrote {len(df)} rows to /tmp/sales.csv")

Step 3 — Query the file directly. The path goes straight into SQL — no read_csv(), no DataFrame first:

# Query the CSV file directly — DuckDB reads it on the fly
result = con.execute("""
  SELECT *
  FROM '/tmp/sales.csv'
  ORDER BY date
""").df()
print(result)

# Aggregate by region without loading into pandas first
result = con.execute("""
  SELECT
    region,
    COUNT(*)          AS orders,
    SUM(revenue)      AS total_revenue,
    ROUND(AVG(revenue), 0) AS avg_order
  FROM '/tmp/sales.csv'
  GROUP BY region
  ORDER BY total_revenue DESC
""").df()
print(result)

The .df() at the end converts the DuckDB result to a pandas DataFrame — handy when you need the result for further Python work. If you don't need pandas, call .fetchall() instead and get a list of plain Python tuples.


Step 4 — Use window functions and filters, same as any SQL. DuckDB supports a full-featured SQL dialect, including CTEs, window functions, and QUALIFY:

# Rank products by revenue within each region
result = con.execute("""
  SELECT
    region,
    product,
    SUM(revenue) AS revenue,
    RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rank
  FROM '/tmp/sales.csv'
  GROUP BY region, product
  QUALIFY rank = 1
""").df()
print(result)
con.close()

QUALIFY filters on window function results directly — no subquery needed. It originated in Teradata and appears in Snowflake and BigQuery too; DuckDB adopted it as one of several conveniences that make its SQL noticeably cleaner than SQLite's or plain standard SQL.

The real payoff comes when your CSV has a million rows. DuckDB streams the file in chunks and materializes only what the query needs — the full file never has to sit in memory as a DataFrame. pandas parses and holds everything first. At scale, that difference is not small.


Want to go deeper? Browse our full resource library →