Data dictionaries are one of those things everyone agrees are valuable and nobody has time to write. A column named cust_acq_src_cd means something specific to the person who built the table and nothing to anyone else six months later. The fix should take five minutes, but across a table with 40 columns it takes an afternoon — so it doesn't get done.

This project uses Claude to generate column descriptions from schema information alone: column names, data types, and a few sample values. It won't be perfect, but it will be 80% there in seconds, and 80% is enough to hand off or publish.


Setup — get your Anthropic API key. Anthropic offers free credits when you sign up at console.anthropic.com. Store your key in Colab Secrets:

# In Colab: open the Secrets panel (🔑 icon in the left sidebar)
# Add your key with the name ANTHROPIC_API_KEY, then enable notebook access.

!pip install anthropic

from google.colab import userdata
import anthropic
import pandas as pd

client = anthropic.Anthropic(api_key=userdata.get('ANTHROPIC_API_KEY'))
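If you ever run this notebook outside Colab (a local Jupyter kernel or a plain script), `google.colab` won't be importable. A small fallback to an environment variable keeps the same code portable; this is a sketch, and the environment-variable name `ANTHROPIC_API_KEY` is just the conventional choice:

```python
import os

def get_api_key(name='ANTHROPIC_API_KEY'):
    """Read the key from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        return os.environ.get(name)
```

Then `client = anthropic.Anthropic(api_key=get_api_key())` works in both places.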

Step 1 — Create a sample dataset. Realistic column names are the whole point — use names that actually need explaining:

# A realistic orders table with non-obvious column names
data = {
    'ord_id':         [1001, 1002, 1003, 1004, 1005],
    'cust_acq_src':   ['organic', 'paid_search', 'referral', 'organic', 'email'],
    'ord_status_cd':  ['COMP', 'PEND', 'COMP', 'RFND', 'COMP'],
    'gmv_usd':        [149.99, 89.00, 214.50, 49.95, 320.00],
    'disc_pct':       [0.0, 0.10, 0.0, 0.15, 0.05],
    'is_first_ord':   [True, False, True, True, False],
    'ord_ts':         ['2026-01-03 14:22','2026-01-04 09:11','2026-01-05 16:44',
                       '2026-01-06 11:30','2026-01-07 08:55'],
    'ship_region_cd': ['NE', 'SW', 'MW', 'NE', 'SE']
}

df = pd.DataFrame(data)
print(df.head())
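One optional refinement before building the prompt: ord_ts arrives as strings, so pandas stores it with dtype object, which tells Claude nothing. Converting it to datetime lets the dtype itself carry signal. A minimal sketch on a separate two-row frame; on the tutorial's df the same line would be df['ord_ts'] = pd.to_datetime(df['ord_ts']):

```python
import pandas as pd

# Timestamps stored as strings show up as 'object' in the schema
df_ts = pd.DataFrame({'ord_ts': ['2026-01-03 14:22', '2026-01-04 09:11']})
df_ts['ord_ts'] = pd.to_datetime(df_ts['ord_ts'])
print(df_ts.dtypes)
```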

Step 2 — Build the schema prompt. For each column, pass the name, type, and a few sample values. That's enough for Claude to infer meaning:

# Build a schema summary for each column
schema_lines = []
for col in df.columns:
    dtype = str(df[col].dtype)
    samples = df[col].dropna().head(3).tolist()
    schema_lines.append(f"- {col} ({dtype}): sample values = {samples}")

schema_text = "\n".join(schema_lines)
print(schema_text)
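On real tables, sample values can be long strings or mostly duplicates. A slightly more defensive version of the loop above uses distinct values and truncates each string sample so one wide column can't bloat the prompt; the sample count and truncation length here are arbitrary choices:

```python
import pandas as pd

def build_schema_text(df, n_samples=3, max_len=40):
    """Summarize each column as name, dtype, and a few distinct sample values."""
    lines = []
    for col in df.columns:
        dtype = str(df[col].dtype)
        # Distinct values give Claude more signal than repeated ones
        samples = df[col].dropna().unique()[:n_samples].tolist()
        # Truncate long strings to keep the prompt compact
        samples = [s[:max_len] if isinstance(s, str) else s for s in samples]
        lines.append(f"- {col} ({dtype}): sample values = {samples}")
    return "\n".join(lines)
```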

Step 3 — Send to Claude and get the dictionary. A clear role framing and a structured ask produce structured output:

# Ask Claude to write a data dictionary
message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"""You are a senior data engineer writing documentation for a data team.

Below is a table schema with column names, data types, and sample values.
Write a data dictionary: for each column, provide a one-sentence plain-English description
of what it most likely represents. Be specific and avoid generic descriptions.

Schema:
{schema_text}

Return a markdown table with columns: Column | Type | Description"""
    }]
)

print(message.content[0].text)

The model will infer that gmv_usd is Gross Merchandise Value in US dollars, that ord_status_cd contains status codes, and that is_first_ord flags a customer's first order — all without being told. That's the useful part.
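If you want to use the result programmatically, say in a catalog tool or to merge with existing docs, the markdown table can be parsed back into a DataFrame. This sketch assumes Claude followed the Column | Type | Description format requested in the prompt; real responses can include prose around the table, so it filters to lines that look like table rows:

```python
import pandas as pd

def parse_markdown_table(text):
    """Parse a pipe-delimited markdown table into a DataFrame."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith('|'):
            continue  # skip prose around the table
        cells = [c.strip() for c in line.strip('|').split('|')]
        if all(set(c) <= set('-: ') for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(cells)
    return pd.DataFrame(rows[1:], columns=rows[0])
```

Usage: `dict_df = parse_markdown_table(message.content[0].text)`.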


Step 4 — Export the result. Capture the markdown and save it alongside your data:

# Save the dictionary to a text file
with open('/tmp/data_dictionary.md', 'w') as f:
    f.write(message.content[0].text)

print("Saved to /tmp/data_dictionary.md")

The value here isn't that Claude is always right. It's that a draft you can correct in five minutes is infinitely more useful than a blank document no one will ever fill in.

