Data dictionaries are one of those things everyone agrees are valuable and nobody has time to write. A column named cust_acq_src_cd means something specific to the person who built the table and nothing to anyone else six months later. The fix should take five minutes, but across a table with 40 columns it takes an afternoon — so it doesn't get done.
This project uses Claude to generate column descriptions from schema information alone: column names, data types, and a few sample values. It won't be perfect, but it will be 80% there in seconds, and 80% is enough to hand off or publish.
Setup — get your Anthropic API key. Anthropic offers free credits when you sign up at console.anthropic.com. Store your key in Colab Secrets:
# In Colab: open the Secrets panel (🔑 icon in the left sidebar)
# Add your key with the name ANTHROPIC_API_KEY, then enable notebook access.
!pip install anthropic
from google.colab import userdata
import anthropic
import pandas as pd
client = anthropic.Anthropic(api_key=userdata.get('ANTHROPIC_API_KEY'))
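If you're running outside Colab, `google.colab.userdata` won't be available. A minimal fallback is to read the key from an environment variable — a sketch, where `get_api_key` is a hypothetical helper (not part of any SDK):

```python
import os

def get_api_key() -> str:
    # Hypothetical helper: read the key from the environment outside Colab.
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("Set ANTHROPIC_API_KEY before creating the client")
    return key
```

Then `client = anthropic.Anthropic(api_key=get_api_key())` works the same way in any environment.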
Step 1 — Create a sample dataset. Realistic column names are the whole point — use names that actually need explaining:
# A realistic orders table with non-obvious column names
data = {
    'ord_id': [1001, 1002, 1003, 1004, 1005],
    'cust_acq_src': ['organic', 'paid_search', 'referral', 'organic', 'email'],
    'ord_status_cd': ['COMP', 'PEND', 'COMP', 'RFND', 'COMP'],
    'gmv_usd': [149.99, 89.00, 214.50, 49.95, 320.00],
    'disc_pct': [0.0, 0.10, 0.0, 0.15, 0.05],
    'is_first_ord': [True, False, True, True, False],
    'ord_ts': ['2026-01-03 14:22', '2026-01-04 09:11', '2026-01-05 16:44',
               '2026-01-06 11:30', '2026-01-07 08:55'],
    'ship_region_cd': ['NE', 'SW', 'MW', 'NE', 'SE']
}
df = pd.DataFrame(data)
print(df.head())
Step 2 — Build the schema prompt. For each column, pass the name, type, and a few sample values. That's enough for Claude to infer meaning:
# Build a schema summary for each column
schema_lines = []
for col in df.columns:
    dtype = str(df[col].dtype)
    samples = df[col].dropna().head(3).tolist()
    schema_lines.append(f"- {col} ({dtype}): sample values = {samples}")
schema_text = "\n".join(schema_lines)
print(schema_text)
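On a real table the same loop can bloat the prompt: hundreds of columns, long strings, high-cardinality values. A sketch of a reusable version that caps both the sample count and the string length (`schema_summary` is a name introduced here, not from the code above):

```python
import pandas as pd

def schema_summary(df: pd.DataFrame, n_samples: int = 3, max_chars: int = 40) -> str:
    """Build a compact, token-friendly schema description of a DataFrame."""
    lines = []
    for col in df.columns:
        # unique() deduplicates before sampling; truncation keeps long strings cheap
        samples = [str(v)[:max_chars] for v in df[col].dropna().unique()[:n_samples]]
        lines.append(f"- {col} ({df[col].dtype}): sample values = {samples}")
    return "\n".join(lines)
```

Deduplicating with `unique()` also gives the model more distinct examples per column than `head()` would on repetitive data.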
Step 3 — Send the schema to Claude and get the dictionary. A clear role instruction and a structured ask produce structured output:
# Ask Claude to write a data dictionary
message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"""You are a senior data engineer writing documentation for a data team.
Below is a table schema with column names, data types, and sample values.
Write a data dictionary: for each column, provide a one-sentence plain-English description
of what it most likely represents. Be specific and avoid generic descriptions.

Schema:
{schema_text}

Return a markdown table with columns: Column | Type | Description"""
    }]
)
print(message.content[0].text)
The model will infer that gmv_usd is Gross Merchandise Value in US dollars, that ord_status_cd contains status codes, and that is_first_ord flags a customer's first order — all without being told. That's the useful part.
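Because the prompt pins down a `Column | Type | Description` layout, the reply can also be parsed back into structured rows for programmatic use. A sketch, assuming Claude followed the requested format (`parse_md_table` is a hypothetical helper):

```python
def parse_md_table(md: str) -> list[dict]:
    """Parse a simple markdown table into a list of row dicts."""
    rows = []
    header = None
    for line in md.splitlines():
        if "|" not in line:
            continue  # skip prose before/after the table
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells
        elif set(cells[0]) <= set("-: "):
            continue  # separator row like |---|---|
        else:
            rows.append(dict(zip(header, cells)))
    return rows
```

From there, `pd.DataFrame(parse_md_table(message.content[0].text))` gives you the dictionary as a DataFrame you can filter, sort, or join.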
Step 4 — Export the result. Capture the markdown and save it alongside your data:
# Save the dictionary to a text file
with open('/tmp/data_dictionary.md', 'w') as f:
    f.write(message.content[0].text)
print("Saved to /tmp/data_dictionary.md")
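One way to keep descriptions with the data in-session is pandas' `DataFrame.attrs` metadata dict. A sketch with hypothetical descriptions (in practice, use the ones Claude returned) on a tiny stand-in frame:

```python
import pandas as pd

# Stand-in frame; in the notebook you'd attach this to the existing `df`.
orders_df = pd.DataFrame({"ord_id": [1001], "gmv_usd": [149.99]})

# attrs is a free-form metadata dict that travels with the DataFrame object
# in-session only — it is not preserved by CSV round-trips.
orders_df.attrs["data_dictionary"] = {
    "ord_id": "Unique order identifier",          # hypothetical description
    "gmv_usd": "Gross merchandise value in USD",  # hypothetical description
}
print(orders_df.attrs["data_dictionary"]["gmv_usd"])
```

Anyone inspecting the frame later can look up a column's meaning without leaving the notebook.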
The value here isn't that Claude is always right. It's that a draft you can correct in five minutes is infinitely more useful than a blank document no one will ever fill in.
Want to go deeper? Browse our full resource library →