Build a Cohort Retention Table

A cohort retention table answers a simple question: of the customers who first bought in January, what percentage came back in February? In March? Six months later? It's the clearest single view of whether your product or service creates genuine habit. Every serious analyst eventually builds one.

The logic isn't complicated, but it takes a few specific pandas moves that aren't obvious. This project walks through each one: tagging each customer with their cohort month, computing how many months have elapsed since acquisition, then pivoting into the matrix format that makes the table readable.

Step 1 — Generate transaction data. Each row is a purchase: a customer ID, a date. That's all you need:

import pandas as pd
import numpy as np

# Simulate 6 months of transactions across 50 customers
np.random.seed(42)

rows = []
for cust_id in range(1, 51):
    # Each customer acquires in a random month
    acq_month = np.random.choice([0, 1, 2, 3], p=[0.35, 0.30, 0.20, 0.15])
    for month_offset in range(acq_month, 6):
        # Retention probability drops each month
        retain_prob = 0.85 ** (month_offset - acq_month)
        if np.random.random() < retain_prob:
            order_date = pd.Timestamp('2026-01-01') + pd.DateOffset(months=month_offset)
            rows.append({'customer_id': cust_id, 'order_date': order_date})

df = pd.DataFrame(rows)
df['order_date'] = pd.to_datetime(df['order_date'])
print(f"{len(df)} transactions, {df['customer_id'].nunique()} customers")
print(df.head(10))

Step 2 — Tag each customer with their cohort month. A cohort is defined by when a customer first appeared — their minimum order date:

# Truncate dates to month period for grouping
df['order_month'] = df['order_date'].dt.to_period('M')

# Find each customer's first-purchase month — that's their cohort
first_purchase = df.groupby('customer_id')['order_month'].min().rename('cohort_month')
df = df.merge(first_purchase, on='customer_id')

print(df[['customer_id', 'order_month', 'cohort_month']].head(10))

Step 3 — Compute months since acquisition. This becomes the column axis of the retention table:

# Months elapsed between cohort month and order month
df['months_since_acq'] = (
    df['order_month'].dt.to_timestamp() - df['cohort_month'].dt.to_timestamp()
).dt.days // 30

print(df[['customer_id', 'cohort_month', 'order_month', 'months_since_acq']].head(10))

Step 4 — Build the retention matrix. Count distinct customers per cohort per elapsed month, then divide by cohort size:

# Count unique customers per cohort × month_since_acq cell
cohort_data = df.groupby(['cohort_month', 'months_since_acq'])['customer_id'].nunique()
cohort_table = cohort_data.unstack('months_since_acq')

# Cohort sizes (month 0 = 100% by definition)
cohort_sizes = cohort_table[0]

# Convert to retention rates
retention = cohort_table.divide(cohort_sizes, axis=0).round(2)
retention.columns = [f'Month {c}' for c in retention.columns]

print("\nCohort Retention Table:")
print(retention.to_string())

Month 0 is always 1.0 — every customer is retained in their acquisition month by definition. The numbers to the right show what fraction came back in each subsequent month. A healthy SaaS product might hold 60–70% in Month 1. An e-commerce store might see 20–30%.

Step 5 — Format it cleanly for sharing. Percentages read better than decimals:

# Format as percentages for readability
retention_pct = (retention * 100).round(1).astype(str) + '%'
retention_pct = retention_pct.replace('nan%', '-')

print("\nFormatted Retention Table:")
print(retention_pct.to_string())

The diagonal shape of the table — shorter rows for recent cohorts — is normal. January has 6 months of history. June has one. You can't compare them on absolute depth, but you can compare Month 1 retention across all cohorts. That comparison is what tells you whether things are improving.

Want to go deeper? Browse our full resource library →