Real data is never clean. It arrives with inconsistent casing, missing values in unexpected places, duplicate rows from bad imports, and dates formatted three different ways in the same column. Cleaning it isn't glamorous, but it's where most analysis time actually goes. This project walks through the operations you'll reach for constantly.
No API key, no credentials. Pandas is pre-installed in Colab.
Start with a deliberately messy DataFrame — the kind you'd get from a CSV export that went through three different hands:
import pandas as pd
data = {
"customer_id": [1, 2, 2, 3, 4, 5, 5],
"name": ["Alice Chen", " bob smith", "BOB SMITH", "carol white", "Dave K.", None, "Eve M."],
"email": ["alice@co.com", "bob@co.com", "bob@co.com", "CAROL@CO.COM", "dave@co.com", "eve@co.com", "eve@co.com"],
"status": ["active", "Active", "active", None, "inactive", "ACTIVE", "active"],
"revenue": [1200, 450, 450, None, 890, 320, 320],
"joined": ["2024-01-05", "Jan 8 2024", "2024-01-08", "2024/02/10", "03-15-2024", "2024-03-22", "2024-03-22"],
}
df = pd.DataFrame(data)
print(f"Shape: {df.shape}")
print(df.dtypes)
Step 1 — Normalize text fields. Inconsistent casing and whitespace will silently break groupby operations and string comparisons:
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["status"] = df["status"].str.strip().str.lower()
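A quick way to see why this matters is to compare value counts before and after normalization. A minimal sketch on a hypothetical mini status column with the same kinds of variants:

```python
import pandas as pd

# Hypothetical mini column with casing/whitespace variants
status = pd.Series(["active", "Active", " ACTIVE ", "inactive"])

# Before cleaning, each variant is counted as a separate value
raw_counts = status.value_counts()  # four distinct "values"

# strip + lower collapses them into canonical categories
clean_counts = status.str.strip().str.lower().value_counts()
print(clean_counts.to_dict())  # {'active': 3, 'inactive': 1}
```

A groupby on the raw column would have produced four groups instead of two, silently splitting your aggregates.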
Step 2 — Handle missing values deliberately. Don't fill everything with the same value — decide per column:
df["status"] = df["status"].fillna("unknown")
df["revenue"] = df["revenue"].fillna(0)
df["name"] = df["name"].fillna("Unknown")
print(df.isnull().sum()) # verify nothing remains
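The three per-column fills above can also be expressed as one call by passing a column-to-value mapping to fillna, which keeps the whole fill policy readable in a single place. A sketch assuming the same column names on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Chen", None],
    "status": ["active", None],
    "revenue": [1200.0, None],
})

# One call, one place to read the fill policy for every column
df = df.fillna({"status": "unknown", "revenue": 0, "name": "Unknown"})
print(df.isnull().sum().sum())  # 0 — nothing left to fill
```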
Step 3 — Parse dates with mixed formats. Since pandas 2.0, pd.to_datetime infers a single format from the first value and rejects anything that doesn't match it, so a column with several formats needs format="mixed" to parse each element independently. errors="coerce" turns anything still unparseable into NaT instead of raising:
df["joined"] = pd.to_datetime(df["joined"], format="mixed", dayfirst=False, errors="coerce")
print(df[["name", "joined"]])
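Because errors="coerce" silently converts unparseable strings into NaT, it's worth auditing which raw values were lost before moving on. A minimal sketch (format="mixed" requires pandas 2.0+):

```python
import pandas as pd

raw = pd.Series(["2024-01-05", "Jan 8 2024", "not a date"])
parsed = pd.to_datetime(raw, format="mixed", dayfirst=False, errors="coerce")

# Rows where parsing failed: parsed is NaT but the raw string wasn't null
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())  # ['not a date']
```

If the failed list is non-empty, decide explicitly whether those rows get dropped, re-parsed with another format, or flagged for manual review.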
Step 4 — Remove duplicates. Exact duplicates are easy. Duplicates on a key column with conflicting other values require a decision:
# Drop fully identical rows
df = df.drop_duplicates()
# Keep first occurrence per customer_id (policy decision — document it)
df = df.drop_duplicates(subset="customer_id", keep="first")
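keep="first" is only one possible policy. Another common choice is keeping the most complete or highest-value record per key. A sketch that keeps the highest-revenue row for each customer_id, using hypothetical stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [5, 5, 2],
    "revenue": [320.0, 900.0, 450.0],
})

# Sort so the preferred row comes first, then keep="first" selects it
best = (
    df.sort_values("revenue", ascending=False)
      .drop_duplicates(subset="customer_id", keep="first")
      .sort_values("customer_id")
      .reset_index(drop=True)
)
print(best)  # one row per id: (2, 450.0) and (5, 900.0)
```

Whichever rule you pick, write it down: "keep first" depends on row order, which can change between exports.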
print(f"Final shape: {df.shape}")
Step 5 — Enforce types. Confirm numeric and date columns are the right dtype before any analysis:
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["customer_id"] = df["customer_id"].astype(int)
print(df.dtypes)
print(df)
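Beyond eyeballing the dtypes, a few cheap assertions fail fast if a column drifts back to object on a future re-import. A sketch assuming the three typed columns above:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2],
    "revenue": [1200.0, 450.0],
    "joined": pd.to_datetime(["2024-01-05", "2024-01-08"]),
})

# Fail loudly here rather than deep inside analysis code
assert pd.api.types.is_integer_dtype(df["customer_id"])
assert pd.api.types.is_numeric_dtype(df["revenue"])
assert pd.api.types.is_datetime64_any_dtype(df["joined"])
print("dtype checks passed")
```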
The decisions that matter aren't the code — they're what you fill nulls with, which duplicate you keep, and whether a coerced value becoming NaN is a problem or acceptable. Document those choices. They affect every analysis that runs on the cleaned data downstream.
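One lightweight way to document those choices is a plain dictionary kept next to the cleaning code, so a reviewer can read the policy without tracing every line. A hypothetical sketch (the keys and wording are illustrative, not a standard):

```python
# Hypothetical record of the policy decisions made during cleaning
CLEANING_DECISIONS = {
    "status_nulls": "filled with 'unknown' (not assumed active)",
    "revenue_nulls": "filled with 0 (treated as no revenue)",
    "duplicate_ids": "kept first occurrence per customer_id",
    "unparseable_dates": "coerced to NaT and reported, not dropped",
}

for field, policy in CLEANING_DECISIONS.items():
    print(f"{field}: {policy}")
```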