Mastering Data Cleaning in Python: A Real-World Beginner to Professional Guide

Data cleaning is one of the most important skills in Data Science. Many beginners think Data Science is only about Machine Learning and AI, but in practice, professional Data Scientists spend a large share of their time cleaning, structuring, and validating messy data.

In this blog, we will learn:

How to load JSON data
How to inspect messy datasets
How to clean real-world data
How to generate insights
How to save cleaned data
How professional Data Scientists think

Why Data Cleaning Matters

Real-world data is never perfect.

You will encounter:

Missing values
Invalid emails
Wrong data types
Duplicate values
Mixed date formats
Inconsistent capitalization
Broken records
Nested JSON structures

If your data is not cleaned properly:

Analytics become misleading
Dashboards become unreliable
Machine Learning models learn from bad inputs and perform poorly
Business decisions rest on inaccurate numbers

That is why Data Cleaning is one of the most valuable skills in Data Science.

Step 1 — Create a Realistic JSON Dataset

Save this file as:

company_data.json
{
  "employees": [

    {
      "id": "E101",
      "name": " Amit Sharma ",
      "email": "AMIT@example.COM ",
      "age": "24",
      "salary": "55000",
      "department": "it",
      "city": "delhi ",
      "skills": ["Python", "SQL", "python"],
      "projects_completed": "12",
      "is_active": "TRUE"
    },

    {
      "id": "E102",
      "name": "",
      "email": "priyaexample.com",
      "age": null,
      "salary": null,
      "department": "HR",
      "city": "Mumbai",
      "skills": ["Excel", "Communication"],
      "projects_completed": "five",
      "is_active": "False"
    },

    {
      "id": "E103",
      "name": "RAHUL",
      "email": "rahul@example.com",
      "age": "twenty six",
      "salary": "70000",
      "department": null,
      "city": "BANGALORE",
      "skills": [],
      "projects_completed": "-3",
      "is_active": true
    }
  ]
}

Step 2 — Load the JSON File

import json

def load_data(filename):
    # Open with an explicit encoding so the result does not depend on the OS default
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)

data = load_data("company_data.json")

Understanding the Code

json.load(f)

Converts JSON file data into Python dictionaries and lists.

with open()

Safely opens and closes the file automatically.

return

Returns the parsed data to the caller so it can be reused.
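In practice the file may be missing or may contain broken JSON. The sketch below shows one possible way to handle both cases. The function name load_data_safe is hypothetical (it is not part of the json module), and returning None on failure is just one design choice among several.

```python
import json

def load_data_safe(filename):
    """Hypothetical variant of load_data that reports common failures."""
    try:
        with open(filename, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # The path does not exist
        print("File not found:", filename)
    except json.JSONDecodeError as err:
        # The file exists but is not valid JSON
        print("Invalid JSON in", filename, "at line", err.lineno)
    # Reached only when one of the errors above occurred
    return None
```

A caller can then check "if data is None" before moving on to inspection, instead of crashing with a traceback.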

Step 3 — Inspect Raw Data

Professional Data Scientists NEVER clean blindly.

First, inspect the dataset.

print(type(data))
print(data.keys())
print(len(data['employees']))
print(data['employees'][0])

This helps us understand:

Structure
Columns
Missing values
Data quality
Nested relationships
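Printing a single record only shows one example. To spot wrong data types across the whole dataset, it can help to count which Python types appear under each key. This is a minimal sketch: the helper summarize_types is a made-up name, and the records are an abbreviated copy of the sample dataset so the snippet runs on its own.

```python
from collections import Counter, defaultdict

# Abbreviated records in the same shape as company_data.json
employees = [
    {"id": "E101", "age": "24", "is_active": "TRUE"},
    {"id": "E102", "age": None, "is_active": "False"},
    {"id": "E103", "age": "twenty six", "is_active": True},
]

def summarize_types(records):
    """Count which Python types appear under each key."""
    summary = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    return summary

type_summary = summarize_types(employees)

for key, counts in type_summary.items():
    print(key, dict(counts))
```

A key that reports more than one type (for example both str and bool under is_active) is a sign that the field needs normalization.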

Step 4 — Detect Missing Values

for emp in data['employees']:

    for key, value in emp.items():

        if value is None or value == "":
            print(emp['id'], "missing", key)

Why This Matters

Missing values are one of the biggest problems in real-world datasets.

Examples:

Missing salary
Missing email
Missing age
Empty names

Professionals audit datasets before cleaning them.
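Printing each missing field works for three records, but a per-field count is easier to read on larger datasets. Here is one possible sketch of such an audit. The helper count_missing is hypothetical, the records are a shortened copy of the sample data, and treating whitespace-only strings as missing is an assumption added here.

```python
from collections import Counter

# Shortened copy of the sample records
employees = [
    {"id": "E101", "name": " Amit Sharma ", "age": "24", "salary": "55000"},
    {"id": "E102", "name": "", "age": None, "salary": None},
    {"id": "E103", "name": "RAHUL", "age": "twenty six", "salary": "70000"},
]

def count_missing(records):
    """Return how many records are missing each field."""
    missing = Counter()
    for record in records:
        for key, value in record.items():
            # Treat None and blank or whitespace-only strings as missing
            if value is None or (isinstance(value, str) and value.strip() == ""):
                missing[key] += 1
    return missing

missing_counts = count_missing(employees)
print(dict(missing_counts))
```

Note that "twenty six" is not counted as missing here. It is present but invalid, which is a separate problem handled during cleaning.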

Step 5 — Create the Cleaning Function

Now we create a professional cleaning pipeline.

def clean_data(data):

    cleaned_employees = []

    for emp in data['employees']:

        # CLEAN NAME (treat None like an empty string)
        name = (emp['name'] or "").strip().title()

        if name == "":
            name = "Unknown"

        # CLEAN EMAIL (basic check only: the address must contain "@")
        email = (emp['email'] or "").strip().lower()

        if "@" not in email:
            email = None

        # CLEAN AGE (None or text such as "twenty six" becomes None)
        try:
            age = int(emp['age'])
        except (TypeError, ValueError):
            age = None

        # CLEAN SALARY (0 marks a missing or invalid salary)
        try:
            salary = float(emp['salary'])
        except (TypeError, ValueError):
            salary = 0

        # CLEAN DEPARTMENT
        dept = emp['department']

        if dept:
            dept = dept.strip().upper()

        # CLEAN CITY
        city = (emp['city'] or "").strip().title()

        # CLEAN SKILLS (lowercase, remove duplicates, sort for a stable order)
        skills = sorted({skill.strip().lower() for skill in (emp['skills'] or [])})

        # CLEAN PROJECTS (invalid or negative counts become 0)
        try:
            projects = max(int(emp['projects_completed']), 0)
        except (TypeError, ValueError):
            projects = 0

        # CLEAN BOOLEAN ("TRUE", "true", True, "1" and "yes" all become True)
        active = str(emp['is_active']).strip().lower()

        is_active = active in ['true', '1', 'yes']

        cleaned_employees.append({
            "id": emp['id'],
            "name": name,
            "email": email,
            "age": age,
            "salary": salary,
            "department": dept,
            "city": city,
            "skills": skills,
            "projects_completed": projects,
            "is_active": is_active
        })

    return {"employees": cleaned_employees}

Step 6 — Run the Cleaning Pipeline

cleaned_data = clean_data(data)

At this stage:

Invalid emails become None
Names are standardized
Salary becomes numeric
Duplicate skills are removed
Boolean values become consistent
Negative project counts are fixed
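The cleaning function only checks whether the email contains "@", so a value like "a@b" or "@@" would still pass. A slightly stricter check can be sketched with a regular expression. The pattern below is a deliberately simple assumption (something@something.tld), not a full email standard, and clean_email is a hypothetical helper name.

```python
import re

# Simple pattern (an assumption for this sketch): text, "@", text, ".", text,
# with no spaces and no extra "@" characters
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_email(raw):
    """Return a normalized email, or None when the value looks invalid."""
    if not isinstance(raw, str):
        return None
    email = raw.strip().lower()
    return email if EMAIL_PATTERN.match(email) else None

print(clean_email("AMIT@example.COM "))
print(clean_email("priyaexample.com"))
```

The first call prints the normalized address and the second prints None, because the value has no "@".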

Step 7 — Generate Business Insights

This is where cleaned data becomes valuable.

Total Employees

print("Total Employees:", len(cleaned_data['employees']))

Average Salary

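total_salary = 0
count = 0

for emp in cleaned_data['employees']:

    # Skip salaries of 0, which mark missing values
    if emp['salary'] > 0:
        total_salary += emp['salary']
        count += 1

# Guard against dividing by zero when no valid salary exists
if count > 0:
    average_salary = total_salary / count
    print("Average Salary:", average_salary)
else:
    print("Average Salary: no valid salaries found")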
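The same calculation can be written more compactly with the standard library. This is a sketch under stated assumptions: the helper average_salary_of is a made-up name, the records mimic the cleaned output for the sample dataset, and a salary of 0 is treated as "missing", as in the cleaning function.

```python
from statistics import mean

# Records shaped like the cleaned output (only the fields needed here)
cleaned_employees = [
    {"name": "Amit Sharma", "salary": 55000.0},
    {"name": "Unknown", "salary": 0},
    {"name": "Rahul", "salary": 70000.0},
]

def average_salary_of(records):
    """Average of all positive salaries, or None when there are none."""
    salaries = [r["salary"] for r in records if r["salary"] > 0]
    return mean(salaries) if salaries else None

print("Average Salary:", average_salary_of(cleaned_employees))
```

For these three records the average is taken over 55000 and 70000 only, which gives 62500.0. Returning None for an empty list avoids the error that mean raises when it receives no data.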

Most Skilled Employee

max_skills = 0
top_employee = ""

for emp in cleaned_data['employees']:

    if len(emp['skills']) > max_skills:

        max_skills = len(emp['skills'])
        top_employee = emp['name']

print(top_employee, "has the most skills")

Active Employees

for emp in cleaned_data['employees']:

    if emp['is_active']:
        print(emp['name'], "is active")

Step 8 — Save the Cleaned Dataset

Professional workflows usually save the cleaned dataset to a new file, so the raw data stays untouched.

def save_data(data, filename):
    # Write with an explicit encoding and readable indentation
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data(cleaned_data, "cleaned_company_data.json")

This creates a new clean JSON file ready for:

Analytics
Dashboards
Machine Learning
APIs
Reporting
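Before handing the file to another tool, it can be worth checking that what was written can be read back unchanged. The sketch below does a simple round trip. The helper save_and_verify is hypothetical, the sample record is a shortened stand-in for the cleaned data, and a temporary directory is used so the snippet does not overwrite any real file.

```python
import json
import os
import tempfile

# Shortened stand-in for the cleaned dataset
sample = {"employees": [{"id": "E101", "name": "Amit Sharma", "salary": 55000.0}]}

def save_and_verify(data, filename):
    """Write data as JSON, read it back, and report whether it matches."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    with open(filename, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    return reloaded == data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_company_data.json")
    round_trip_ok = save_and_verify(sample, path)

print("Round trip OK:", round_trip_ok)
```

This kind of check only confirms that the file is valid JSON with the same content. It does not validate the data itself.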

Key Data Cleaning Concepts You Learned

Concept	Purpose
.strip()	Remove extra spaces
.lower()	Normalize text to lowercase
.title()	Standardize capitalization
try/except	Handle invalid data safely
set()	Remove duplicates (order is not preserved)
Validation	Ensure data quality
Type conversion	Convert strings to numbers
Boolean normalization	Standardize True/False
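The concepts in the table can be seen side by side in a few lines. This is only an illustrative sketch with made-up raw values in the style of the sample dataset. It is not a replacement for the full cleaning function.

```python
# Made-up raw values in the style of the sample dataset
raw_name = " amit SHARMA "
raw_email = "AMIT@example.COM "
raw_skills = ["Python", "SQL", "python"]
raw_age = "twenty six"
raw_active = "TRUE"

# .strip() and .title(): remove extra spaces, standardize capitalization
name = raw_name.strip().title()

# .strip() and .lower(): normalize text
email = raw_email.strip().lower()

# set(): remove duplicates (sorted so the order is predictable)
skills = sorted(set(skill.lower() for skill in raw_skills))

# try/except with type conversion: invalid numbers become None
try:
    age = int(raw_age)
except ValueError:
    age = None

# Boolean normalization: map text to True/False
is_active = str(raw_active).lower() in ["true", "1", "yes"]

print(name, email, skills, age, is_active)
```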

How Professional Data Scientists Think

Weak programmers:

Just print data

Strong Data Scientists:

Inspect data
Validate data
Normalize formats
Detect anomalies
Build pipelines
Generate insights

Real Data Science is:

Raw Data → Clean Data → Insights → Decisions

Final Thoughts

If you want to become a strong Data Scientist, master:

JSON handling
Data structures
Cleaning pipelines
Validation
Relationships
Data transformations

Machine Learning becomes much easier once you understand your data well.

Remember:

The quality of your model depends on the quality of your data.

Clean data creates powerful systems.

Practice Tasks

Try solving these yourself:

Find invalid emails
Count employees department-wise
Find highest salary employee
Count skill frequency
Find employees with missing age
Create an analytics report dictionary
Export cleaned data to a new JSON file

Written by Aditya Data Scientist
