Python Data Cleaning Tutorial: Learn Real-World JSON Data Cleaning Step-by-Step

Data Cleaning is one of the most important skills in Data Science. Most beginners focus only on Machine Learning and AI, but in reality, professional Data Scientists spend most of their time cleaning, validating, structuring, and analyzing messy data.

In this tutorial, you will learn:

How to inspect messy JSON data
How to detect data problems
How to clean duplicate records
How to validate relationships
How to normalize datasets
How professional Data Scientists think

This guide is beginner-friendly and designed to help you build strong Data Science fundamentals using Python.

Why Data Cleaning is Important

Real-world data is rarely perfect.

Datasets often contain:

Missing values
Duplicate records
Invalid relationships
Empty fields
Broken IDs
Inconsistent formatting

If your data is not cleaned properly:

Analytics become wrong
Dashboards become unreliable
Machine Learning models fail
Business decisions become inaccurate

That is why Data Cleaning is one of the most valuable skills in Data Science.

Example JSON Dataset

We will work with this real-world style JSON dataset.

{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}

Step 1 — Load JSON Data in Python

First, import the JSON module and load the dataset.

import json

with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

Understanding the Code

Code	Meaning
open()	Opens the file
"r"	Read mode
json.load()	Parses the JSON file into Python objects (here, a dictionary)
data	Stores the dataset
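
If you do not have a data.json file yet, the same structure can be parsed from a string with json.loads. The sketch below is only an illustration: the inline raw string is a shortened stand-in for the dataset above, not the full file.

```python
import json

# Shortened stand-in for data.json (assumption: same top-level shape as the dataset above)
raw = """
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"}
    ]
}
"""

# json.loads parses a string; json.load (no "s") parses an open file
data = json.loads(raw)

print(type(data))                 # <class 'dict'>
print(type(data["users"]))        # <class 'list'>
print(data["users"][0]["name"])   # Amit
```

JSON objects become Python dictionaries and JSON arrays become Python lists, which is why the later steps can loop over data['users'].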

Step 2 — Understand the Data Structure

Before cleaning data, professional Data Scientists inspect the structure carefully.

print(data.keys())

Output:

dict_keys(['users', 'pages'])

This means our dataset contains:

users
pages
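
Beyond the top-level keys, it often helps to see how many records each section holds and which fields they carry. A minimal sketch, assuming a small stand-in dataset with the same keys as above; the summary dictionary is a name chosen only for this example.

```python
# Small stand-in for the tutorial dataset (assumption: same keys as above)
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
    ],
}

# For each top-level section: how many records, and which fields they carry
summary = {}
for section, records in data.items():
    fields = set()
    for record in records:
        fields.update(record.keys())
    summary[section] = {"count": len(records), "fields": sorted(fields)}
    print(section, summary[section])
```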

Step 3 — Find Missing Names

One user has an empty name.

Problem:

{"id": 3, "name": ""}

Detect Missing Values

for user in data['users']:

    if user['name'] == "":
        print("Missing name:", user)

Why This Matters

Missing values can break:

reports
analytics
dashboards
Machine Learning systems
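
The check above only catches a name that is exactly an empty string. Names can also be missing in other ways, such as whitespace only, null, or no "name" key at all. The sketch below shows one possible broader check. The helper is_missing_name is hypothetical, and every record except id 3 is made up for illustration.

```python
def is_missing_name(user):
    """Return True if the name is absent, None, or blank (hypothetical helper)."""
    name = user.get("name")   # .get avoids a KeyError if the key is absent
    return not isinstance(name, str) or name.strip() == ""

# Illustrative records; only id 3 mirrors the tutorial dataset
users = [
    {"id": 1, "name": "Amit"},
    {"id": 3, "name": ""},
    {"id": 6, "name": "   "},   # whitespace only (made-up example)
    {"id": 7, "name": None},    # explicit null (made-up example)
    {"id": 8},                  # key missing entirely (made-up example)
]

missing_ids = [user["id"] for user in users if is_missing_name(user)]
print("Users with missing names:", missing_ids)
```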

Step 4 — Find Duplicate User Names

Notice:

"Amit"

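appears twice (user IDs 1 and 5). Because the IDs differ, these may be two different people who share a name, so this step only flags the repeat; nothing is deleted.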

Detect Duplicate Names

names = []

for user in data['users']:

    if user['name'] in names:
        print("Duplicate name:", user['name'])

    names.append(user['name'])
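
An alternative sketch uses collections.Counter from the standard library to count names in one pass and then lists the IDs that share each repeated name. Skipping empty names here is an assumption of this example, so that missing names are not reported as duplicates of each other.

```python
from collections import Counter

users = [
    {"id": 1, "name": "Amit"},
    {"id": 2, "name": "Priya"},
    {"id": 3, "name": ""},
    {"id": 4, "name": "Sara"},
    {"id": 5, "name": "Amit"},
]

# Count non-empty names only, so missing names are not reported as duplicates
name_counts = Counter(user["name"] for user in users if user["name"])

# Map each repeated name to the IDs that share it
duplicates = {
    name: [user["id"] for user in users if user["name"] == name]
    for name, count in name_counts.items()
    if count > 1
}
print(duplicates)
```

Seeing the IDs next to the name makes it easier to decide whether the records are true duplicates or simply two people with the same name.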

Step 5 — Find Duplicate Friend IDs

Sara has:

"friends": [2,2]

This means duplicate relationships exist.

Detect Duplicate Friends

for user in data['users']:

    if len(user['friends']) != len(set(user['friends'])):

        print("Duplicate friends found:", user['name'])

Understanding set()

set([2,2])

becomes:

{2}

set() automatically removes duplicates.
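
A short sketch of the behaviour described above. The second list is made up, and it shows one caveat: a set has no defined order, so compare sets with other sets rather than relying on position.

```python
friends = [2, 2]

unique_friends = set(friends)
print(unique_friends)                   # {2}

# Comparing lengths reveals whether anything was dropped
has_duplicates = len(friends) != len(unique_friends)
print(has_duplicates)                   # True

# Sets are unordered, so compare as sets rather than by position
same_members = set([3, 1, 3, 2]) == {1, 2, 3}
print(same_members)                     # True
```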

Step 6 — Find Duplicate Page IDs

Notice:

"id": 104

exists twice inside pages.

Duplicate IDs are dangerous in databases because IDs should always be unique.

Detect Duplicate Page IDs

page_ids = []

for page in data['pages']:

    if page['id'] in page_ids:

        print("Duplicate Page ID:", page)

    page_ids.append(page['id'])
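
When an ID repeats, it is useful to see every name stored under it before deciding which record to keep. A minimal sketch using collections.defaultdict; names_by_id and conflicts are names chosen for this example, and the page list is a shortened copy of the dataset above.

```python
from collections import defaultdict

pages = [
    {"id": 101, "name": "Python Developers"},
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"},
]

# Collect every name seen under each page ID
names_by_id = defaultdict(list)
for page in pages:
    names_by_id[page["id"]].append(page["name"])

# Keep only the IDs that appear more than once
conflicts = {
    page_id: names
    for page_id, names in names_by_id.items()
    if len(names) > 1
}
print(conflicts)
```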

Step 7 — Find Users with No Friends

Some users may have empty friend lists.

Detect Empty Friend Lists

for user in data['users']:

    if len(user['friends']) == 0:

        print(user['name'], "has no friends")

Step 8 — Clean Missing Names

Now we start cleaning the dataset.

Replace Empty Names

for user in data['users']:

    if user['name'] == "":

        user['name'] = "Unknown"

Why This Helps

Instead of keeping broken data:

""

we replace it with:

"Unknown"

This makes the dataset more structured and usable.
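
One optional refinement, shown here only as a sketch, is to record which records were filled in so the change can be reviewed later. The filled_ids list is a hypothetical addition and not part of the tutorial's pipeline.

```python
users = [
    {"id": 1, "name": "Amit"},
    {"id": 3, "name": ""},
]

# Remember which records were changed so the fix can be audited later
filled_ids = []
for user in users:
    if user["name"] == "":
        user["name"] = "Unknown"
        filled_ids.append(user["id"])

print(users)
print("Filled in:", filled_ids)
```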

Step 9 — Remove Duplicate Friends

Cleaning Duplicate Relationships

for user in data['users']:

    user['friends'] = list(set(user['friends']))

Before Cleaning

[2,2]

After Cleaning

[2]
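
Because a set is unordered, list(set(...)) does not promise to keep friends in their original order. If order matters, one common alternative is dict.fromkeys, sketched below on a made-up list.

```python
friends = [3, 2, 3, 1, 2]   # made-up list with repeats in a non-sorted order

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
deduped = list(dict.fromkeys(friends))
print(deduped)   # [3, 2, 1]
```

Each ID keeps the position of its first appearance, so the cleaned list stays easy to compare with the original.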

Step 10 — Remove Duplicate Page IDs

Professional systems always maintain unique IDs.

Remove Duplicate Pages

unique_pages = []

seen_ids = []

for page in data['pages']:

    if page['id'] not in seen_ids:

        unique_pages.append(page)

        seen_ids.append(page['id'])

data['pages'] = unique_pages
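
The loop above keeps the first page seen for each ID. On larger datasets, a set is usually a better container for seen_ids than a list, because membership checks on a set do not have to scan every element. A sketch of the same logic with that one change:

```python
pages = [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"},
]

unique_pages = []
seen_ids = set()   # set instead of list: faster "in" checks

for page in pages:
    if page["id"] not in seen_ids:
        unique_pages.append(page)   # the first occurrence wins
        seen_ids.add(page["id"])

print([page["name"] for page in unique_pages])
```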

Step 11 — Validate Friend Relationships

What if a friend ID does not exist?

Professional Data Scientists validate relationships.

Create User ID Lookup

user_ids = []

for user in data['users']:

    user_ids.append(user['id'])

Validate Friend IDs

for user in data['users']:

    valid_friends = []

    for friend_id in user['friends']:

        if friend_id in user_ids:

            valid_friends.append(friend_id)

    user['friends'] = valid_friends
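
The same idea can be extended to liked_pages, and it can report what was removed instead of dropping it silently. This is a sketch under stated assumptions: the IDs 99 and 555 are made-up dangling references added for illustration, and the removed list is a hypothetical audit log.

```python
users = [
    {"id": 1, "name": "Amit", "friends": [2, 99], "liked_pages": [101]},     # 99 does not exist
    {"id": 2, "name": "Priya", "friends": [1], "liked_pages": [102, 555]},   # 555 does not exist
]
pages = [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
]

# Sets give fast membership checks for valid IDs
user_ids = {user["id"] for user in users}
page_ids = {page["id"] for page in pages}

removed = []   # (user id, field, dropped references)
for user in users:
    for field, valid_ids in (("friends", user_ids), ("liked_pages", page_ids)):
        kept = [ref for ref in user[field] if ref in valid_ids]
        dropped = [ref for ref in user[field] if ref not in valid_ids]
        if dropped:
            removed.append((user["id"], field, dropped))
        user[field] = kept

print(removed)
```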

Step 12 — Create Final Cleaning Function

Now combine everything into a professional cleaning pipeline.

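def clean_data(data):

    # CLEAN MISSING NAMES
    for user in data['users']:
        if user['name'] == "":
            user['name'] = "Unknown"

    # REMOVE DUPLICATE FRIENDS
    for user in data['users']:
        user['friends'] = list(set(user['friends']))

    # REMOVE DUPLICATE PAGE IDs (keep the first page seen for each ID)
    unique_pages = []
    seen_ids = []
    for page in data['pages']:
        if page['id'] not in seen_ids:
            unique_pages.append(page)
            seen_ids.append(page['id'])
    data['pages'] = unique_pages

    # VALIDATE FRIEND IDs (keep only friends that exist as users)
    user_ids = []
    for user in data['users']:
        user_ids.append(user['id'])

    for user in data['users']:
        valid_friends = []
        for friend_id in user['friends']:
            if friend_id in user_ids:
                valid_friends.append(friend_id)
        user['friends'] = valid_friends

    return data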

Step 13 — Run the Cleaning Function

cleaned_data = clean_data(data)
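
Note that clean_data changes the dictionary it receives, so after the call data and cleaned_data refer to the same cleaned object. If you want to keep the raw data for comparison, one option is to clean a deep copy. The clean_data below is a compact stand-in written only for this sketch (names and friends only), not the full function from Step 12.

```python
import copy

def clean_data(data):
    """Compact stand-in for the tutorial's clean_data (names and friends only)."""
    for user in data["users"]:
        if user["name"] == "":
            user["name"] = "Unknown"
        user["friends"] = list(dict.fromkeys(user["friends"]))
    return data

data = {
    "users": [
        {"id": 3, "name": "", "friends": [1]},
        {"id": 4, "name": "Sara", "friends": [2, 2]},
    ],
    "pages": [],
}

# Clean a deep copy so the raw data stays available for comparison
cleaned_data = clean_data(copy.deepcopy(data))

print(data["users"][0]["name"] == "")       # True: raw data untouched
print(cleaned_data["users"][0]["name"])     # Unknown
print(cleaned_data["users"][1]["friends"])  # [2]
```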

Step 14 — Save Cleaned Data

Professional Data Scientists always save cleaned datasets.

with open("cleaned_data.json", "w") as f:

    json.dump(cleaned_data, f, indent=4)
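
A quick way to confirm the save worked is to load the file back and compare it with what you wrote. The sketch below writes to a temporary directory so it leaves no files behind; the tiny cleaned_data dictionary is a stand-in for the real result.

```python
import json
import os
import tempfile

# Tiny stand-in for the cleaned dataset
cleaned_data = {
    "users": [{"id": 3, "name": "Unknown", "friends": [1], "liked_pages": [101, 103]}],
    "pages": [{"id": 101, "name": "Python Developers"}],
}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.json")

    with open(path, "w", encoding="utf-8") as f:
        json.dump(cleaned_data, f, indent=4)

    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == cleaned_data)   # True
```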

Final Cleaned Dataset Example

{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Unknown", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"}
    ]
}

What You Learned

Skill	Importance
Data Inspection	Understand dataset quality
Missing Value Handling	Improve reliability
Deduplication	Remove repeated records
Relationship Validation	Maintain database integrity
JSON Processing	Handle structured data
Data Cleaning Pipelines	Real-world Data Science workflow

Real Data Science Mindset

Weak programmers:

Just print data

Strong Data Scientists:

Inspect data
Audit quality
Clean datasets
Validate relationships
Structure information
Generate insights

Real Data Science is:

Raw Data → Clean Data → Insights → Decisions

Practice Tasks

Try solving these yourself:

Find the most liked page
Find users with the most friends
Replace friend IDs with friend names
Create page lookup dictionary
Generate analytics report
Count total friendships
Find inactive users

Final Thoughts

Data Cleaning is one of the most important skills in Data Science.

Before building Machine Learning models, professional Data Scientists spend most of their time:

understanding data
cleaning datasets
validating relationships
structuring information
generating business insights

If you master Data Cleaning and JSON processing using Python, your Data Science foundation becomes extremely strong.

Written by Aditya Data Scientist
