Python Data Cleaning Tutorial: Learn Real-World JSON Cleaning Step-by-Step

Learn how to clean real-world JSON data in Python step-by-step. Master missing values, duplicate removal, relationship validation, and professional data cleaning workflows.
Aditya Data Scientist
19 min
May 18, 2026

Data Cleaning is one of the most important skills in Data Science. Many beginners focus only on Machine Learning and AI, but in practice a large share of a Data Scientist's time goes to cleaning, validating, structuring, and analyzing messy data.

In this tutorial, you will learn:

  • How to inspect messy JSON data
  • How to detect data problems
  • How to clean duplicate records
  • How to validate relationships
  • How to normalize datasets
  • How professional Data Scientists think

This guide is beginner-friendly and designed to help you build strong Data Science fundamentals using Python.

Why Data Cleaning is Important

Real-world data is rarely perfect.

Datasets often contain:

  • Missing values
  • Duplicate records
  • Invalid relationships
  • Empty fields
  • Broken IDs
  • Inconsistent formatting

If your data is not cleaned properly:

  • Analytics become wrong
  • Dashboards become unreliable
  • Machine Learning models fail
  • Business decisions become inaccurate

That is why Data Cleaning is one of the most valuable skills in Data Science.

Example JSON Dataset

We will work with this real-world style JSON dataset.

{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}

Step 1 — Load JSON Data in Python

First, import Python's built-in json module and load the dataset from a file named data.json.

import json

with open("data.json", "r") as f:
    data = json.load(f)

Understanding the Code

Code          Meaning
open()        Opens the file
"r"           Read mode
json.load()   Parses the JSON text in the file into Python objects (here, a dictionary)
data          Stores the loaded dataset
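
Loading can fail when the file is missing or does not contain valid JSON. A minimal sketch of handling both cases, assuming a hypothetical helper named load_json (it is not part of the original steps) that returns None instead of crashing:

```python
import json


def load_json(path):
    """Hypothetical helper: return parsed JSON, or None if loading fails."""
    try:
        # An explicit encoding avoids platform-dependent defaults
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        print("File not found:", path)
    except json.JSONDecodeError as err:
        print("Invalid JSON in", path, "-", err)
    return None


data = load_json("data.json")
```

Returning None is only one possible design choice. In a larger project it may be better to let the exception propagate so the failure cannot be overlooked.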

Step 2 — Understand the Data Structure

Before cleaning data, professional Data Scientists inspect the structure carefully.

print(data.keys())

Output:

dict_keys(['users', 'pages'])

This means our dataset contains:

  • users
  • pages
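
Beyond the top-level keys, it often helps to see how many records each section holds and which fields they use. A small sketch of such a summary, using a shortened copy of the dataset so it runs on its own (the summary variable is an illustrative name):

```python
# Small sample in the same shape as the tutorial dataset
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
    ],
}

summary = {}
for section, records in data.items():
    # Collect every field name that appears in this section's records
    fields = sorted({key for record in records for key in record})
    summary[section] = {"count": len(records), "fields": fields}
    print(section, "->", len(records), "records, fields:", fields)
```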

Step 3 — Find Missing Names

One user has an empty name.

Problem:

{"id": 3, "name": ""}

Detect Missing Values

for user in data['users']:
    # An empty string means the name was never filled in
    if user['name'] == "":
        print("Missing name:", user)

Why This Matters

Missing values can break:

  • reports
  • analytics
  • dashboards
  • Machine Learning systems
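
In other datasets a name can be missing in more ways than an empty string: the key may be absent, the value may be null, or it may contain only spaces. A hedged sketch of a broader check, assuming a hypothetical helper named is_missing_name and a few made-up records that are not in the tutorial dataset:

```python
def is_missing_name(user):
    """Hypothetical helper: True if the name is absent, None, or blank."""
    name = user.get("name")
    return not isinstance(name, str) or name.strip() == ""


users = [
    {"id": 1, "name": "Amit"},
    {"id": 3, "name": ""},
    {"id": 6, "name": "   "},   # whitespace only (made-up record)
    {"id": 7, "name": None},    # JSON null (made-up record)
    {"id": 8},                  # key missing (made-up record)
]

missing_ids = [user["id"] for user in users if is_missing_name(user)]
print("Users with missing names:", missing_ids)
```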

Step 4 — Find Duplicate User Names

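Notice that the name "Amit" appears twice, for user IDs 1 and 5. A repeated name is not necessarily an error, because two different people can share a name, so here we only flag it for review.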

Detect Duplicate Names

seen_names = set()

for user in data['users']:
    # A set keeps each membership check fast, even for large datasets
    if user['name'] in seen_names:
        print("Duplicate name:", user['name'])
    seen_names.add(user['name'])
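
The same check can also be written with collections.Counter, which reports how often each name occurs. A minimal sketch, using only the id and name fields of the tutorial's users and skipping empty names:

```python
from collections import Counter

users = [
    {"id": 1, "name": "Amit"},
    {"id": 2, "name": "Priya"},
    {"id": 3, "name": ""},
    {"id": 4, "name": "Sara"},
    {"id": 5, "name": "Amit"},
]

# Count each non-empty name, then keep those seen more than once
name_counts = Counter(user["name"] for user in users if user["name"])
duplicate_names = {name: n for name, n in name_counts.items() if n > 1}
print("Duplicate names:", duplicate_names)
```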

Step 5 — Find Duplicate Friend IDs

Sara has:

"friends": [2,2]

This means duplicate relationships exist.

Detect Duplicate Friends

for user in data['users']:
    # A set drops repeats, so a shorter set means the list had duplicates
    if len(user['friends']) != len(set(user['friends'])):
        print("Duplicate friends found:", user['name'])

Understanding set()

set([2,2])

becomes:

{2}

set() automatically removes duplicates.
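
One caveat worth knowing: a set does not promise to keep the original order of the items. When order matters, dict.fromkeys() is a common alternative, because dictionaries keep insertion order in Python 3.7 and later. A short sketch with a made-up friend list:

```python
friends = [4, 2, 4, 1, 2]

# set() removes duplicates but does not promise any particular order
as_set = set(friends)

# dict.fromkeys() keeps the first occurrence of each value, in order
deduplicated = list(dict.fromkeys(friends))
print(deduplicated)
```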

Step 6 — Find Duplicate Page IDs

Notice:

104

exists twice inside pages.

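Duplicate IDs are dangerous in databases because an ID is supposed to identify exactly one record. Here ID 104 points to two different names, "Web Dev Hub" and "Web Development", so any lookup by ID becomes ambiguous.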

Detect Duplicate Page IDs

seen_page_ids = set()

for page in data['pages']:
    # Report every page whose ID was already seen earlier in the list
    if page['id'] in seen_page_ids:
        print("Duplicate Page ID:", page)
    seen_page_ids.add(page['id'])
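
Before deciding which record to keep, it can help to see all records that share an ID side by side. A sketch that groups page names by ID, using a shortened copy of the pages list (names_by_id and conflicts are illustrative names):

```python
from collections import defaultdict

pages = [
    {"id": 101, "name": "Python Developers"},
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"},
]

# Group page names under their ID
names_by_id = defaultdict(list)
for page in pages:
    names_by_id[page["id"]].append(page["name"])

# Keep only the IDs that appear more than once
conflicts = {pid: names for pid, names in names_by_id.items() if len(names) > 1}
print("Conflicting page IDs:", conflicts)
```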

Step 7 — Find Users with No Friends

Some users may have empty friend lists.

Detect Empty Friend Lists

for user in data['users']:
    # An empty list is falsy in Python
    if not user['friends']:
        print(user['name'], "has no friends")

Step 8 — Clean Missing Names

Now we start cleaning the dataset.

Replace Empty Names

for user in data['users']:
    # Replace the empty string with a clear placeholder
    if user['name'] == "":
        user['name'] = "Unknown"

Why This Helps

Instead of keeping broken data:

""

we replace it with:

"Unknown"

This makes the dataset more structured and usable.

Step 9 — Remove Duplicate Friends

Cleaning Duplicate Relationships

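for user in data['users']:
    # dict.fromkeys() drops repeated IDs and keeps the original order,
    # which list(set(...)) does not guarantee
    user['friends'] = list(dict.fromkeys(user['friends']))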

Before Cleaning

[2,2]

After Cleaning

[2]

Step 10 — Remove Duplicate Page IDs

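Each ID should identify exactly one record. The code below keeps the first page it sees for each ID and drops later ones, so for ID 104 it keeps "Web Dev Hub" and discards "Web Development". Which record to keep is a judgment call; keeping the first is simply the rule used here.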

Remove Duplicate Pages

unique_pages = []
seen_ids = set()

for page in data['pages']:
    # Keep only the first page seen for each ID
    if page['id'] not in seen_ids:
        unique_pages.append(page)
        seen_ids.add(page['id'])

data['pages'] = unique_pages

Step 11 — Validate Friend Relationships

What if a friend ID points to a user that does not exist?

Such dangling references break any feature that looks a friend up by ID, so we check every friend ID against the known user IDs.

Create User ID Lookup

# A set of all known user IDs makes each lookup fast
user_ids = {user['id'] for user in data['users']}

Validate Friend IDs

for user in data['users']:
    valid_friends = []
    for friend_id in user['friends']:
        # Keep only friend IDs that belong to an existing user
        if friend_id in user_ids:
            valid_friends.append(friend_id)
    user['friends'] = valid_friends
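
The same idea applies to the liked_pages field, which should only contain IDs that exist in the pages list. The tutorial dataset has no invalid page IDs, so this sketch uses a shortened copy with one made-up ID (999) to show the effect:

```python
data = {
    "users": [
        {"id": 1, "name": "Amit", "liked_pages": [101]},
        {"id": 2, "name": "Priya", "liked_pages": [102, 999]},  # 999 is made up
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
    ],
}

page_ids = {page["id"] for page in data["pages"]}

for user in data["users"]:
    # Keep only liked pages that exist in the pages list
    user["liked_pages"] = [pid for pid in user["liked_pages"] if pid in page_ids]

print(data["users"][1]["liked_pages"])
```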

Step 12 — Create Final Cleaning Function

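Now combine the cleaning steps (Steps 8 to 11) into a single function.

def clean_data(data):
    # Replace missing names with a placeholder
    for user in data['users']:
        if user['name'] == "":
            user['name'] = "Unknown"

    # Remove duplicate friend IDs, keeping the original order
    for user in data['users']:
        user['friends'] = list(dict.fromkeys(user['friends']))

    # Remove pages with duplicate IDs, keeping the first occurrence
    unique_pages = []
    seen_ids = set()
    for page in data['pages']:
        if page['id'] not in seen_ids:
            unique_pages.append(page)
            seen_ids.add(page['id'])
    data['pages'] = unique_pages

    # Drop friend IDs that do not match any existing user
    user_ids = {user['id'] for user in data['users']}
    for user in data['users']:
        user['friends'] = [f for f in user['friends'] if f in user_ids]

    return data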

Step 13 — Run the Cleaning Function

cleaned_data = clean_data(data)
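
Note that clean_data changes the dictionary it receives and returns that same object, so data and cleaned_data refer to the same cleaned records. If you want to compare before and after, one option is to clean a deep copy. A sketch using a shortened stand-in function (clean_names is a hypothetical name, used here only to keep the example self-contained):

```python
import copy


def clean_names(data):
    """Stand-in for clean_data: edits the dictionary it receives."""
    for user in data["users"]:
        if user["name"] == "":
            user["name"] = "Unknown"
    return data


raw_data = {"users": [{"id": 3, "name": ""}]}

# Clean a deep copy so the raw records stay available for comparison
cleaned_data = clean_names(copy.deepcopy(raw_data))

print("Raw name:", repr(raw_data["users"][0]["name"]))
print("Cleaned name:", repr(cleaned_data["users"][0]["name"]))
```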

Step 14 — Save Cleaned Data

Save the cleaned dataset to a new file so the original raw file stays untouched and the cleaning can be repeated or audited later.

with open("cleaned_data.json", "w", encoding="utf-8") as f:
    json.dump(cleaned_data, f, indent=4)
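
A quick way to gain confidence in the saved file is to load it again and compare it with the data in memory. A minimal sketch, assuming a shortened dataset and a temporary folder so it does not touch real files:

```python
import json
import os
import tempfile

cleaned_data = {
    "users": [{"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}],
    "pages": [{"id": 104, "name": "Web Dev Hub"}],
}

# A temporary folder keeps this sketch from touching real files
path = os.path.join(tempfile.mkdtemp(), "cleaned_data.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(cleaned_data, f, indent=4)

with open(path, "r", encoding="utf-8") as f:
    reloaded = json.load(f)

# The file is only useful if it loads back to the same structure
print("Round trip OK:", reloaded == cleaned_data)
```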

Final Cleaned Dataset Example

{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Unknown", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"}
    ]
}

What You Learned

Skill                      Importance
Data Inspection            Understand dataset quality
Missing Value Handling     Improve reliability
Deduplication              Remove repeated records
Relationship Validation    Maintain database integrity
JSON Processing            Handle structured data
Data Cleaning Pipelines    Real-world Data Science workflow

Real Data Science Mindset

Weak programmers:

  • Just print data

Strong Data Scientists:

  • Inspect data
  • Audit quality
  • Clean datasets
  • Validate relationships
  • Structure information
  • Generate insights

Real Data Science is:

Raw Data → Clean Data → Insights → Decisions

Practice Tasks

Try solving these yourself:

  1. Find the most liked page
  2. Find users with the most friends
  3. Replace friend IDs with friend names
  4. Create page lookup dictionary
  5. Generate analytics report
  6. Count total friendships
  7. Find inactive users
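
If you get stuck, here is one possible starting point for tasks 1 and 4. It is a sketch rather than the only solution, and it uses the cleaned dataset with the friends field left out for brevity (page_lookup and like_counts are illustrative names):

```python
from collections import Counter

data = {
    "users": [
        {"id": 1, "name": "Amit", "liked_pages": [101]},
        {"id": 2, "name": "Priya", "liked_pages": [102]},
        {"id": 3, "name": "Unknown", "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "liked_pages": [104]},
        {"id": 5, "name": "Amit", "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}

# Task 4: map each page ID to its name for quick lookups
page_lookup = {page["id"]: page["name"] for page in data["pages"]}

# Task 1: count likes per page ID, then take the most common one
like_counts = Counter(pid for user in data["users"] for pid in user["liked_pages"])
top_id, top_likes = like_counts.most_common(1)[0]

print(page_lookup[top_id], "has", top_likes, "likes")
```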

Final Thoughts

Data Cleaning is one of the most important skills in Data Science.

Before building Machine Learning models, professional Data Scientists spend much of their time:

  • understanding data
  • cleaning datasets
  • validating relationships
  • structuring information
  • generating business insights

If you master Data Cleaning and JSON processing using Python, your Data Science foundation becomes extremely strong.

Written by Aditya Data Scientist
