
Mastering Data Cleaning in Python: Real-World JSON Cleaning & Analytics Guide for Beginners

Learn how professional Data Scientists clean messy JSON data using Python. This complete beginner-to-advanced guide covers JSON handling, missing values, normalization, validation, data transformation, analytics, and real-world data cleaning workflows with practical examples.
Aditya Data Scientist
11 min
May 17, 2026

Mastering Data Cleaning in Python: A Real-World Beginner to Professional Guide

Data cleaning is one of the most important skills in Data Science. Many beginners think Data Science is only about Machine Learning and AI, but in practice, professional Data Scientists spend a large share of their time cleaning, structuring, and validating messy data.

In this blog, we will learn:

  • How to load JSON data
  • How to inspect messy datasets
  • How to clean real-world data
  • How to generate insights
  • How to save cleaned data
  • How professional Data Scientists think

Why Data Cleaning Matters

Real-world data is rarely perfect.

You will encounter:

  • Missing values
  • Invalid emails
  • Wrong data types
  • Duplicate values
  • Mixed date formats
  • Inconsistent capitalization
  • Broken records
  • Nested JSON structures

If your data is not cleaned properly:

  • Analytics become wrong
  • Dashboards become unreliable
  • Machine Learning models fail
  • Business decisions become inaccurate

That is why Data Cleaning is one of the most valuable skills in Data Science.

Step 1 — Create a Realistic JSON Dataset

Save this file as:

company_data.json
{
  "employees": [

    {
      "id": "E101",
      "name": " Amit Sharma ",
      "email": "AMIT@example.COM ",
      "age": "24",
      "salary": "55000",
      "department": "it",
      "city": "delhi ",
      "skills": ["Python", "SQL", "python"],
      "projects_completed": "12",
      "is_active": "TRUE"
    },

    {
      "id": "E102",
      "name": "",
      "email": "priyaexample.com",
      "age": null,
      "salary": null,
      "department": "HR",
      "city": "Mumbai",
      "skills": ["Excel", "Communication"],
      "projects_completed": "five",
      "is_active": "False"
    },

    {
      "id": "E103",
      "name": "RAHUL",
      "email": "rahul@example.com",
      "age": "twenty six",
      "salary": "70000",
      "department": null,
      "city": "BANGALORE",
      "skills": [],
      "projects_completed": "-3",
      "is_active": true
    }
  ]
}

Step 2 — Load the JSON File

import json

def load_data(filename):

    # An explicit encoding avoids platform-dependent defaults
    with open(filename, "r", encoding="utf-8") as f:
        return json.load(f)

data = load_data("company_data.json")

Understanding the Code

json.load(f)

Parses the JSON file into Python objects: JSON objects become dictionaries, arrays become lists, and null becomes None.

with open()

Safely opens and closes the file automatically.

return

Returns the parsed data to the caller so it can be reused.
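Loading can fail in practice: the file may be missing, or it may contain invalid JSON. A minimal sketch of a more defensive loader is shown here. The name `load_data_safe` and the choice to return `None` on failure are assumptions made for illustration; they are not part of the original code.

```python
import json
import os
import tempfile


def load_data_safe(filename):
    """Return parsed JSON, or None if the file is missing or malformed.

    Hypothetical helper: returning None on failure is an illustrative choice.
    """
    try:
        with open(filename, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        print("File not found:", filename)
    except json.JSONDecodeError as err:
        print("Invalid JSON in", filename, "-", err)
    return None


# Demo with temporary files so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.json")
    bad_path = os.path.join(tmp, "bad.json")

    with open(good_path, "w", encoding="utf-8") as f:
        f.write('{"employees": []}')
    with open(bad_path, "w", encoding="utf-8") as f:
        f.write('{"employees": [')  # truncated on purpose

    good = load_data_safe(good_path)
    bad = load_data_safe(bad_path)
    missing = load_data_safe(os.path.join(tmp, "missing.json"))
```

Whether to return `None`, raise, or exit depends on the workflow; the point is only that both failure modes are caught explicitly rather than crashing with an unexplained traceback.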

Step 3 — Inspect Raw Data

Professional Data Scientists NEVER clean blindly.

First, inspect the dataset.

print(type(data))
print(data.keys())
print(len(data['employees']))
print(data['employees'][0])

This helps us understand:

  • Structure
  • Fields (keys)
  • Missing values
  • Data quality
  • Nested relationships
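Printing one record shows the structure, but not how consistent the fields are across records. One possible way to inspect that is to count which Python types appear under each key. The sketch below uses a small inline sample shaped like `company_data.json`; the helper name `summarize_types` is hypothetical.

```python
from collections import Counter, defaultdict

# Small inline sample mirroring the structure of company_data.json.
sample = {
    "employees": [
        {"id": "E101", "age": "24", "skills": ["Python", "SQL"], "is_active": "TRUE"},
        {"id": "E102", "age": None, "skills": ["Excel"], "is_active": "False"},
        {"id": "E103", "age": "twenty six", "skills": [], "is_active": True},
    ]
}


def summarize_types(records):
    """Count which Python types appear under each key across all records."""
    summary = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    return summary


type_summary = summarize_types(sample["employees"])

for key, counts in type_summary.items():
    print(key, dict(counts))
```

A field that shows more than one type (here `age` and `is_active`) is a strong hint that it needs type conversion during cleaning.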

Step 4 — Detect Missing Values

for emp in data['employees']:

    for key, value in emp.items():

        if value is None or value == "":
            print(emp['id'], "missing", key)

Why This Matters

Missing values are one of the biggest problems in real-world datasets.

Examples:

  • Missing salary
  • Missing email
  • Missing age
  • Empty names

Professionals audit datasets before cleaning them.
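The audit loop above prints one line per missing value. For larger datasets a per-field total is often easier to read. A minimal sketch, assuming the same "None or empty string" definition of missing; the helper `count_missing` is hypothetical, and treating whitespace-only strings as missing is an added assumption.

```python
from collections import Counter

# Inline sample with a subset of the fields from company_data.json.
sample_employees = [
    {"id": "E101", "name": " Amit Sharma ", "age": "24", "salary": "55000"},
    {"id": "E102", "name": "", "age": None, "salary": None},
    {"id": "E103", "name": "RAHUL", "age": "twenty six", "salary": "70000"},
]


def count_missing(records):
    """Return a Counter of how many records have None or "" for each key."""
    missing = Counter()
    for record in records:
        for key, value in record.items():
            # Assumption: whitespace-only strings also count as missing.
            if isinstance(value, str):
                value = value.strip()
            if value is None or value == "":
                missing[key] += 1
    return missing


missing_counts = count_missing(sample_employees)
print(dict(missing_counts))
```

Note that this only detects absent values. A value such as "twenty six" is present but invalid, which is a separate check handled during cleaning.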

Step 5 — Create the Cleaning Function

Now we create a professional cleaning pipeline.

def clean_data(data):

    cleaned_employees = []

    for emp in data['employees']:

        # CLEAN NAME (a missing name is treated as an empty string)
        name = (emp.get('name') or "").strip().title()

        if name == "":
            name = "Unknown"

        # CLEAN EMAIL (basic check only: the address must contain "@")
        email = (emp.get('email') or "").strip().lower()

        if "@" not in email:
            email = None

        # CLEAN AGE (None or non-numeric text becomes None)
        try:
            age = int(emp.get('age'))
        except (TypeError, ValueError):
            age = None

        # CLEAN SALARY (0 marks a missing or invalid salary)
        try:
            salary = float(emp.get('salary'))
        except (TypeError, ValueError):
            salary = 0

        # CLEAN DEPARTMENT
        dept = emp.get('department')

        if dept:
            dept = dept.strip().upper()

        # CLEAN CITY
        city = (emp.get('city') or "").strip().title()

        # CLEAN SKILLS (lowercase, remove duplicates, sort for stable output)
        skills = []

        for skill in emp.get('skills') or []:
            skills.append(skill.strip().lower())

        skills = sorted(set(skills))

        # CLEAN PROJECTS (invalid or negative counts become 0)
        try:
            projects = int(emp.get('projects_completed'))

            if projects < 0:
                projects = 0

        except (TypeError, ValueError):
            projects = 0

        # CLEAN BOOLEAN (accepts real booleans and common text forms)
        active = str(emp.get('is_active')).strip().lower()

        is_active = active in ['true', '1', 'yes']

        cleaned_employees.append({

            "id": emp['id'],
            "name": name,
            "email": email,
            "age": age,
            "salary": salary,
            "department": dept,
            "city": city,
            "skills": skills,
            "projects_completed": projects,
            "is_active": is_active

        })

    return {"employees": cleaned_employees}

Step 6 — Run the Cleaning Pipeline

cleaned_data = clean_data(data)

At this stage:

  • Invalid emails become None
  • Names are standardized
  • Salary becomes numeric
  • Duplicate skills are removed
  • Boolean values become consistent
  • Negative or non-numeric project counts are reset to 0
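The "@" check in the cleaning function is deliberately simple, so a value like "rahul@example" would still pass. A slightly stricter check could be sketched with a regular expression. The pattern and the helper name `clean_email` are illustrative assumptions, not a complete email validator.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
# Illustrative assumption only; this is not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """Return a normalized email string, or None if it looks invalid."""
    if not isinstance(raw, str):
        return None
    email = raw.strip().lower()
    return email if EMAIL_PATTERN.match(email) else None


results = {
    raw: clean_email(raw)
    for raw in ["AMIT@example.COM ", "priyaexample.com", "rahul@example", "a@@b.com"]
}
print(results)
```

Only the first address survives here; the others lack an "@", lack a domain suffix, or contain a doubled "@".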

Step 7 — Generate Business Insights

This is where cleaned data becomes valuable.

Total Employees

print("Total Employees:", len(cleaned_data['employees']))

Average Salary

total_salary = 0
count = 0

for emp in cleaned_data['employees']:

    if emp['salary'] > 0:
        total_salary += emp['salary']
        count += 1

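# Guard against division by zero when no valid salaries exist
if count > 0:
    average_salary = total_salary / count
    print("Average Salary:", average_salary)
else:
    print("Average Salary: no valid salaries found")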
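The same calculation can be written more compactly with the standard library. This is a sketch on a small inline sample; it assumes, as in the cleaning step, that a salary of 0 marks a missing value and should be excluded from the average.

```python
from statistics import mean

# Inline sample shaped like the cleaned records.
cleaned_sample = [
    {"name": "Amit Sharma", "salary": 55000.0},
    {"name": "Unknown", "salary": 0},  # 0 marks a missing salary
    {"name": "Rahul", "salary": 70000.0},
]

# Keep only salaries that are actually known.
known_salaries = [emp["salary"] for emp in cleaned_sample if emp["salary"] > 0]

# mean() raises StatisticsError on an empty list, so check first.
average_salary = mean(known_salaries) if known_salaries else None

print("Average Salary:", average_salary)
```

Excluding the placeholder zeros matters: including them would pull the average down and misrepresent what employees are actually paid.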

Most Skilled Employee

max_skills = 0
top_employee = ""

for emp in cleaned_data['employees']:

    if len(emp['skills']) > max_skills:

        max_skills = len(emp['skills'])
        top_employee = emp['name']

print(top_employee, "has the most skills")
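One detail worth noticing: the loop above keeps the first employee it finds when several share the highest skill count. In the sample data, two employees end up with two skills each after duplicates are removed. A possible variant that reports ties explicitly is sketched below, using Python's built-in `max` on an inline sample.

```python
# Inline sample shaped like the cleaned records.
cleaned_sample = [
    {"name": "Amit Sharma", "skills": ["python", "sql"]},
    {"name": "Unknown", "skills": ["communication", "excel"]},
    {"name": "Rahul", "skills": []},
]

# default=None guards against an empty employee list.
top = max(cleaned_sample, key=lambda emp: len(emp["skills"]), default=None)

tied = []
if top is not None:
    best = len(top["skills"])
    # Collect every employee who matches the highest skill count.
    tied = [emp["name"] for emp in cleaned_sample if len(emp["skills"]) == best]

print("Most skills:", tied)
```

Reporting all tied names avoids an insight that silently depends on the order of records in the file.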

Active Employees

for emp in cleaned_data['employees']:

    if emp['is_active']:
        print(emp['name'], "is active")

Step 8 — Save the Cleaned Dataset

Professional workflows always save cleaned datasets.

def save_data(data, filename):

    with open(filename, "w", encoding="utf-8") as f:

        json.dump(data, f, indent=4)

save_data(cleaned_data, "cleaned_company_data.json")

This creates a new clean JSON file ready for:

  • Analytics
  • Dashboards
  • Machine Learning
  • APIs
  • Reporting
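Before handing the file to another system, it can be worth confirming that what was written reads back identically. A minimal round-trip check is sketched here; it writes to a temporary directory so it does not touch real files, and the inline sample stands in for the cleaned dataset.

```python
import json
import os
import tempfile

# Inline sample standing in for the cleaned dataset.
cleaned_sample = {
    "employees": [
        {"id": "E101", "name": "Amit Sharma", "email": "amit@example.com",
         "age": 24, "salary": 55000.0, "is_active": True},
        {"id": "E102", "name": "Unknown", "email": None,
         "age": None, "salary": 0, "is_active": False},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_company_data.json")

    # Write the data, then read it straight back.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cleaned_sample, f, indent=4)

    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

# None becomes null and back again; numbers and booleans are preserved.
round_trip_ok = reloaded == cleaned_sample
print("Round trip OK:", round_trip_ok)
```

This works because the cleaned records contain only JSON-friendly types (strings, numbers, booleans, lists, and None). Types such as `set` or `datetime` would need converting before `json.dump` accepts them.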

Key Data Cleaning Concepts You Learned

Concept                  Purpose
.strip()                 Remove extra spaces
.lower()                 Normalize text
.title()                 Standardize capitalization
try/except               Handle invalid data safely
set()                    Remove duplicates
Validation               Ensure data quality
Type conversion          Convert strings to numbers
Boolean normalization    Standardize True/False
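As a quick recap, each concept in the table can be seen in isolation on values taken from the sample dataset. This is only a compact illustration; the variable names are chosen for this example.

```python
# Raw values copied from the first sample record.
raw_name = " Amit Sharma "
raw_email = "AMIT@example.COM "
raw_skills = ["Python", "SQL", "python"]
raw_flag = "TRUE"

name = raw_name.strip().title()                 # strip spaces, fix capitalization
email = raw_email.strip().lower()               # strip spaces, normalize case
skills = sorted(set(s.lower() for s in raw_skills))  # remove duplicates
is_active = raw_flag.strip().lower() in ("true", "1", "yes")  # normalize boolean

# Type conversion with try/except: invalid text falls back to 0.
try:
    projects = int("five")
except ValueError:
    projects = 0

print(name, email, skills, is_active, projects)
```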

How Professional Data Scientists Think

Weak programmers:

  • Just print data

Strong Data Scientists:

  • Inspect data
  • Validate data
  • Normalize formats
  • Detect anomalies
  • Build pipelines
  • Generate insights

Real Data Science is:

Raw Data → Clean Data → Insights → Decisions

Final Thoughts

If you want to become a strong Data Scientist, master:

  • JSON handling
  • Data structures
  • Cleaning pipelines
  • Validation
  • Relationships
  • Data transformations

Machine Learning becomes much easier once your data understanding becomes strong.

Remember:

The quality of your model depends on the quality of your data.

Clean data creates powerful systems.

Practice Tasks

Try solving these yourself:

  1. Find invalid emails
  2. Count employees department-wise
  3. Find highest salary employee
  4. Count skill frequency
  5. Find employees with missing age
  6. Create an analytics report dictionary
  7. Export cleaned data to a new JSON file

Written by Aditya Data Scientist
