DataScrub

A Beginner's Guide to Data Profiling

Analysis

What is Data Profiling?

Data profiling is the process of examining your data and collecting statistics and informative summaries about it. Think of it as an X-ray for your spreadsheet. Before you start writing formulas or Python scripts to clean data, you need a high-level map of where the problems lie.

Key Metrics to Look For

  • Completeness: What percentage of cells in each column are null or empty? If a column is 90% empty, it might not be worth keeping.
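  • Uniqueness: How many distinct values exist? If a "User ID" column has 10,000 rows but only 9,998 distinct values, two rows repeat an ID that already appears elsewhere (either two IDs that each occur twice, or one ID that occurs three times), and those rows need investigating.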
  • Type Consistency: Does a numeric column actually contain only numbers? Profiling reveals the hidden text strings ("N/A", "TBD") that break calculations.
  • Distribution and Outliers: What are the minimum, maximum, mean, and median values? A simple outlier rule such as the interquartile range (IQR) method, which flags values more than 1.5 × IQR below the first quartile or above the third, will highlight a $1,000,000 transaction in a dataset where the median is $50.
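
To make the four metrics above concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration, not how any particular profiling tool works. The function name `profile_column`, the sample `amounts` list, and the rule that only `None` and blank strings count as missing are all assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean, median, quantiles


def profile_column(values):
    """Return basic profiling metrics for one column (a list of raw cell values)."""
    # Completeness: None and blank strings count as missing (an assumption).
    present = [v for v in values
               if v is not None and not (isinstance(v, str) and v.strip() == "")]
    completeness_pct = 100 * len(present) / len(values) if values else 0.0

    # Uniqueness: count each distinct value among the populated cells.
    counts = Counter(present)
    duplicated = {value: n for value, n in counts.items() if n > 1}

    # Type consistency: anything that is not a finite number is flagged.
    numbers, non_numeric = [], []
    for v in present:
        try:
            number = float(v)
        except (TypeError, ValueError):
            non_numeric.append(v)
            continue
        if math.isfinite(number):
            numbers.append(number)
        else:
            non_numeric.append(v)

    report = {
        "rows": len(values),
        "completeness_pct": completeness_pct,
        "distinct": len(counts),
        "surplus_rows": len(present) - len(counts),  # rows repeating an earlier value
        "duplicated": duplicated,
        "non_numeric": non_numeric,
    }

    # Distribution and outliers: the IQR rule needs at least two numbers.
    if len(numbers) >= 2:
        q1, _, q3 = quantiles(numbers, n=4)
        spread = 1.5 * (q3 - q1)
        low, high = q1 - spread, q3 + spread
        report.update(
            min=min(numbers),
            max=max(numbers),
            mean=mean(numbers),
            median=median(numbers),
            outliers=[x for x in numbers if x < low or x > high],
        )
    return report


# Hypothetical transaction amounts with a placeholder string and two gaps.
amounts = [45, 50, 55, 48, 52, "N/A", None, 50, 1_000_000, ""]
report = profile_column(amounts)
print(report["completeness_pct"])  # 80.0
print(report["duplicated"])        # {50: 2}
print(report["non_numeric"])       # ['N/A']
print(report["outliers"])          # [1000000.0]
```

The quartiles come from `statistics.quantiles` with its default method, so a spreadsheet or another library may report slightly different Q1 and Q3 values for the same column. Note also that "N/A" is treated here as a populated but non-numeric cell. If your data uses such placeholders to mean "missing", you would count them under completeness instead.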

The DataScrub Profiler

While you can calculate these metrics manually using pivot tables, DataScrub's built-in Data Profiler generates a comprehensive health report in seconds, directly in your browser. It automatically flags mixed types, outliers, and duplicates, giving you a clear to-do list for your data cleaning process.