What is Data Profiling?

Data profiling is the process of examining your data and collecting statistics and informative summaries about it. Think of it as an X-ray for your spreadsheet. Before you start writing formulas or Python scripts to clean data, you need a high-level map of where the problems lie.

Key Metrics to Look For

Completeness: What percentage of cells in each column are populated, and what percentage are null or empty? A column that is 90% empty may not be worth keeping.
Uniqueness: How many distinct values exist? If a "User ID" column has 10,000 rows but only 9,998 unique values, two rows repeat an ID that already appears elsewhere. That could be one ID used three times or two IDs used twice each, so the duplicates need investigating.
Type Consistency: Does a numeric column actually contain only numbers? Profiling reveals the hidden text strings ("N/A", "TBD") that break calculations.
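Distribution and Outliers: What are the minimum, maximum, mean, and median values? An outlier rule such as the interquartile range (IQR) test, which flags values more than 1.5 times the IQR below the first quartile or above the third, will immediately highlight a $1,000,000 transaction in a dataset where the median is $50.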
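
To make the four metrics above concrete, here is a minimal sketch in plain Python that computes them for a single column. It is an illustration only, not how DataScrub works internally: the function names (profile_column, is_missing, to_number), the sample values, and the choice of 1.5 as the IQR multiplier are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, median, quantiles


def is_missing(value):
    """Treat None and blank strings as empty cells (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def to_number(value):
    """Return the value as a float, or None if it cannot be parsed as a number.

    Note: float() also accepts strings such as "nan" and "inf".
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def profile_column(values, iqr_multiplier=1.5):
    """Hypothetical helper: summarize one column as a dict of profiling metrics."""
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)

    numbers, non_numeric = [], []
    for v in present:
        n = to_number(v)
        if n is None:
            non_numeric.append(v)
        else:
            numbers.append(n)

    report = {
        "rows": len(values),
        # Completeness: share of cells that are null or blank.
        "pct_missing": 100 * (len(values) - len(present)) / len(values) if values else 0.0,
        # Uniqueness: distinct non-empty values, plus any value seen more than once.
        "distinct": len(counts),
        "repeated_values": {v: c for v, c in counts.items() if c > 1},
        # Type consistency: non-empty cells that do not parse as numbers.
        "non_numeric": sorted({str(v) for v in non_numeric}),
    }

    # Distribution and outliers: quartiles need at least two numeric values.
    if len(numbers) >= 2:
        q1, _, q3 = quantiles(numbers, n=4)
        spread = q3 - q1
        low, high = q1 - iqr_multiplier * spread, q3 + iqr_multiplier * spread
        report.update(
            min=min(numbers),
            max=max(numbers),
            mean=mean(numbers),
            median=median(numbers),
            outliers=[n for n in numbers if n < low or n > high],
        )
    return report


# Made-up "amount" column with a blank, a null, a text placeholder, and one huge value.
amounts = [45, 50, 55, 48, 52, "N/A", None, 1_000_000, 50, ""]
report = profile_column(amounts)
for metric, result in report.items():
    print(f"{metric}: {result}")
```

Running this on the sample column reports the two empty cells, the repeated value 50, the "N/A" placeholder, and the 1,000,000 entry as an outlier. Note how the mean is dragged far above the median by that single value, which is why the two are worth comparing side by side.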

The DataScrub Profiler

While you can calculate these metrics manually using pivot tables, DataScrub's built-in Data Profiler generates a comprehensive health report in seconds, directly in your browser. It automatically flags mixed types, outliers, and duplicates, giving you a clear to-do list for your data cleaning process.