5 Signs Your Dataset is Ruining Your Analytics
Garbage In, Garbage Out
You can have the most beautifully designed Tableau dashboard in the world, but if the underlying data is flawed, your insights are worse than useless: they are actively misleading. Here are 5 signs that your dataset needs a trip through DataScrub.
1. The "Almost Duplicate" Problem
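If your customer count seems inexplicably high, check for near-duplicates. "Apple Inc", "Apple Inc.", and "Apple Incorporated" will be counted as three separate entities by most BI tools. Normalizing text case and stripping punctuation before aggregating merges the first two. Catching "Apple Incorporated" also requires mapping suffix variants to one canonical form, or applying fuzzy matching.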
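As a rough illustration, the normalization step could look like the Python sketch below. The suffix list and the function names are assumptions made for this example, not part of DataScrub or any BI tool, and a real dataset will need a longer list or a fuzzy-matching pass.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to drop; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "ltd", "co"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    # Keep at least one token so a company literally named "Inc" is not erased.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_near_duplicates(names):
    """Map each normalized key to the original spellings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)


customers = ["Apple Inc", "Apple Inc.", "Apple Incorporated", "Acme Corp"]
groups = group_near_duplicates(customers)
for key, variants in groups.items():
    print(key, "<-", variants)
```

Counting the keys of groups instead of the raw names gives the deduplicated customer count. Any key with more than one variant is worth a manual look before you merge the records.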
2. Impossible Dates
If your sales chart shows a massive spike in the year 1900, you have a date parsing issue. Excel stores dates as serial numbers counted from the start of 1900, so a blank or zero value that gets formatted as a date often surfaces as 1900-01-01 or a neighboring day. Always profile your date columns to check the minimum and maximum values.
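A minimal min/max profile might look like the following Python sketch. It assumes the dates arrive as ISO-formatted strings. The 1990 cutoff is an arbitrary placeholder, so pick one that matches when your business actually started recording data.

```python
from datetime import date


def profile_dates(values, earliest=date(1990, 1, 1), latest=None):
    """Return the min, the max, and any dates outside the plausible range."""
    latest = latest or date.today()
    parsed = [date.fromisoformat(v) for v in values]
    outliers = [d for d in parsed if d < earliest or d > latest]
    return {"min": min(parsed), "max": max(parsed), "outliers": outliers}


# Sample values invented for this example.
order_dates = ["2023-04-12", "1900-01-01", "2024-01-05", "2022-11-30"]
report = profile_dates(order_dates)
print(report)
```

A minimum in 1900 or a maximum in the future is usually enough to tell you that something went wrong upstream.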
3. The Mixed-Type Column
A column meant for revenue should only contain numbers. If a user typed "$1,000" (text) instead of 1000 (number), a spreadsheet SUM() will silently skip it and your total will come up short. A good data profiler will flag any column that contains a mix of text and numeric types.
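To show the idea, here is a small Python sketch of such a check. It assumes the column has been read as raw strings, as from a CSV file. The function names are made up for this example, and the numeric test is deliberately simple.

```python
from collections import Counter


def classify(value: str) -> str:
    """Label a raw cell as 'missing', 'number', or 'text'."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def profile_column(values):
    """Count how many cells fall into each label."""
    return Counter(classify(v) for v in values)


def is_mixed(values) -> bool:
    """True when a column holds both numeric and non-numeric cells."""
    counts = profile_column(values)
    return counts["number"] > 0 and counts["text"] > 0


# Sample values invented for this example.
revenue = ["1000", "250.50", "$1,000", "", "75"]
print(profile_column(revenue))
print("mixed types:", is_mixed(revenue))
```

Here "$1,000" is labeled as text because of the currency symbol and the thousands separator. This is the same kind of value that a spreadsheet SUM() would skip.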
4. Suspiciously Perfect Data
Real-world data is messy. If a dataset has zero missing values and perfect distributions, it may have been imputed too aggressively or generated synthetically. Always ask for the rawest form of the data available.
5. Leading Zero Truncation
ZIP codes in the US (like Boston's 02108) are often imported as numbers rather than text, which drops the leading zero (2108). The truncated value no longer matches a real ZIP code, so geospatial mapping breaks. Always import identifiers as text, not numbers.
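The short Python sketch below shows the difference using the standard csv module, which returns every field as a string. The sample rows and the restore_zip helper are invented for this example. Padding only repairs plain 5-digit ZIP codes, so treat it as a fallback rather than a substitute for importing the column as text.

```python
import csv
import io

# Sample CSV content invented for this example.
raw = "customer,zip\nAlice,02108\nBob,90210\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Kept as text: the leading zero survives.
zips_as_text = [row["zip"] for row in rows]

# Converted to numbers: the leading zero is lost.
zips_as_numbers = [int(row["zip"]) for row in rows]


def restore_zip(value) -> str:
    """Left-pad an already truncated 5-digit ZIP code with zeros."""
    return str(value).zfill(5)


print(zips_as_text)
print(zips_as_numbers)
print(restore_zip(2108))
```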