Text Encoding Nightmares: Fixing Scrambled Characters in CSVs
The Mojibake Mystery
Have you ever opened a CSV file only to find that "Café" has turned into "Café", or that curly quotes have been replaced by bizarre wingding symbols? This phenomenon is called "Mojibake," and it is the result of mismatched text encodings.
What is Text Encoding?
Computers only understand numbers. Text encoding is the rulebook that tells the computer which number corresponds to which letter. For decades, the western world used a rulebook called Windows-1252 (or ANSI). Today, the global standard is UTF-8, which can represent almost every character in every human language, plus emojis.
Why Excel Ruins CSVs
The problem usually starts with Microsoft Excel. When you double-click a CSV file, older versions of Excel often assume it is encoded in Windows-1252, even if it is actually UTF-8. If you save the file again, Excel permanently scrambles the non-standard characters.
How to Fix It
The best way to fix encoding issues is to prevent them. Always use the "Get Data from Text/CSV" import wizard in Excel, which allows you to explicitly set the "File Origin" to 65001 : Unicode (UTF-8).
If the damage is already done, DataScrub's Text Cleaner includes tools to strip unpredictable special characters, normalize unicode, and remove accents (e.g., converting "Café" to "Cafe") to ensure maximum compatibility with legacy systems.