What is Unicode normalization?
Unicode normalization is the process of transforming Unicode text into a standardized form to ensure consistent representation. This is important because many Unicode characters can be represented in multiple ways, which can cause issues with text comparison, searching, and sorting.
Examples of Unicode normalization
Each example below can be stored in more than one way; the variants often look identical on screen but differ in their underlying code points:

- é (e + ´) - composed vs. decomposed
- ñ (n + ˜) - composed vs. decomposed
- ö (o + ¨) - composed vs. decomposed
- Å (A + ˚) - composed vs. decomposed
- ﬁ (fi ligature) - compatibility decomposition
- ½ (1/2 fraction) - compatibility decomposition
- ① (circled digit) - compatibility decomposition
- ｈｅｌｌｏ (fullwidth) - compatibility decomposition

Unicode Normalization Forms
| Form | Description | Example |
|---|---|---|
| NFC | Canonical Decomposition followed by Canonical Composition | Converts "e" + "´" to "é" (single character) |
| NFD | Canonical Decomposition | Converts "é" to "e" + "´" (two characters) |
| NFKC | Compatibility Decomposition followed by Canonical Composition | Converts "fi" (ligature) to "fi" (two characters) |
| NFKD | Compatibility Decomposition | Converts "½" to "1⁄2" (three characters: "1", fraction slash, "2") |
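Python's standard `unicodedata` module implements all four forms, so the table's examples can be checked directly:

```python
import unicodedata

# Canonical composition: "e" + combining acute accent becomes one code point.
decomposed = "e\u0301"              # "e" + U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(composed))      # é 1

# Canonical decomposition: the precomposed é splits back into two code points.
back = unicodedata.normalize("NFD", "\u00e9")
print(len(back))                    # 2

# Compatibility decomposition: the ligature and the fraction expand.
print(unicodedata.normalize("NFKC", "\ufb01"))   # fi (two characters)
print(unicodedata.normalize("NFKD", "\u00bd"))   # 1⁄2 (with U+2044 fraction slash)
```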
Key Concepts in Unicode Normalization
Canonical Equivalence
Characters that should be treated as the same for most purposes. For example, "é" (U+00E9) and "e" followed by a combining acute accent (U+0065 + U+0301) are canonically equivalent.
Compatibility Equivalence
Characters that represent the same abstract character but may have different visual appearances or behaviors. For example, the ligature "ﬁ" (U+FB01) and the two-character sequence "fi" (U+0066 + U+0069) are compatibility equivalent.
Composition
The process of combining multiple characters into a single precomposed character when possible.
Decomposition
The process of breaking down precomposed characters into their component parts.
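Both kinds of equivalence can be observed directly: canonically equivalent strings compare unequal as raw code points but equal after normalization, while compatibility equivalence only collapses under the K forms.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e + combining acute accent

print(composed == decomposed)                      # False: different code points
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))    # True after normalization

# NFC leaves the ligature alone; only NFKC applies compatibility mappings.
ligature = "\ufb01"  # ﬁ
print(unicodedata.normalize("NFC", ligature) == "fi")   # False
print(unicodedata.normalize("NFKC", ligature) == "fi")  # True
```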
When to use Unicode normalization
- When comparing strings that may contain different representations of the same characters
- When searching or indexing text that may contain diacritical marks
- When sorting text in a language-sensitive way
- When processing user input that may come from different sources or input methods
- When preparing text for storage in databases or files to ensure consistent representation
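As a minimal sketch of the comparison use case (the name and dictionary here are illustrative), normalizing both sides before comparing makes equivalent inputs from different input methods match:

```python
import unicodedata

def canon(s: str) -> str:
    """Normalize to NFC so equivalent representations share one key."""
    return unicodedata.normalize("NFC", s)

# The same name stored and then typed via a different input method:
users = {canon("Andr\u00e9"): "found"}   # stored with precomposed é
typed = "Andre\u0301"                    # typed as e + combining accent

print(typed in users)          # False: raw lookup misses the entry
print(canon(typed) in users)   # True: normalized lookup matches
```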
Choosing the Right Normalization Form
- NFC: Most commonly used for general text processing. Produces composed characters where possible, which is usually more compact and matches what users expect to see.
- NFD: Useful for operations that need to work with individual diacritical marks, such as certain types of sorting or searching.
- NFKC: Good for searching and indexing, as it folds compatibility characters (ligatures, fullwidth forms, circled digits) into their plain equivalents.
- NFKD: Similar to NFKC but with decomposed characters, useful for certain specialized text processing tasks.
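For the search-and-indexing case, a common pattern (the `search_key` helper here is a sketch, not a standard API) combines NFKC with case folding so visually distinct but equivalent inputs produce the same index key:

```python
import unicodedata

def search_key(s: str) -> str:
    # NFKC folds compatibility characters into plain equivalents;
    # casefold() then handles case-insensitive matching.
    return unicodedata.normalize("NFKC", s).casefold()

print(search_key("\uff48\uff45\uff4c\uff4c\uff4f"))  # hello (from fullwidth)
print(search_key("\ufb01le"))                         # file (ligature expanded)
print(search_key("\u2460"))                           # 1 (circled digit)
```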
Note:
While normalization is important for consistent text processing, it can change the visual appearance or meaning of text in some cases, especially when using NFKC or NFKD forms. Be careful when normalizing text that contains special symbols, mathematical notation, or text in languages that rely on specific character forms.