Fixing Unicode normalization mismatches between NFC and NFD

Quick answer

💡Unicode normalization errors occur when two strings represent the same visible text using different code point sequences. The letter é can be either U+00E9 (precomposed, NFC form) or U+0065 + U+0301 (decomposed, NFD form). Normalize all strings to NFC before comparison or storage using str.normalize('NFC') in JavaScript or unicodedata.normalize('NFC', s) in Python.

Error symptoms

✕Two strings that look identical in the UI are not equal under === comparison
✕A username or filename cannot be found even though it visually appears in the list
✕File exists on macOS but os.path.exists returns False on Linux for the same path
✕Database UNIQUE constraint violation when inserting a string that looks identical to an existing row
✕String comparison in passwords or tokens fails intermittently depending on the keyboard or OS used to input them
✕String.indexOf returns -1 for a substring that is visually present in the target string

Common causes

•macOS HFS+ stores filenames in a variant of NFD, and macOS APIs often write decomposed names to APFS as well; Linux ext4 stores filenames as bytes without normalization
•Text pasted from different operating systems or applications may carry different normalization forms
•Some keyboard layouts and input methods produce decomposed characters; others produce precomposed
•Database collation settings may not apply normalization before comparison
•JavaScript string comparison is binary (code unit by code unit) with no normalization
•Third-party APIs may return strings in a different normalization form than your application stores

When it happens

•When syncing files between macOS and Linux using Git, rsync, or cloud storage
•When storing user-provided names, addresses, or other text that includes accented characters
•When comparing authentication tokens or password-derived strings that contain non-ASCII characters
•When implementing search functionality that must find content regardless of how the user typed the query
•When a CI pipeline runs on Linux but developers work on macOS, causing test failures specific to one platform

Examples and fixes

NFC and NFD forms of the same character produce different byte sequences and fail equality checks.

Identical-looking strings that are not equal

❌ Wrong

const nfc = '\u00E9';         // e with acute: single precomposed code point
const nfd = '\u0065\u0301';   // e + combining acute accent (two code points)

console.log(nfc === nfd);       // false
console.log(nfc.length);        // 1
console.log(nfd.length);        // 2
// Database lookup using nfd will miss nfc-stored rows
const row = db.query('SELECT * FROM users WHERE name = ?', [nfd]);
// returns undefined even though 'é' is in the table

✅ Fixed

const nfc = '\u00E9';
const nfd = '\u0065\u0301';

// Normalize to NFC at input boundaries
function normalizeInput(str) {
  return typeof str === 'string' ? str.normalize('NFC') : str;
}

const query = normalizeInput(nfd); // now matches stored NFC value
const name  = normalizeInput(nfc); // unchanged, already NFC
console.log(query === name);        // true
// Query with the normalized value:
const row = db.query('SELECT * FROM users WHERE name = ?', [query]);

Unicode allows the same visible character to be encoded as a precomposed code point (NFC) or as a base character plus combining mark (NFD). JavaScript === compares code units, not visual equivalence, so nfc !== nfd even though they look identical. Calling normalize('NFC') on both strings collapses NFD sequences into precomposed form, making them byte-identical. Apply normalization at the earliest input boundary — form submission, API ingestion, or file upload — rather than at every comparison site, so the stored data is always in a consistent form.

Directory listings on macOS often contain NFD filenames; Linux ext4 returns filenames exactly as stored.

File path comparison between macOS and Linux

❌ Wrong

const fs = require('fs');
// File created on macOS: the on-disk name is often NFD (cafe + combining accent)
const files = fs.readdirSync('./uploads');
const target = 'caf\u00E9.txt'; // NFC: precomposed é
const found = files.find(f => f === target);
console.log(found); // undefined when the listing holds the NFD form

✅ Fixed

const fs = require('fs');
const files = fs.readdirSync('./uploads');
const target = 'caf\u00E9.txt'; // NFC
// Normalize both sides to NFC before comparison
const found = files.find(f => f.normalize('NFC') === target.normalize('NFC'));
console.log(found); // 'café.txt' found correctly on both macOS and Linux
// For sorting:
files.sort((a, b) =>
  a.normalize('NFC').localeCompare(b.normalize('NFC'), undefined, { sensitivity: 'base' })
);

macOS HFS+ decomposes filenames to a variant of NFD when files are created. APFS stores the name it is given and treats canonically equivalent names as the same file, but many macOS frameworks pass decomposed names to the filesystem, so a directory listing on macOS often contains NFD names. Linux ext4 stores filenames as raw bytes without applying any normalization, so NFC bytes stay NFC and NFD bytes stay NFD. Direct string comparison therefore fails for accented filenames across platforms. The fix normalizes both the directory listing and the search target to NFC before comparison. This is idempotent on already-NFC strings and converts NFD strings to their canonical precomposed form.

What causes Unicode normalization mismatches

The Unicode standard defines multiple canonical representations for characters that can be expressed as a base character plus one or more combining marks. The letter ñ can be the single code point U+00F1 (precomposed) or the sequence U+006E + U+0303 (Latin small letter n followed by combining tilde). Both are valid Unicode and both look identical. The choice between them is a normalization form.

Unicode defines four normalization forms. NFC (Canonical Decomposition followed by Canonical Composition) produces precomposed characters where possible; it is the compact form, the one the W3C recommends for web content, and what most keyboards and applications produce. NFD (Canonical Decomposition) decomposes all characters to their base plus combining marks; macOS HFS+ stores filenames in a variant of it, and many macOS APIs still hand decomposed names to APFS. NFKC and NFKD additionally apply compatibility mappings, replacing characters such as the fraction one-half (U+00BD) with the sequence 1, fraction slash (U+2044), 2. NFKC is used by some search engines and identifier parsers where compatibility equivalence matters more than round-trip fidelity.
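The effect of the four forms is easiest to see on a string that contains both a canonical and a compatibility case. The following is a small sketch, assuming a sample of the ﬁ ligature (U+FB01) followed by a precomposed é (U+00E9); the variable names are illustrative only.

```javascript
// Sample: U+FB01 (fi ligature) followed by U+00E9 (precomposed é).
const sample = '\uFB01\u00E9';

// Length in UTF-16 code units under each of the four normalization forms.
const formLengths = {};
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  formLengths[form] = sample.normalize(form).length;
}

// NFC keeps both code points, NFD splits the é, NFKC splits the ligature,
// and NFKD splits both.
console.log(formLengths);
```

The canonical forms leave the ligature alone, which is why NFC is the safer choice for stored text.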

The practical problem is that these forms are not equal under byte comparison. When a macOS user creates a file named 'résumé.pdf' and a Linux service tries to find it by the same NFC string, the comparison fails because macOS returned an NFD filename. Neither is wrong from a Unicode perspective, but the application has no normalization layer to reconcile them.

Text pasted from different sources is another normalization mismatch source. Text copied from a PDF may be in NFD if the PDF renderer decomposed characters. Text typed on a French keyboard on Windows is in NFC. Text entered via iOS keyboard may differ from macOS. When users paste a string into a search field that is compared against database-stored NFC values, the search finds nothing if the pasted text happens to be NFD.

Diagnosing normalization form differences

The most direct diagnostic is to print the code point sequence of the string rather than the rendered text. In JavaScript: [...str].map(c => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4,'0')).join(' '). An NFC é prints as U+00E9. An NFD é prints as U+0065 U+0301. If you see sequences of base character plus combining mark (U+0300 through U+036F range) where you expected precomposed characters, the string is in NFD or has been partially decomposed.
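Wrapped as reusable helpers, the diagnostic might look like the sketch below. The function names toCodePoints and hasCombiningMarks are hypothetical, and the mark check uses the Unicode property escape \p{M}, which assumes an ES2018-capable engine.

```javascript
// Render a string as a space-separated list of U+XXXX code points.
function toCodePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

// True if the string contains any combining mark (general category M).
function hasCombiningMarks(str) {
  return /\p{M}/u.test(str);
}

console.log(toCodePoints('\u00E9'));       // precomposed é
console.log(toCodePoints('e\u0301'));      // e + combining acute accent
console.log(hasCombiningMarks('e\u0301')); // the decomposed form contains a mark
```

Logging both values next to a failing comparison usually shows at a glance which side is decomposed.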

In Python 3, use import unicodedata; [unicodedata.name(c) for c in s] to print the Unicode name of each code point. This shows 'COMBINING ACUTE ACCENT' when a combining mark is present. The unicodedata.is_normalized('NFC', s) function (Python 3.8+) returns True or False without modifying the string, which is useful for logging or assertions.

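For database comparison issues, run the comparison directly in SQL. In PostgreSQL, write the two forms with Unicode escapes so the client cannot change them: SELECT U&'e\0301' = U&'\00E9'. With the default collations, including C, POSIX, and the predefined ICU collations such as "und-x-icu", this returns false, because deterministic collations break ties byte by byte. Canonical equivalence comparison requires a nondeterministic ICU collation (PostgreSQL 12+), for example CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false), after which SELECT U&'e\0301' = U&'\00E9' COLLATE ndcoll returns true. PostgreSQL 13+ also provides normalize(s, NFC) and the s IS NFC NORMALIZED predicate for checking stored values.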

For macOS vs Linux file path issues, print the hex bytes of the filename. In Bash: ls uploads/ | xxd | head. A decomposed é appears as 65 cc 81 (NFD), which is common for names created on macOS. A precomposed é appears as c3 a9 (NFC in UTF-8), which is typical for names created on Linux. This confirms which form is actually stored on disk.
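The same byte-level check can be done from JavaScript when shell access is inconvenient. This is a minimal sketch, assuming the global TextEncoder is available (browsers and current Node.js versions); utf8Hex is a hypothetical helper name.

```javascript
// Return the UTF-8 encoding of a string as lowercase hex bytes.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str), (byte) =>
    byte.toString(16).padStart(2, '0')
  ).join(' ');
}

console.log(utf8Hex('\u00E9'));  // bytes of the precomposed form
console.log(utf8Hex('e\u0301')); // bytes of the decomposed form
```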

Normalizing at the right application boundaries

The correct fix is to normalize all strings to NFC at the points where external input enters your system. This means normalizing at form submission handlers, at file upload processing, at API ingestion points, and at search query preprocessing. Once data is normalized to NFC at entry, all downstream comparison and storage operations work correctly without further normalization overhead.

In JavaScript use str.normalize('NFC') which is available in all modern browsers and Node.js 0.12+. For filenames from the filesystem, normalize after readdir: files.map(f => f.normalize('NFC')). For search queries, normalize the query and also ensure stored values were normalized at write time. For user-facing input, normalize in a centralized input handler or Express middleware.

In Python 3 use unicodedata.normalize('NFC', s). For file operations normalize paths: unicodedata.normalize('NFC', filename) before constructing the full path. For Django and Flask, apply normalization in form validation or a request middleware class.

For database storage, the safest approach combines application-level NFC normalization with a database collation that performs canonical equivalence. In PostgreSQL, a nondeterministic ICU collation (PostgreSQL 12+, created with deterministic = false) makes two canonically equivalent strings compare as equal even without application normalization, providing a safety net; the predefined ICU collations are deterministic and do not. In MySQL, utf8mb4_unicode_ci performs case-insensitive Unicode comparison but may not perform full canonical normalization in all versions. Treat application-level normalization as the primary safeguard and the database collation as a secondary check.

For the macOS and Linux cross-platform file sync problem, normalize filenames to NFC before any filesystem operations that compare against user-provided or database-stored values. When building paths from user input or database values always normalize: path.join(uploadDir, filename.normalize('NFC')).

NFKC data loss and other normalization edge cases

NFKC normalization can cause unexpected data loss in some contexts. NFKC maps compatibility equivalents: the trademark sign (U+2122) becomes 'TM', the non-breaking space (U+00A0) becomes a regular space, the subscript 2 in chemical formulas (U+2082) becomes the digit '2', and ligatures like 'ﬁ' (U+FB01) are split into two letters. The copyright symbol (U+00A9) has no compatibility mapping and is left unchanged. If your application uses these characters intentionally, as in a chemistry editor, a legal document tool, or a typography system, applying NFKC would corrupt the data. Use NFC for storage and NFKC only for search normalization where compatibility equivalence is desirable.
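A quick way to judge whether NFKC is safe for a given field is to list what it would change. The sketch below uses a hypothetical helper, nfkcReport, and a handful of sample characters chosen for illustration.

```javascript
// Report how NFKC rewrites each input string.
function nfkcReport(inputs) {
  return inputs.map((before) => {
    const after = before.normalize('NFKC');
    return { before, after, changed: before !== after };
  });
}

const report = nfkcReport([
  '\u2122',  // trade mark sign
  '\u00A0',  // no-break space
  'H\u2082O', // subscript two
  '\uFB01',  // fi ligature
  'caf\u00E9', // plain NFC text, no compatibility characters
]);
console.log(report);
```

Running a report like this over a sample of production data shows how much information an NFKC storage policy would discard.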

Strings that are already in NFC form do not change when normalize('NFC') is called. This makes NFC normalization idempotent and safe to apply repeatedly. Round-tripping is also stable: converting an NFC string to NFD and back to NFC returns the original string. What can surprise is that some precomposed code points are never produced by NFC. Characters on the Unicode composition exclusion list (for example U+0958 DEVANAGARI LETTER QA) and singleton decompositions (for example U+212B ANGSTROM SIGN, which becomes U+00C5) are replaced by NFC itself, so text containing them changes the first time it is normalized. Hangul syllables are not affected: they decompose and recompose algorithmically.
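The cases where a precomposed code point does not survive a trip through NFD and back can be checked directly. This is an illustrative sketch with a hypothetical roundTrip helper; the two non-round-tripping samples are the Angstrom sign (a singleton decomposition) and Devanagari letter QA (a composition exclusion).

```javascript
// Decompose, recompose, and report whether the original string came back.
function roundTrip(str) {
  const recomposed = str.normalize('NFD').normalize('NFC');
  return { recomposed, unchanged: recomposed === str };
}

console.log(roundTrip('\u00E9')); // é round-trips
console.log(roundTrip('\u212B')); // Angstrom sign comes back as U+00C5
console.log(roundTrip('\u0958')); // Devanagari QA stays decomposed
```

Because these inputs are not in NFC to begin with, normalizing once at the input boundary removes the surprise for everything stored afterwards.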

Many scripts have few or no precomposed forms. Hebrew points, most Devanagari vowel signs, and Thai vowel and tone marks are separate code points in every normalization form, so for typical text in these scripts NFC and NFD are identical. There are exceptions: Arabic has precomposed letters such as U+0622 (alef with madda above), Devanagari has a few such as U+0929, and canonical reordering can change the order of multiple combining marks on one base. Normalization therefore still matters for these scripts, and NFKC may additionally differ from NFC for compatibility characters such as Arabic presentation forms.

Password normalization requires special care. The older SASLprep profile (RFC 4013), used in SASL and many authentication protocols, applies NFKC to passwords, while the PRECIS OpaqueString profile (RFC 8265, built on the PRECIS framework in RFC 8264) applies NFC. Choose one profile and apply it everywhere. If a user sets a password through an input path that produces NFD characters and then authenticates from a device that produces NFC input, both must be normalized to the same form before hashing to produce the same hash. Failure to normalize at both password setting and verification time causes intermittent authentication failures on cross-platform deployments.
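In code, the rule amounts to fixing the byte sequence before it reaches the hash or key-derivation function. The sketch below is an assumption-laden illustration: passwordBytes is a hypothetical helper, it uses NFC only, and a complete PRECIS or SASLprep implementation involves further steps that are not shown.

```javascript
// Produce the bytes to feed into the password hash or KDF.
// Normalizing to NFC first makes the bytes independent of the input form.
function passwordBytes(password) {
  if (typeof password !== 'string') {
    throw new TypeError('password must be a string');
  }
  return new TextEncoder().encode(password.normalize('NFC'));
}

// Same visible password, typed in precomposed and decomposed form.
const atRegistration = passwordBytes('caf\u00E9-secret');
const atLogin = passwordBytes('cafe\u0301-secret');
console.log(atRegistration.length, atLogin.length);
```

Call the same helper at registration and at login so that both code paths hash identical bytes.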

Common normalization implementation mistakes

Applying normalization at comparison time rather than at write time is the most common mistake. If normalization is only applied during lookup but not during storage, the database accumulates both NFC and NFD forms. Future comparisons that apply normalization at query time will match both forms, but direct SQL WHERE clauses without normalization will miss NFD entries. The correct approach is to normalize at the earliest opportunity: when data enters the system through API, form, or file upload.
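If a table has already accumulated mixed forms, a one-off cleanup pass can bring it back to NFC. The following sketch works on plain row objects and assumes a hypothetical findUnnormalized helper; how the rows are read and written back depends on your database layer and is not shown.

```javascript
// Return corrected copies of the rows whose field is not already in NFC.
function findUnnormalized(rows, field) {
  return rows
    .filter((row) =>
      typeof row[field] === 'string' &&
      row[field] !== row[field].normalize('NFC')
    )
    .map((row) => ({ ...row, [field]: row[field].normalize('NFC') }));
}

const rows = [
  { id: 1, name: 'caf\u00E9' },  // already NFC
  { id: 2, name: 'cafe\u0301' }, // NFD, needs rewriting
  { id: 3, name: null },         // non-string values are skipped
];
console.log(findUnnormalized(rows, 'name'));
```

Be aware that rewriting rows can surface duplicates under a UNIQUE constraint when both forms of the same name were stored.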

Using the wrong normalization form for the use case is another common mistake. Using NFD for storage instead of NFC causes problems when text is processed by tools that do not normalize before comparison, because most tools compare bytes and most incoming text is NFC. The W3C recommends NFC for web content, and text produced by keyboards and web forms is usually NFC already. Databases and JSON parsers generally do not normalize at all, so they compare whatever form they receive. NFD is only appropriate when interoperating with macOS filesystem APIs or other systems that specifically produce NFD output.

Assuming that normalize() handles all equivalence cases is incorrect. Unicode defines two equivalence relations: canonical equivalence, handled by NFC and NFD, and compatibility equivalence, handled by NFKC and NFKD. normalize() does not address other kinds of similarity: case differences (A vs a), Unicode confusables (l vs I vs 1), and variation selectors. For security-sensitive comparisons like usernames, also apply case folding and confusable character detection as defined in Unicode Technical Standard 39.

Not testing with actual non-ASCII input that includes combining marks is why these bugs survive code review. Add test cases using café, résumé, naïve, Zürich, and names from Arabic, Hebrew, Thai, and Korean to your test suite. Each of these will expose normalization, length, sorting, and slicing bugs that ASCII-only tests miss entirely.
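A starting point for such fixtures is sketched below. It derives the decomposed form of each word at runtime, so the test does not depend on how the source file itself was saved; the names fixtures and checkFixture are hypothetical.

```javascript
// Accented test words, written with escapes so the source is unambiguous NFC.
const fixtures = ['caf\u00E9', 'r\u00E9sum\u00E9', 'na\u00EFve', 'Z\u00FCrich'];

// Compare the NFC form of a word with its NFD form.
function checkFixture(nfc) {
  const nfd = nfc.normalize('NFD');
  return {
    differsRaw: nfc !== nfd,                    // raw comparison fails
    equalAfterNfc: nfc === nfd.normalize('NFC'), // normalized comparison succeeds
    nfcLength: nfc.length,
    nfdLength: nfd.length,
  };
}

for (const word of fixtures) {
  console.log(word, checkFixture(word));
}
```

Feeding both forms of each word through the code under test covers the equality, length, and slicing cases in one pass.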

Building normalization into your application layer

Establish a single normalization policy for your entire application and document it. The recommended policy is: normalize all text to NFC at input boundaries, store NFC in the database, compare NFC strings directly. This means every string in the system is in a known canonical form and comparisons are reliable without additional runtime normalization overhead.

Create a utility function that enforces normalization and apply it in middleware: function normalizeText(str) { return typeof str === 'string' ? str.normalize('NFC') : str; }. In Express, apply this in a body parser middleware. In Django or Flask, apply it in a form clean method or a custom middleware class. For REST APIs, apply normalization in the request deserialization layer before any business logic runs.
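Request bodies are usually nested, so the utility needs to walk objects and arrays. The sketch below assumes JSON-like data (plain objects, arrays, and primitives) and an Express-style (req, res, next) middleware signature; normalizeDeep and normalizeBody are hypothetical names, and the middleware is assumed to run after a body parser.

```javascript
// Recursively normalize every string (keys and values) to NFC.
// Intended for JSON-like data; class instances would be flattened.
function normalizeDeep(value) {
  if (typeof value === 'string') return value.normalize('NFC');
  if (Array.isArray(value)) return value.map(normalizeDeep);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key.normalize('NFC'),
        normalizeDeep(inner),
      ])
    );
  }
  return value; // numbers, booleans, null, undefined pass through
}

// Express-style middleware: normalize the parsed body, then continue.
function normalizeBody(req, res, next) {
  req.body = normalizeDeep(req.body);
  next();
}
```

Registering the middleware once, ahead of the route handlers, keeps individual handlers free of normalization calls.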

For search functionality, build a separate search key. query.normalize('NFKC').toLowerCase() adds compatibility equivalence and basic case-insensitivity, so a search for 'café' finds 'CAFÉ' and the ligature 'ﬁ' matches 'fi'. It does not remove accents: for 'cafe' to find 'café', decompose and strip combining marks as well, for example query.normalize('NFKD').replace(/\p{M}/gu, '').toLowerCase(). Store the normalized search form in a separate indexed column if you need high-performance lookup, and the original NFC form in the display column.
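Accent-insensitive matching additionally needs the combining marks removed after decomposition. The sketch below shows one way to build such a key; searchKey is a hypothetical name, toLowerCase is only an approximation of full Unicode case folding, and stripping marks is lossy for languages where diacritics distinguish words.

```javascript
// Build a lossy key for matching: compatibility-decompose, drop combining
// marks, then lowercase. Use it for lookup only, never for display.
function searchKey(text) {
  return text
    .normalize('NFKD')
    .replace(/\p{M}+/gu, '')
    .toLowerCase();
}

console.log(searchKey('caf\u00E9'));  // precomposed input
console.log(searchKey('CAFE\u0301')); // decomposed, upper-case input
console.log(searchKey('\uFB01n'));    // ligature input
```

Apply the same function to the stored column and to the incoming query so both sides of the comparison use identical rules.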

For file system operations on cross-platform codebases, normalize all paths that originate from user input or database retrieval to NFC before any filesystem operation. Do not normalize paths that come directly from fs.readdir unless you are comparing them against user-provided values; use the name exactly as returned when opening the file. For internationalized applications, note that String.prototype.normalize implements the Unicode normalization algorithm, including composition exclusions and Hangul composition, while locale-aware collation through Intl.Collator and localeCompare depends on the ICU data in the Node.js build. Official builds include full ICU data by default since Node.js 13; older or custom builds may carry only a subset.
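One way to honor both rules, comparing in NFC while opening files by their real names, is to index a directory listing by normalized name. This is a sketch with a hypothetical indexByNfc helper; it takes the array of names you would get from a directory listing and leaves the actual filesystem calls to the caller.

```javascript
// Map NFC-normalized names to the exact names found on disk.
function indexByNfc(dirEntries) {
  const index = new Map();
  for (const name of dirEntries) {
    const key = name.normalize('NFC');
    // Keep the first entry if two on-disk names normalize to the same key.
    if (!index.has(key)) index.set(key, name);
  }
  return index;
}

// Example listing with an NFD name, as often seen on macOS.
const listing = ['cafe\u0301.txt', 'notes.txt'];
const byName = indexByNfc(listing);

// Look up with an NFC name from user input or the database.
const onDisk = byName.get('caf\u00E9.txt'.normalize('NFC'));
console.log(onDisk === listing[0]);
```

The value returned by the lookup is the untouched on-disk name, so it can be passed to filesystem calls on any platform.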

Quick fix checklist

✓Print code points with [...str].map(c => c.codePointAt(0).toString(16)) to detect combining marks in NFD strings.
✓Apply str.normalize('NFC') to all text at input boundaries before storage or comparison.
✓Normalize filenames after fs.readdirSync on macOS to convert NFD to NFC.
✓Normalize search queries to NFC or NFKC before comparing against stored values.
✓Check database collation: use a nondeterministic ICU collation in PostgreSQL for canonical equivalence.
✓Add test fixtures with café, résumé, naïve to catch normalization regressions.
✓Normalize passwords to NFC at both registration and login to prevent cross-platform auth failures.
✓Avoid NFKC for storage in applications that use trademark signs, superscripts, or ligatures.

Frequently asked questions

Why do two identical-looking strings fail equality check in JavaScript?

JavaScript === comparison is binary: it compares UTF-16 code units, not visual equivalence. A character like é can be encoded as one precomposed code point (NFC) or as a base character plus combining accent mark (NFD). Both look identical but have different code unit sequences and different lengths. Call str.normalize('NFC') on both strings before comparing to collapse them to the same canonical form.

What is the difference between NFC, NFD, NFKC, and NFKD?

NFC and NFD apply canonical equivalence: NFC composes decomposed sequences into precomposed characters, NFD decomposes them. NFKC and NFKD additionally apply compatibility mappings, converting ligatures, fractions, and width variants to simpler forms. Use NFC for storage and general comparison. Use NFKC for search normalization. Avoid NFKC for storage because it loses information by collapsing distinct characters.

Why does a file exist on macOS but not on Linux with the same name?

macOS HFS+ stores filenames in a variant of NFD, and files created through macOS APIs on APFS often carry NFD names as well. Linux ext4 stores filenames as raw bytes without normalization, so a name copied over from macOS stays NFD. Code that constructs an NFC path from a database value and tries to find the NFD filesystem entry fails. Normalize both the directory listing and the target name to NFC before comparison: files.find(f => f.normalize('NFC') === target.normalize('NFC')).

Does calling normalize('NFC') on an already-NFC string cause any issues?

No. NFC normalization is idempotent: calling it on an already-NFC string returns an identical string. It is safe to normalize every string at input boundaries without checking whether normalization is already applied. The cost is small for typical input lengths, so apply it unconditionally rather than conditionally checking first.

How does Unicode normalization affect password hashing?

If a password contains accented characters and is entered on macOS in NFD form but verified against a hash from NFC input on another platform, verification fails. Apply NFC normalization to the password string before passing it to the hash function at both registration and login time. This ensures consistent bytes regardless of which keyboard or OS the user typed on. Never normalize after hashing, only before.

What are combining characters and which Unicode range are they in?

Combining characters attach to the preceding base character to modify its appearance. The primary range is U+0300 to U+036F (combining diacritical marks). Additional ranges include U+1DC0-U+1DFF, U+20D0-U+20FF, and script-specific ranges for Arabic, Hebrew, and Devanagari. Their presence in a string is the signature of NFD or partially decomposed text and causes binary equality comparisons to fail against NFC strings.

Does MySQL unicode_ci collation handle normalization?

MySQL utf8mb4_unicode_ci performs case-insensitive comparison using Unicode rules but does not apply full canonical normalization. Two strings differing only in NFC vs NFD form may compare as unequal. Apply NFC normalization in application code before inserting or querying. PostgreSQL can compare by canonical equivalence at the database level when the column uses a nondeterministic ICU collation (PostgreSQL 12+), which works as an additional safety net; the predefined 'und-x-icu' collation is deterministic and does not.

Can NFKC normalization lose data?

Yes. NFKC collapses compatibility variants: the trademark sign becomes 'TM', non-breaking space becomes regular space, superscript digits become regular digits, and ligatures are split into separate letters. For chemistry, math, or legal document applications where these characters carry specific meaning, NFKC is destructive. Use NFC for storage and apply NFKC only transiently for search query normalization where compatibility equivalence is desired.

Last updated: 2026-05-06.