Diagnosing and fixing UTF-8 vs UTF-16 encoding mismatches

Quick answer

Encoding mismatches between UTF-8 and UTF-16 produce mojibake: garbled text where characters appear as sequences of accented Latin characters, boxes, or question marks. Detect the encoding by inspecting the first bytes for a BOM (FF FE = UTF-16 LE, FE FF = UTF-16 BE, EF BB BF = UTF-8 BOM). In Node.js, pass the matching encoding ('utf16le' or 'utf8') to Buffer.from() when encoding and to buffer.toString() when decoding. In Python, pass encoding='utf-8' or encoding='utf-16' to open().

Error symptoms

  • Text appears as sequences of accented Latin characters when opened in an editor (classic mojibake)
  • Python UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0
  • Node.js Buffer output contains null bytes between every character for ASCII text
  • JSON.parse fails with SyntaxError: Unexpected token for a file that looks valid in the editor
  • CSV file opens in Excel with Chinese characters replaced by boxes or question marks
  • First character of a file is the Unicode replacement character U+FFFD or a BOM character

Common causes

  • Reading a UTF-16 encoded file with a UTF-8 decoder, so each two-byte code unit is misread as separate bytes (NULs and invalid sequences)
  • Windows Notepad and older Windows tools saving files as UTF-16 LE with BOM (the encoding their save dialogs label "Unicode")
  • JavaScript charCodeAt returning a single surrogate code unit, not the code point, for characters above U+FFFF
  • Node.js Buffer.from(str, 'binary') treating character codes as raw bytes, corrupting non-ASCII text
  • Python open() using the locale's default encoding (for example cp1252 on Windows, or ASCII or latin-1 on older systems) instead of UTF-8
  • Confusing UTF-16 code units (2 bytes each) with Unicode code points, causing off-by-one errors in string processing

When it happens

  • When reading files generated by Windows applications like Notepad, Excel, or Visual Studio that may save as UTF-16
  • When processing data from Windows Registry exports, Windows event log files, or .NET byte array outputs
  • When a web API returns text in a charset other than UTF-8 and the HTTP client ignores the Content-Type charset
  • When migrating data between a Windows-based system and a Linux-based system without re-encoding
  • When processing internationalized strings in JavaScript and using charCodeAt instead of codePointAt for non-BMP characters

Examples and fixes

Automatically detect the BOM and decode with the correct encoding.

Detecting and reading a UTF-16 LE file in Node.js

❌ Wrong

const fs = require('fs');
// File saved by Windows Notepad as UTF-16 LE with BOM
// Byte sequence: FF FE 48 00 65 00 6C 00 6C 00 6F 00
const content = fs.readFileSync('windows-file.txt', 'utf8');
console.log(content);
// Outputs: '\uFFFD\uFFFDH\x00e\x00l\x00l\x00o\x00' (FF and FE each become U+FFFD)
console.log(JSON.parse(content)); // SyntaxError: U+FFFD and NUL are not valid JSON syntax

✅ Fixed

const fs = require('fs');

function readFileAutoDetect(filePath) {
  const raw = fs.readFileSync(filePath);
  // UTF-16 LE BOM: FF FE
  if (raw[0] === 0xFF && raw[1] === 0xFE) {
    return raw.toString('utf16le').replace(/^\uFEFF/, '');
  }
  // UTF-16 BE BOM: FE FF
  if (raw[0] === 0xFE && raw[1] === 0xFF) {
    return raw.swap16().toString('utf16le').replace(/^\uFEFF/, '');
  }
  // UTF-8 BOM: EF BB BF
  if (raw[0] === 0xEF && raw[1] === 0xBB && raw[2] === 0xBF) {
    return raw.slice(3).toString('utf8');
  }
  return raw.toString('utf8'); // default
}

console.log(readFileAutoDetect('windows-file.txt')); // 'Hello'

UTF-16 LE files start with the byte order mark FF FE, and each ASCII character is followed by a null byte (H encodes as 48 00, e as 65 00, etc.). Reading this as UTF-8 treats FF and FE as invalid bytes (each becomes the replacement character U+FFFD) and then decodes every 00 byte as the NUL character U+0000, producing text interleaved with NULs. The fix inspects the first raw bytes for a BOM before choosing the encoding. UTF-16 BE (FE FF BOM) requires byte-swapping with raw.swap16() before decoding as utf16le because Node.js has no built-in big-endian UTF-16 encoding. Note that swap16() modifies the buffer in place and throws if its length is odd.

charCodeAt returns the surrogate value, not the actual code point, for emoji and other high Unicode characters.

JavaScript codePointAt vs charCodeAt for non-BMP characters

❌ Wrong

const text = 'Hello \u{1F525}'; // 'Hello ' + fire emoji
// charCodeAt returns one 16-bit code unit at a time, so the emoji shows up as two values
console.log(text.charCodeAt(6));      // 55357 (0xD83D, high surrogate)
console.log(text.charCodeAt(7));      // 56613 (0xDD25, low surrogate)
// Building a UTF-16 buffer manually, common in legacy code. The bytes come out
// right, but any logic that treats index i as "character i" splits the emoji.
const buf = Buffer.alloc(text.length * 2);
for (let i = 0; i < text.length; i++) {
  buf.writeUInt16LE(text.charCodeAt(i), i * 2);
}

✅ Fixed

const text = 'Hello \u{1F525}'; // 'Hello ' + fire emoji (U+1F525)
// codePointAt correctly returns the full code point
console.log(text.codePointAt(6)); // 128293 (0x1F525, actual code point)
console.log(text.codePointAt(7)); // 56613 (low surrogate β€” still there)

// Safe iteration over code points:
const codePoints = [];
for (const char of text) { // for...of iterates by code point
  codePoints.push(char.codePointAt(0));
}
console.log(codePoints); // [72, 101, 108, 108, 111, 32, 128293]

// For UTF-16 LE encoding to Buffer:
const buf = Buffer.from(text, 'utf16le');
console.log(buf.length); // 16 bytes (8 code units * 2 bytes; the emoji takes 2 code units)

JavaScript strings are UTF-16 sequences. Characters outside the Basic Multilingual Plane (emoji, many CJK extension characters) are stored as surrogate pairs: two 16-bit code units that together represent one code point. charCodeAt returns the raw 16-bit code unit, so at the first index of a pair it returns the high surrogate (0xD800-0xDBFF) and at the second the low surrogate (0xDC00-0xDFFF), never the character's code point. codePointAt combines the pair when called at the high surrogate's index and returns the full code point. The for...of loop iterates by code points, not code units, so each emoji counts as one iteration step.

Why UTF-8 and UTF-16 mismatches corrupt text

UTF-8 and UTF-16 encode the same Unicode code points using fundamentally different byte representations. UTF-8 uses 1-4 bytes per code point, with ASCII characters occupying exactly one byte. UTF-16 uses 2 bytes for code points in the Basic Multilingual Plane (U+0000 to U+FFFF) and 4 bytes for code points above U+FFFF. When a program reads UTF-16 bytes and interprets them as UTF-8, the result is mojibake because the null bytes that pad ASCII characters in UTF-16 (H in UTF-16 LE is 48 00) are treated as the NUL character in UTF-8.
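To make the byte-level difference concrete, here is a minimal sketch that encodes the same two-character string both ways and then deliberately decodes the UTF-16 LE bytes as UTF-8. The helper name hexBytes and the sample string are illustrative, not part of any library.

```javascript
// Illustrative helper: encode text and show the bytes as space-separated hex.
function hexBytes(text, encoding) {
  const hex = Buffer.from(text, encoding).toString('hex');
  return (hex.match(/../g) || []).join(' ');
}

const sample = 'H\u00E9'; // 'H' followed by e-acute (U+00E9)
console.log(hexBytes(sample, 'utf8'));    // 48 c3 a9
console.log(hexBytes(sample, 'utf16le')); // 48 00 e9 00

// Decode the UTF-16 LE bytes with the wrong decoder.
// Result: 'H', NUL, U+FFFD (e9 is a truncated UTF-8 lead byte), NUL.
const misread = Buffer.from(sample, 'utf16le').toString('utf8');
console.log(misread.length); // 4
```

The ASCII letter survives the mismatch, which is why mostly-English files often look "almost right" with stray NULs, while anything above U+007F turns into replacement characters.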

The Byte Order Mark (BOM) is a mechanism to identify the encoding of a file. A UTF-16 LE file begins with the bytes FF FE, UTF-16 BE with FE FF, and optionally UTF-8 with EF BB BF. When a UTF-16 file without a BOM is read, the program must guess the encoding based on the content, which may fail for short files or files containing only ASCII characters. Windows applications including Notepad, Visual Studio, and Excel frequently produce UTF-16 LE files with BOM when saving non-ASCII content.

In JavaScript, the problem is compounded by the language's internal string representation. JavaScript strings are sequences of 16-bit code units (UTF-16). The length property counts code units, not Unicode code points or visible characters. The charCodeAt method returns a 16-bit code unit value, which is the high surrogate value (in the range 0xD800-0xDBFF) for the first half of a surrogate pair, not the actual code point. codePointAt was introduced in ES6 to provide the correct code point value by combining surrogate pairs.
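The arithmetic that codePointAt performs on a surrogate pair can be sketched by hand. The function names below are illustrative; the constants come from the UTF-16 definition.

```javascript
// Combine a high and low surrogate into the code point they represent.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Split a code point above U+FFFF back into its surrogate pair.
function splitCodePoint(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

const fireEmoji = '\u{1F525}';
const hi = fireEmoji.charCodeAt(0); // 0xD83D
const lo = fireEmoji.charCodeAt(1); // 0xDD25

console.log(combineSurrogates(hi, lo).toString(16)); // 1f525
console.log(combineSurrogates(hi, lo) === fireEmoji.codePointAt(0)); // true
console.log(splitCodePoint(0x1F525).map((u) => u.toString(16))); // [ 'd83d', 'dd25' ]
```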

In Node.js, the Buffer object's encoding argument controls how bytes are interpreted. Buffer.from(str, 'utf8') and Buffer.from(str, 'utf16le') produce completely different byte sequences for the same string. Using the wrong encoding to decode a Buffer produces garbled output. The 'binary' encoding (alias for 'latin1') treats each byte as a single character, which produces incorrect results for any character above U+00FF and should not be used for Unicode text.

Detecting the encoding of a file or byte sequence

The most reliable way to detect encoding is to inspect the BOM bytes. In Node.js: const raw = fs.readFileSync(path); console.log(raw.slice(0, 4).toString('hex')). Compare the first bytes: ef bb bf is a UTF-8 BOM, ff fe is UTF-16 LE, fe ff is UTF-16 BE. If there is no BOM, look at the pattern of bytes for text content: a UTF-16 LE ASCII text file will show alternating content bytes and null bytes (48 00 65 00 = 'He'), while UTF-8 will show only content bytes without nulls.
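The null-byte pattern described above can be checked programmatically. The following is a rough heuristic sketch, not a full detector: guessEncodingWithoutBom and its thresholds are assumptions chosen for illustration, and it only works for text that is mostly ASCII or Latin.

```javascript
// Hypothetical helper: guess the encoding of BOM-less bytes from where NULs fall.
// Returns 'utf16le', 'utf16be', 'utf8', or 'unknown'.
function guessEncodingWithoutBom(bytes, sampleSize = 4096) {
  const n = Math.min(bytes.length, sampleSize) & ~1; // round down to an even count
  if (n < 2) return 'unknown';

  let evenNuls = 0; // NULs at offsets 0, 2, 4, ...
  let oddNuls = 0;  // NULs at offsets 1, 3, 5, ...
  for (let i = 0; i < n; i += 2) {
    if (bytes[i] === 0) evenNuls++;
    if (bytes[i + 1] === 0) oddNuls++;
  }

  const pairs = n / 2;
  if (oddNuls / pairs > 0.5 && evenNuls === 0) return 'utf16le'; // 48 00 65 00
  if (evenNuls / pairs > 0.5 && oddNuls === 0) return 'utf16be'; // 00 48 00 65
  // No NULs at all: plausibly UTF-8, but could be another 8-bit encoding.
  if (evenNuls === 0 && oddNuls === 0) return 'utf8';
  return 'unknown';
}

console.log(guessEncodingWithoutBom(Buffer.from('Hello', 'utf16le'))); // utf16le
console.log(guessEncodingWithoutBom(Buffer.from('Hello', 'utf8')));    // utf8
```

UTF-16 text that is mostly CJK contains few null bytes, so this check reports 'utf8' or 'unknown' for it. Treat the result as a hint and prefer a BOM or a documented encoding when one is available.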

In Python, the chardet library provides encoding detection: import chardet; result = chardet.detect(open('file', 'rb').read(10000)); print(result). For short files chardet can be wrong, so for production code where the encoding source is known (such as files from a specific Windows application) it is better to document and hard-code the encoding assumption rather than auto-detecting.

For JavaScript string inspection, print the code units: Array.from({length: str.length}, (_, i) => str.charCodeAt(i).toString(16)).join(' '). If you see values in the range d800-dbff followed by dc00-dfff, those are surrogate pairs for non-BMP characters. If you see many 0000 values, the string may have been constructed from UTF-16 bytes with null padding.

For HTTP responses, the encoding is declared in the Content-Type header: Content-Type: text/html; charset=UTF-16. Do not assume the client honors it. The fetch API's response.text() always decodes the body as UTF-8 and ignores the declared charset, while XMLHttpRequest's responseText does use it. Other clients such as Axios depend on their adapter and configuration. Always check that the Content-Type of the response matches your decoding assumption, and decode the raw bytes yourself when the body is not UTF-8.

Correcting UTF-8 and UTF-16 encoding handling

In Node.js, always pass the explicit encoding to Buffer and fs methods. Use fs.readFileSync(path, 'utf8') for UTF-8 files. For UTF-16 LE files, pass 'utf16le' instead, or read a Buffer first and call rawBuffer.toString('utf16le'); either way the BOM stays in the string as U+FEFF and must be stripped. Node.js has no built-in big-endian UTF-16 Buffer encoding, so for UTF-16 BE either byte-swap with swap16() or use the iconv-lite package, which provides broader encoding support: require('iconv-lite').decode(rawBuffer, 'UTF-16BE').
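If adding a dependency is not an option, UTF-16 BE can be decoded with built-ins alone. This is a minimal sketch; decodeUtf16Be is a hypothetical helper name, and it copies the input because swap16() works in place.

```javascript
// Hypothetical helper: decode UTF-16 BE bytes using Node's little-endian decoder.
function decodeUtf16Be(bytes) {
  if (bytes.length % 2 !== 0) {
    throw new RangeError('UTF-16 data must have an even number of bytes');
  }
  const copy = Buffer.from(bytes); // copy so the caller's buffer is left untouched
  copy.swap16();                   // swap each byte pair: big-endian -> little-endian
  return copy.toString('utf16le').replace(/^\uFEFF/, ''); // drop the BOM if present
}

const bigEndian = Buffer.from([0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69]); // BOM + 'Hi'
console.log(decodeUtf16Be(bigEndian)); // Hi
console.log(bigEndian[0].toString(16)); // fe (original bytes unchanged)
```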

In Python 3, always pass the encoding parameter to open(): open('file.txt', 'r', encoding='utf-8'). For files that may have a BOM, use the special encoding name 'utf-8-sig' which automatically strips the UTF-8 BOM if present. For UTF-16 files, use encoding='utf-16' which auto-detects the byte order from the BOM, or 'utf-16-le' or 'utf-16-be' for explicit byte order. The io.open function (which is the same as the built-in open in Python 3) supports these encoding names.

For JavaScript string operations on non-BMP characters, replace charCodeAt with codePointAt and use the for...of loop or Array.from for iteration. When building a Buffer from a JavaScript string that may contain emoji or other non-BMP characters, use Buffer.from(str, 'utf8') rather than manually iterating with charCodeAt.

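For converting a UTF-16 file to UTF-8 on the command line, use iconv: iconv -f utf-16le -t utf-8 input.txt > output.txt. On macOS, iconv is built in. On most Linux distributions it ships with glibc. For batch conversion, convert each file to a temporary name and then replace the original: find . -name '*.txt' -exec sh -c 'iconv -f utf-16le -t utf-8 "$1" > "$1.utf8" && mv "$1.utf8" "$1"' _ {} \;. This command converts every .txt file it finds, so restrict the search to files you know are UTF-16; running it on a file that is already UTF-8 would garble it.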

BOM handling, surrogate pairs, and Windows-specific issues

The UTF-8 BOM (EF BB BF) is a frequently misunderstood edge case. While it is technically valid Unicode, it is not recommended for UTF-8 by the Unicode standard because UTF-8 has no byte order ambiguity. However, many Windows tools insert it, and many Unix tools do not expect it. A JSON file with a UTF-8 BOM will fail to parse in Python with json.load(open('file.json')) because the JSON parser sees a non-whitespace character before the first { or [. The fix is to use encoding='utf-8-sig' in the open() call, or to explicitly strip the BOM: text.lstrip('\ufeff').
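The same failure occurs in Node.js, because JSON.parse also rejects a leading U+FEFF. A minimal sketch of the strip-then-parse approach follows; parseJsonStrippingBom is a hypothetical helper name.

```javascript
// Hypothetical helper: parse JSON text that may begin with a BOM (U+FEFF).
function parseJsonStrippingBom(text) {
  const clean = text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
  return JSON.parse(clean);
}

// Bytes of a UTF-8 file containing a BOM followed by {"a":1}
const fileBytes = Buffer.concat([
  Buffer.from([0xEF, 0xBB, 0xBF]),
  Buffer.from('{"a":1}', 'utf8'),
]);
const textWithBom = fileBytes.toString('utf8'); // the BOM survives as '\uFEFF'

try {
  JSON.parse(textWithBom);
} catch (err) {
  console.log(err.name); // SyntaxError
}
console.log(parseJsonStrippingBom(textWithBom)); // { a: 1 }
```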

Unpaired surrogates are another edge case. A valid UTF-16 string must pair each high surrogate (U+D800-U+DBFF) with a following low surrogate (U+DC00-U+DFFF). However, JavaScript allows strings to contain unpaired surrogates, which are called lone surrogates. Functions like encodeURIComponent throw a URIError when called with a string containing a lone surrogate. TextEncoder does not throw; it silently replaces each lone surrogate with U+FFFD, so the original data is lost. JSON.stringify escapes lone surrogates as \uDXXX sequences in engines that implement ES2019; older engines emitted them raw, producing output that cannot be encoded as valid UTF-8. If your code receives strings from external sources, validate that they contain no lone surrogates before processing.
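A validation step for lone surrogates might look like the sketch below. The helper names are illustrative. Newer runtimes also provide String.prototype.isWellFormed() and toWellFormed() for the same purpose; the regex version here is shown on the assumption that those may not be available everywhere.

```javascript
// A high surrogate not followed by a low one, or a low surrogate not preceded
// by a high one. No 'u' flag, so the pattern is matched against code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

// Replace each lone surrogate with U+FFFD so later encoding steps cannot throw.
function replaceLoneSurrogates(str) {
  return str.replace(new RegExp(LONE_SURROGATE.source, 'g'), '\uFFFD');
}

console.log(hasLoneSurrogate('ok \u{1F525}')); // false (properly paired)
console.log(hasLoneSurrogate('bad \uD83D'));   // true  (high surrogate alone)
console.log(encodeURIComponent(replaceLoneSurrogates('a\uDD25b'))); // a%EF%BF%BDb
```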

Windows Registry exports (.reg files) are saved as UTF-16 LE by default. When parsing .reg files programmatically in Node.js or Python, always specify the UTF-16 LE encoding. The same often applies to Windows INF driver files and many configuration files from Windows applications. These formats predate the ubiquity of UTF-8 and assume a UTF-16 reading environment. Windows event log files (.evtx) are different: they are a binary format, not UTF-16 text, and need a dedicated parser rather than a text decoder.

Node.js's http.IncomingMessage (used both for incoming server requests and for client responses) delivers the body as Buffer chunks, and calling toString() on them decodes as UTF-8 by default. The fetch API's Response.text() likewise always decodes as UTF-8. If the server returns UTF-16 encoded content, the client will decode it as UTF-8 and produce garbage. The solution is to use the arrayBuffer() method to get the raw bytes and then decode manually with TextDecoder: new TextDecoder('utf-16le').decode(await response.arrayBuffer()).
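One way to generalize the manual decoding step is to pick the TextDecoder label from the Content-Type header. The sketch below assumes the header value is trustworthy; decodeBody is a hypothetical helper, and the charset parsing is deliberately simple.

```javascript
// Hypothetical helper: decode raw body bytes using the charset named in a
// Content-Type header value, falling back to UTF-8 if it is absent or unsupported.
function decodeBody(bytes, contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(String(contentType || ''));
  let decoder;
  try {
    decoder = new TextDecoder(match ? match[1] : 'utf-8');
  } catch (err) {
    decoder = new TextDecoder('utf-8'); // unknown label: fall back to UTF-8
  }
  return decoder.decode(bytes); // TextDecoder strips a leading BOM by default
}

const utf16Body = Buffer.from('H\u00E9llo', 'utf16le');
console.log(decodeBody(utf16Body, 'text/plain; charset=utf-16le') === 'H\u00E9llo'); // true
console.log(decodeBody(Buffer.from('plain', 'utf8'), 'application/json')); // plain
```

With fetch, this might be called as decodeBody(await response.arrayBuffer(), response.headers.get('content-type')). When the header itself is wrong, hard-code the label you know to be correct instead.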

Common UTF-8 vs UTF-16 coding mistakes

Using Buffer.from(str, 'binary') for text encoding is a common mistake that produces silent corruption. The 'binary' encoding (same as 'latin1') treats each character code as a single byte. For characters with code points above 255, it truncates to the lower 8 bits: the character é (code point 233) round-trips correctly, but the Chinese character 中 (code point 20013, U+4E2D) is truncated to 0x2D (45, an ASCII hyphen), losing data. Use 'utf8' for Unicode text and 'binary' only for true binary data that was originally constructed as a binary string.
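The truncation is easy to reproduce. This short sketch encodes one character that fits in a byte and one that does not; the variable names are illustrative.

```javascript
const eAcuteBytes = Buffer.from('\u00E9', 'latin1'); // U+00E9 fits in one byte
const zhongBytes = Buffer.from('\u4E2D', 'latin1');  // U+4E2D does not

console.log(eAcuteBytes.toString('hex'));   // e9
console.log(zhongBytes.toString('hex'));    // 2d (0x4E2D & 0xFF)
console.log(zhongBytes.toString('latin1')); // - (the original character is gone)

// The same character encoded as UTF-8 keeps all of its information.
const zhongUtf8 = Buffer.from('\u4E2D', 'utf8');
console.log(zhongUtf8.toString('hex'));                 // e4b8ad
console.log(zhongUtf8.toString('utf8') === '\u4E2D');   // true
```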

Assuming that response.text() honors the server's charset is a persistent API integration mistake. The fetch spec defines response.text() as a UTF-8 decode of the body, regardless of the Content-Type charset. If the server sets Content-Type: application/json; charset=UTF-16 and the client calls response.text(), the UTF-16 bytes are decoded as UTF-8 and the content is garbled, exactly as if the charset were omitted. Read the bytes with response.arrayBuffer() and decode them with a TextDecoder for the declared charset instead. Always verify the Content-Type in the Network DevTools tab when debugging encoding issues with API responses.

Assuming that string.length equals the number of visible characters is the most common mistake in JavaScript Unicode handling. A string containing one emoji has length 2 (two UTF-16 code units in a surrogate pair). A string containing a flag emoji (two regional indicator characters) has length 4. A string containing a family emoji (four emoji joined by ZWJ) has length up to 11. Use Intl.Segmenter for accurate visible character counting, not length.
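The three ways of counting can be compared side by side. This sketch assumes a runtime that provides Intl.Segmenter (recent Node.js versions and current browsers); countGraphemes is an illustrative helper, and grapheme results depend on the Unicode data bundled with the runtime.

```javascript
// Count user-perceived characters (grapheme clusters) rather than code units.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

const fire = '\u{1F525}';
const flag = '\u{1F1EF}\u{1F1F5}'; // regional indicators J + P
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // 4 emoji + 3 ZWJ

// Columns: code units (length), code points (spread), graphemes (Intl.Segmenter)
console.log(fire.length, [...fire].length, countGraphemes(fire));       // 2 1 1
console.log(flag.length, [...flag].length, countGraphemes(flag));       // 4 2 1
console.log(family.length, [...family].length, countGraphemes(family)); // 11 7 1
```

Iterating by code point with for...of or the spread operator fixes surrogate pairs but still splits flags and ZWJ sequences, which is why the middle column differs from the last.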

Passing non-UTF-8 bytes to TextDecoder without specifying the correct label causes silent or fatal failures depending on the fatal flag. new TextDecoder() uses UTF-8 with fatal: false by default, replacing invalid sequences with U+FFFD. new TextDecoder('utf-16le') decodes as UTF-16 LE. Always specify the encoding label explicitly and use fatal: true in contexts where invalid sequences indicate a data integrity problem.
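The difference between the two modes can be seen by feeding UTF-16 LE bytes to a UTF-8 decoder. This is a minimal sketch; decodeStrict is a hypothetical wrapper name.

```javascript
const utf16Bytes = Uint8Array.from([0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00]); // BOM + 'Hi'

// Default (fatal: false): invalid UTF-8 sequences silently become U+FFFD.
const lenient = new TextDecoder().decode(utf16Bytes);
console.log(lenient.includes('\uFFFD')); // true

// fatal: true turns the same mismatch into a thrown TypeError.
function decodeStrict(bytes, label) {
  return new TextDecoder(label, { fatal: true }).decode(bytes);
}

try {
  decodeStrict(utf16Bytes, 'utf-8');
} catch (err) {
  console.log(err instanceof TypeError); // true
}

console.log(decodeStrict(utf16Bytes, 'utf-16le')); // Hi
```

Failing loudly at the decoding boundary is usually easier to debug than discovering replacement characters in stored data later.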

Best practices for encoding-safe code

Standardize on UTF-8 for all new code and convert legacy UTF-16 files during migration. UTF-8 is the dominant encoding on the web, in Linux systems, and in cloud environments. When receiving data from Windows systems or processing legacy files, add an explicit detection and conversion step at the boundary rather than spreading encoding awareness throughout the application.

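In Python 3 applications, set the standard stream encoding at process start: import sys; sys.stdin.reconfigure(encoding='utf-8'); sys.stdout.reconfigure(encoding='utf-8'). This affects only stdin and stdout, not the default used by open(); to change that default as well, enable UTF-8 mode with the PYTHONUTF8=1 environment variable or the -X utf8 option (Python 3.7+), and keep passing encoding= explicitly. For production services on Linux, this is usually unnecessary because the locale default is UTF-8, but on Windows servers running Python the default encoding may be cp1252 or another Windows encoding. Setting it explicitly prevents the inconsistency.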

For JavaScript applications that process text from multiple sources, create an encoding utility module that wraps TextDecoder and TextEncoder with appropriate error handling. Expose functions like decodeUtf8(buffer), decodeUtf16LE(buffer), encodeUtf8(str) that handle BOM detection and removal, lone surrogate validation, and consistent error reporting. This centralizes encoding logic and prevents it from being reimplemented inconsistently in different parts of the codebase.

When writing to files or network sockets that may be read by Windows applications, consider whether to include a UTF-8 BOM. Most modern Windows applications including Visual Studio Code handle UTF-8 without BOM correctly. Older applications may require the BOM to recognize the encoding. If you must interoperate with older Windows software, write the BOM explicitly: fs.writeFileSync(path, Buffer.concat([Buffer.from([0xEF, 0xBB, 0xBF]), Buffer.from(content, 'utf8')])).

Quick fix checklist

  • Inspect the first 4 bytes as hex to detect BOM: ef bb bf (UTF-8), ff fe (UTF-16 LE), fe ff (UTF-16 BE).
  • Pass encoding='utf-8' explicitly to Python open(); do not rely on system default.
  • Use Buffer.from(str, 'utf8') in Node.js for Unicode text; avoid 'binary' for non-ASCII content.
  • Use codePointAt instead of charCodeAt for characters outside the Basic Multilingual Plane.
  • Iterate over strings with for...of instead of index-based loops to handle surrogate pairs correctly.
  • Use TextDecoder with the encoding label 'utf-16le' for UTF-16 content in browser and Node.js.
  • Handle lone surrogates in JavaScript strings before calling encodeURIComponent or encoding the string to UTF-8.
  • Convert legacy UTF-16 files to UTF-8 at the data source using iconv rather than spreading encoding logic.

Frequently asked questions

How do I tell if a file is UTF-8 or UTF-16?

Read the first bytes: FF FE indicates UTF-16 LE, FE FF indicates UTF-16 BE, EF BB BF indicates UTF-8 with BOM. If there is no BOM, look for a pattern of null bytes: UTF-16 LE ASCII text has null bytes after every ASCII character (48 00 65 00 for 'He'). UTF-8 ASCII text has no null bytes. Use chardet in Python or a hex editor to confirm the encoding of ambiguous files.

What is mojibake and how does it look?

Mojibake is garbled text produced by reading bytes with the wrong encoding. UTF-8 Japanese text read as Windows-1252 shows sequences of Latin accented characters instead of the original script. UTF-16 LE text read as UTF-8 shows every ASCII character followed by a null byte, producing visible garbage. The pattern of corruption reveals the original and misapplied encodings to a trained eye.

Why does my Python script show UnicodeDecodeError for a UTF-16 file?

Python's open() uses the system default encoding, which is often UTF-8 on Linux and CP1252 on Windows. A UTF-16 file starts with FF FE (the BOM), which is invalid in UTF-8, so a UTF-8 default raises UnicodeDecodeError immediately. Under CP1252 the same bytes are valid (they decode as ÿþ), so the read may succeed and return text interleaved with NUL characters instead. Fix by passing encoding='utf-16' to open(), which auto-detects the byte order from the BOM, or encoding='utf-16-le' for explicit little-endian.

What does Buffer.from(str, 'binary') do in Node.js?

The 'binary' encoding (same as 'latin1') interprets each JavaScript character code as a single byte value. Characters with code points 0-255 round-trip correctly. Characters with code points above 255 are silently truncated to their lower 8 bits, producing data loss. Use 'binary' only for data that was originally created as a binary string, never for Unicode text. For text, always use 'utf8' or 'utf16le'.

Why does charCodeAt return wrong values for emoji?

Emoji and other characters outside the Basic Multilingual Plane (code points above U+FFFF) are stored in JavaScript strings as surrogate pairs: two 16-bit code units. charCodeAt returns the raw 16-bit code unit value, so it returns the high surrogate value (0xD800-0xDBFF range) for the first code unit of an emoji. Use codePointAt to get the actual Unicode code point, or use for...of which combines surrogate pairs automatically.

How do I convert a UTF-16 file to UTF-8 in Node.js?

Read the raw buffer, detect the BOM, decode as UTF-16, then write as UTF-8: const raw = fs.readFileSync('input.txt'); const isBOM = raw[0] === 0xFF && raw[1] === 0xFE; const text = isBOM ? raw.toString('utf16le').replace(/^\uFEFF/, '') : raw.toString('utf8'); fs.writeFileSync('output.txt', text, 'utf8'). Use iconv-lite for UTF-16 BE: require('iconv-lite').decode(raw, 'UTF-16BE').

Does the fetch API handle UTF-16 responses correctly?

The fetch API's response.text() always decodes the body as UTF-8; it does not read the charset from the Content-Type header. Even if the server sends Content-Type: text/html; charset=UTF-16, response.text() produces garbled output for a UTF-16 body. Use response.arrayBuffer() and new TextDecoder('utf-16le') to decode UTF-16 responses manually, choosing the label from the Content-Type header when it can be trusted.

What is the UTF-8 BOM and should I use it?

The UTF-8 BOM is the three bytes EF BB BF (the U+FEFF character encoded as UTF-8). It is optional and not recommended for UTF-8 because UTF-8 has no byte order ambiguity. However, some Windows tools require it to recognize a file as UTF-8. Use encoding='utf-8-sig' in Python to strip the BOM when reading. For new files, omit the BOM unless you specifically need Windows compatibility with older applications.

Last updated: 2026-05-06.