JSON Special Characters Encoding: Control Characters, Smart Quotes, and UTF-8 Fixes

Quick answer

💡JSON strings must be valid UTF-8 per RFC 8259, and control characters U+0000 through U+001F must be escaped as \n, \t, \r, \b, \f, or \uXXXX. Unescaped control characters, BOM (U+FEFF), smart quotes (U+201C, U+201D), and lone surrogate code points (U+D800-U+DFFF) all cause SyntaxError in strict parsers. Use JSON.stringify in JavaScript or json.dumps in Python to produce correctly escaped JSON output automatically.

Error symptoms

  • SyntaxError: Unexpected token in JSON at the position of an unescaped control character
  • json.decoder.JSONDecodeError: Invalid control character at line N column N
  • PostgreSQL error: invalid byte sequence for encoding UTF8 in JSON columns
  • jq parse error: Invalid string: control characters must be escaped
  • JSON appears valid in a text editor but fails in a strict parser
  • Smart quotes copied from Word or Google Docs cause Unexpected token errors

Common causes

  • Unescaped newlines, tabs, or carriage returns embedded inside JSON string values
  • UTF-8 BOM (U+FEFF) prepended to a JSON file by Windows text editors
  • Smart quotes U+201C and U+201D copied from rich text editors into JSON string values
  • NUL bytes U+0000 present in data extracted from binary file formats or databases
  • Lone surrogate code points U+D800 through U+DFFF in strings from certain Windows APIs
  • Python json.dumps using ensure_ascii=False allowing non-ASCII characters that downstream parsers reject

When it happens

  • Serializing form input that users pasted from Microsoft Word or Google Docs
  • Exporting data from PostgreSQL or MySQL where text columns contain binary data
  • Processing JSON files created on Windows that have UTF-8 BOM prepended
  • Consuming JSON from Windows APIs that use UTF-16 surrogate pairs internally
  • Storing multiline text values directly in JSON without escaping the newlines

Examples and fixes

Raw user input may contain control characters and Unicode artifacts that must be escaped before embedding in JSON.

Escaping control characters and removing BOM with JSON.stringify

❌ Wrong

// User pasted text from Word with smart quotes and a BOM
const userInput = '\uFEFF\u201CHello world\u201D\nSecond line';

// Direct string interpolation produces invalid JSON
const broken = `{ "message": "${userInput}" }`;

// JSON.parse will fail: Unexpected token
const parsed = JSON.parse(broken);

✅ Fixed

// User pasted text from Word with smart quotes and a BOM
const userInput = '\uFEFF\u201CHello world\u201D\nSecond line';

// Strip BOM and normalize quotes before stringifying
const cleaned = userInput
  .replace(/^\uFEFF/, '')      // Remove BOM
  .replace(/[\u2018\u2019]/g, "'")  // Normalize smart single quotes
  .replace(/[\u201C\u201D]/g, '"'); // Normalize smart double quotes

// JSON.stringify escapes remaining special chars automatically
const safe = JSON.stringify({ message: cleaned });
const parsed = JSON.parse(safe);

Embedding user input into a JSON string using template literals bypasses all escaping. The resulting string contains literal BOM, smart quotes, and unescaped newlines that strict JSON parsers reject. JSON.stringify automatically escapes all control characters including \n, \t, and NUL bytes to their \uXXXX equivalents. The pre-processing step replaces typographic quotes with standard ASCII quotes because JSON.stringify does not normalize Unicode quotation marks — they are valid UTF-8 characters and get passed through unmodified. The BOM must be stripped because most parsers treat it as an unexpected token at the start of the document.

Python's default json.dumps behavior may differ from what downstream consumers expect for non-ASCII characters.

Handling ensure_ascii in Python json.dumps

❌ Wrong

import json

# User data with emoji and accented characters
data = {'name': 'Renee', 'bio': 'Developer. Builder.'}

# ensure_ascii=False writes raw UTF-8 bytes
json_str = json.dumps(data, ensure_ascii=False)
# Result: {"name": "Renee", "bio": "Developer. Builder."}

# Some parsers reject raw non-ASCII in certain transport contexts
# PostgreSQL JSON column may reject bytes above 0x7F in some configs

✅ Fixed

import json

data = {'name': 'Renee', 'bio': 'Developer. Builder.'}

# Default ensure_ascii=True escapes all non-ASCII safely
json_str = json.dumps(data)
# Result: {"name": "Ren\u00e9e", "bio": "Developer. Builder."}

# Or keep UTF-8 but validate the encoding explicitly
json_utf8 = json.dumps(data, ensure_ascii=False).encode('utf-8')
# Verify no invalid sequences before storing or transmitting
assert json_utf8.decode('utf-8')  # Raises UnicodeDecodeError if invalid

Python's json.dumps defaults to ensure_ascii=True, which escapes every non-ASCII character as a \uXXXX escape sequence. The resulting string contains only ASCII bytes and is safe for any transport layer, database column, or parser. Setting ensure_ascii=False allows raw UTF-8 characters to pass through, which is more human-readable but depends on all downstream consumers handling UTF-8 correctly. PostgreSQL JSON columns, HTTP headers, and some older parsers can mishandle raw non-ASCII bytes. When in doubt, prefer ensure_ascii=True for data that crosses system boundaries, and ensure_ascii=False only for data that stays within a well-controlled Python environment.

RFC 8259 rules for JSON strings

RFC 8259, the JSON specification, defines strict requirements for string encoding that go beyond simply being valid UTF-8. Every JSON string must be enclosed in double quotation marks. Inside the string, certain characters must always be escaped: double quote as \", reverse solidus (backslash) as \\, and all control characters in the range U+0000 through U+001F. Control characters in this range include the familiar ones: newline (U+000A) as \n, horizontal tab (U+0009) as \t, carriage return (U+000D) as \r, form feed (U+000C) as \f, and backspace (U+0008) as \b. Any other control character in this range must be escaped as \uXXXX.

An unescaped newline inside a JSON string is not whitespace that parsers tolerate — it is a literal violation of the grammar. Some lenient parsers, particularly older browser-based ones, silently accept unescaped newlines inside strings. Strict parsers, which include most server-side parsers and all current ECMAScript-compliant parsers, throw a SyntaxError immediately. Code that works in a browser but fails in a Node.js server is often caused by this difference in strictness.

BOM, the Unicode byte order mark U+FEFF, is a zero-width non-breaking space that Windows text editors sometimes prepend to UTF-8 files to identify the file encoding. JSON parsers are not required to tolerate a BOM, and most do not. A JSON file that starts with a BOM will fail with Unexpected token on the BOM character before the parser even sees the opening brace or bracket. The BOM is invisible in most text editors but visible in hex dumps and in parser error messages that display the character code.

Smart quotes — the typographic left double quotation mark U+201C and right double quotation mark U+201D — are valid UTF-8 characters and can appear inside JSON string values where they are treated as regular text. The problem arises when a developer copies JSON from a document editor that automatically converts straight double quotes to smart quotes. The smart quotes replace the JSON structural delimiter characters, breaking the string boundaries. The parser sees a string that never closes because the expected U+0022 closing delimiter is replaced by U+201D.

Surrogate code points U+D800 through U+DFFF are reserved in Unicode for representing characters outside the Basic Multilingual Plane using pairs in UTF-16 encoding. RFC 8259 states that lone surrogates are not permitted in JSON strings. JavaScript strings internally use UTF-16 and can contain lone surrogates, but JSON.stringify will encode them as \uXXXX escape sequences. Parsers that receive a \uD800 escape that is not followed by a valid low surrogate \uDC00-\uDFFF may reject the string as invalid.

Identifying the offending character precisely

When a JSON parser reports a character position in its error message, use that position to identify the exact problematic character. In Node.js, JSON.parse error messages include the character position as JSON at position N. Convert this position to a line and column number by splitting the string on newlines and counting characters. The Node.js REPL allows you to print the character code at a specific position: console.log(jsonString.charCodeAt(N)) prints the decimal Unicode code point of the character at position N.

For files, xxd or od on Linux and macOS provide hex dumps that show every byte including invisible characters. Run xxd file.json | head -5 to see the first few lines of the hex dump. A BOM appears as the bytes EF BB BF at the very start of the file. NUL bytes appear as 00 in the hex dump. Control characters appear in the range 00 through 1F. This is often more reliable than a text editor view because invisible characters are explicitly shown as hex codes.

Python's json module includes the error position in the JSONDecodeError exception. The lineno and colno attributes give the exact location. You can extract the problematic character with raw_json.splitlines()[lineno-1][colno-1] to see exactly what character caused the failure. For control characters, use ord() to get the code point: ord(char) will show values like 10 for newline, 13 for carriage return, or 0 for NUL.

For jq, run jq '.' file.json and inspect the error message. jq reports Invalid string: control characters must be escaped along with a byte offset. Use xxd to inspect that offset in the file. For files where even loading into a string is risky, jq --stream '.' file.json processes the file incrementally and reports the exact error position in the streaming output.

The /tools/json-validator tool highlights syntax errors with character positions. Paste a sample of the problematic JSON and the validator will mark the exact location of the first encoding violation. For files with BOM, the validator will show an error at position 0 or 1 because the BOM is the first character in the document. Removing the BOM and re-pasting typically resolves this class of error immediately.

Fixing encoding violations in JavaScript and Python

The most reliable way to produce correctly encoded JSON in JavaScript is to always use JSON.stringify rather than manual string construction. JSON.stringify automatically escapes all control characters (U+0000 through U+001F), double quotes, and backslashes in string values. It does not normalize smart quotes or remove BOM from input strings because those are valid Unicode characters — you must strip or normalize them before passing the data to JSON.stringify.

To remove BOM from a string in JavaScript, use string.replace(/^\uFEFF/, ''). This replaces the BOM if it appears at the very start of the string. To normalize smart quotes, replace the four typographic quote characters: replace(/[\u2018\u2019]/g, "'") for single smart quotes and replace(/[\u201C\u201D]/g, '"') for double smart quotes. Apply these transformations to string values before calling JSON.stringify, not to the entire JSON output string, because modifying the output string could replace legitimate structural double quote characters.

JSON.stringify also handles NUL bytes (U+0000) correctly by escaping them as \u0000. This is important for data sourced from binary files or databases that may contain embedded NUL bytes. However, note that PostgreSQL's JSON and JSONB column types reject the \u0000 escape sequence, treating it as an invalid byte sequence. When storing JSON in PostgreSQL, strip NUL bytes from string values before encoding: value.replace(/\x00/g, '') removes all NUL characters.

In Python, the default json.dumps behavior with ensure_ascii=True escapes all non-ASCII characters as \uXXXX sequences, producing output that is safe for any transport layer or storage system. If you need to preserve actual UTF-8 characters in the output for readability, use ensure_ascii=False but explicitly encode the result as UTF-8 bytes with .encode('utf-8') and verify the encoding is valid before transmitting. For removing control characters from input strings before JSON encoding, use a regular expression: re.sub(r'[\x00-\x1f]', '', value) strips all control characters except those you want to preserve.

For jq, the tool handles UTF-8 correctly when given valid UTF-8 input. If input files contain invalid UTF-8 sequences, jq will typically produce an error rather than silently mangling the data. Pre-process files with iconv -f utf-8 -t utf-8 -c input.json > cleaned.json to strip invalid byte sequences before passing them to jq. The -c flag causes iconv to silently discard characters that cannot be represented in the target encoding.

Unusual encoding issues that appear in production

Windows-1252 encoded files are frequently mistaken for UTF-8 because the first 128 characters of Windows-1252 are identical to ASCII, which is also identical to the first 128 characters of UTF-8. Files that contain only ASCII characters parse correctly regardless of their declared encoding. Problems appear when the file contains characters in the range 0x80-0xFF, where Windows-1252 defines characters like the Euro sign (0x80) that are assigned to different code points in UTF-8. A file declared as UTF-8 but actually encoded as Windows-1252 will cause a parser error at the first non-ASCII character. Use file --mime-encoding filename to detect the actual encoding on Linux and macOS.

Databases that store text in Latin-1 or Windows-1252 encoding internally will export data that requires re-encoding before use in JSON. When extracting data via a database driver, configure the driver to return strings as Python unicode objects or JavaScript strings rather than byte strings. The driver's character set configuration determines whether the conversion happens automatically. Pycopg2 for PostgreSQL returns text columns as Python unicode strings by default in Python 3, but some older driver configurations return bytes.

The \/ forward slash escape is a valid but optional JSON escape sequence. JSON.stringify in JavaScript does not escape forward slashes by default. This matters when embedding JSON inside HTML script tags, where the sequence produces a syntax error at the JavaScript level. Some serializers produce \/ for all forward slashes defensively. If you receive JSON where forward slashes are escaped, this is valid JSON and parsers handle it correctly — it is not an encoding error.

Surrogate pair encoding becomes relevant when processing data from Windows APIs or systems that use UTF-16 internally. A string containing an emoji like the pile of poo character U+1F4A9 is represented in UTF-16 as a surrogate pair: U+D83D followed by U+DCA9. In JavaScript, this string appears as two characters with length 2. JSON.stringify correctly encodes this as \uD83D\uDCA9 when these code points appear as a valid surrogate pair. Problems occur when only one half of a surrogate pair appears in the string without its partner, which is an invalid state in Unicode and produces invalid JSON.

MongoDB's BSON format supports binary data types that have no equivalent in JSON. When exporting MongoDB documents to JSON using mongoexport, binary fields are represented as base64-encoded strings with a type marker. This encoding is specific to MongoDB's extended JSON format and may not be recognized by parsers expecting standard JSON. Ensure that any consumer of MongoDB JSON exports is aware of the extended JSON format and uses a compatible parser.

Frequent encoding mistakes in JSON workflows

Building JSON strings by concatenation or template literals instead of using JSON.stringify is the root cause of most encoding errors in JavaScript codebases. When a developer writes JSON as { "name": "${userName}" } and userName contains a double quote, backslash, or newline, the resulting string is invalid JSON. This works during development when test data contains only ASCII letters but fails in production with real user input that includes apostrophes, quotes, or multiline text.

Using ensure_ascii=False in Python without validating the output encoding for all downstream consumers is a common source of production incidents. The output looks correct in development when testing with ASCII or common Latin characters, but fails when users submit data with characters in ranges that specific downstream systems cannot handle. The safe default is ensure_ascii=True for any JSON that crosses system boundaries, reserving ensure_ascii=False only for internal data that stays within a well-controlled Python process.

Failing to strip BOM when reading JSON files created on Windows is a particularly frustrating issue because the BOM is invisible in most text editors. A developer opens the file, sees valid JSON, closes the file, and cannot understand why the parser rejects it. The diagnostic is to open the file in a hex editor or run xxd file.json | head -1 to check for the EF BB BF BOM bytes. Add a BOM-stripping step to any file-reading code that may encounter Windows-created JSON files.

Applying character escaping to the entire JSON string output rather than to individual string values is a mistake that corrupts the JSON structure. Escaping double quotes in the entire output string turns structural JSON delimiters into escape sequences, making the output unparseable. Escaping must happen to string values before they are passed to the serializer, or the serializer must handle it automatically — which JSON.stringify and json.dumps both do when used correctly.

Ignoring parser warnings about encoding and suppressing exceptions from json.loads or JSON.parse rather than fixing the underlying data. When a parser raises a character encoding error, the correct response is to find the source of the invalid data and fix it at that point in the pipeline, not to catch the exception and return empty data silently. Silent failures in JSON parsing create hard-to-diagnose data corruption where partial data appears correct until a specific code path that depends on the missing fields fails much later.

Encoding hygiene for reliable JSON production

Always construct JSON using a serializer rather than string manipulation. In JavaScript, every value that goes into JSON must pass through JSON.stringify. In Python, every value must pass through json.dumps. These functions know the JSON specification and handle escaping correctly for all valid inputs. The only exception is when embedding previously serialized JSON verbatim — in that case, parse first to validate, then re-serialize with your serializer to normalize the encoding.

Validate the encoding of all externally sourced strings before they enter your JSON pipeline. Write a small sanitization function that receives a raw string and strips characters your system cannot handle: BOM at the start, NUL bytes for PostgreSQL storage, and optionally control characters other than standard whitespace. Apply this sanitization at the boundary where external data enters your system, not at every point where the data is used. One clean entry point is easier to maintain than scattered sanitization logic.

For systems that store JSON in PostgreSQL, always strip NUL bytes (U+0000) from string values before encoding. PostgreSQL rejects both raw NUL bytes and the \u0000 JSON escape sequence in text and JSON column types. Add a database integration test that writes a JSON payload containing a NUL byte and verifies it is handled correctly. This test prevents silent data truncation when NUL bytes in user input reach the database.

When building APIs that consume JSON from mobile clients, account for emoji and supplementary Unicode characters. Mobile keyboard input frequently produces characters outside the Basic Multilingual Plane, represented as surrogate pairs in UTF-16 environments. Ensure your JSON parser handles these correctly and your storage layer supports the full Unicode character range. MySQL databases created with the utf8 character set (not utf8mb4) silently truncate strings at supplementary characters, causing data loss without error.

Implement a JSON encoding test suite that covers: a string with each category of control character, a string with BOM, a string with smart quotes, a string with emoji, a string with supplementary Unicode characters, a string with embedded null bytes, and a string with backslashes and double quotes. Run this test suite against every JSON encoder, decoder, and storage system in your infrastructure. Passing all tests confirms that the full range of Unicode input is handled correctly end to end. Validate individual JSON structures with /tools/json-validator when integrating new data sources to catch encoding issues before they reach production.

Quick fix checklist

  • Always use JSON.stringify in JavaScript rather than string concatenation for JSON
  • Strip UTF-8 BOM from file content with string.replace(/^\uFEFF/, '') before parsing
  • Normalize smart quotes to straight ASCII quotes before passing to serializers
  • Use json.dumps with ensure_ascii=True (default) for data crossing system boundaries
  • Strip NUL bytes from string values before storing JSON in PostgreSQL columns
  • Pre-process files from unknown sources with iconv -f utf-8 -t utf-8 -c before using jq
  • Check character codes at error positions with charCodeAt() or ord() to identify invisible characters
  • Validate encoding with /tools/json-validator when integrating new external data sources

Related guides

Frequently asked questions

Why does my JSON parse in the browser but fail on the server?

Browsers implement lenient JSON parsers that tolerate some violations of RFC 8259, such as unescaped newlines inside strings. Server-side parsers in Node.js, Python, and Java follow the specification strictly. Code that constructs JSON by string concatenation rather than using JSON.stringify often works in browsers but fails in strict parsers. Always use JSON.stringify to produce JSON and test with the server-side parser, not only the browser console.

What is the UTF-8 BOM and why does it break JSON?

The UTF-8 BOM is the three-byte sequence EF BB BF (U+FEFF) that some Windows editors prepend to files to identify the encoding. JSON parsers are not required to handle BOM and most do not, treating the BOM character as an unexpected token before the opening brace. The BOM is invisible in most text editors, making this error confusing to diagnose. Strip BOM with string.replace(/^\uFEFF/, '') in JavaScript or open files with utf-8-sig encoding in Python.

Does JSON.stringify handle all special characters automatically?

JSON.stringify correctly escapes all control characters (U+0000-U+001F), double quotes, and backslashes in string values. It passes through valid UTF-8 characters including emoji and accented letters unchanged. It does not normalize smart quotes (U+201C, U+201D) or remove BOM from input strings because these are valid Unicode characters. Strip or normalize typographic characters before calling JSON.stringify if they might appear in user input.

What does ensure_ascii=True do in Python's json.dumps?

ensure_ascii=True (the default) causes json.dumps to escape every non-ASCII character as a \uXXXX escape sequence. The output contains only printable ASCII bytes and is safe for any transport layer, database, or parser. Setting ensure_ascii=False allows raw UTF-8 characters to pass through unescaped, which is more readable but depends on all consumers handling UTF-8 correctly. Use ensure_ascii=True for data crossing system boundaries.

Why does PostgreSQL reject my JSON with valid-looking strings?

PostgreSQL's JSON and JSONB column types reject the \u0000 escape sequence (NUL byte) and raw NUL bytes in string values. This is a PostgreSQL-specific restriction beyond the JSON specification. Strip NUL bytes from string values before encoding: in Python, use value.replace('\x00', ''), and in JavaScript, value.replace(/\x00/g, ''). PostgreSQL JSONB also validates UTF-8 strictly and rejects invalid byte sequences that some parsers might tolerate.

What are smart quotes and why do they break JSON?

Smart quotes are the typographic characters left double quotation mark U+201C and right double quotation mark U+201D, used in publishing and word processors. When developers copy JSON examples from Google Docs, Microsoft Word, or blog posts, these characters may replace the standard straight double quote U+0022 that JSON requires as a string delimiter. The parser reads a string that never properly closes, causing a SyntaxError. Replace smart quotes with standard ASCII quotes before parsing.

How should I handle control characters in user-submitted text?

Strip control characters from user input before embedding in JSON, or allow only safe whitespace. Remove characters in the range U+0000-U+001F except for newline, tab, and carriage return if your schema supports multiline text. In JavaScript: value.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, '') removes control characters while preserving tab and newline. Then pass the cleaned string to JSON.stringify which escapes the remaining control characters correctly.

Can jq handle files with invalid UTF-8 encoding?

jq requires valid UTF-8 input and will report an error when it encounters invalid byte sequences. Pre-process files with iconv -f utf-8 -t utf-8 -c input.json > cleaned.json to strip invalid bytes before passing to jq. The -c flag silently removes characters that cannot be represented in the target encoding. For files from unknown sources, running this conversion first prevents jq from failing partway through a large file. Validate cleaned output with /tools/json-validator before further processing.

How do surrogate pairs appear in JSON and when do they cause errors?

Surrogate pairs (U+D800-U+DFFF) are UTF-16 encoding pairs for characters outside the Basic Multilingual Plane like many emoji. In JSON, they appear as \uD83D\uDCA9 pairs. This is valid if both halves are present and correctly paired. A lone surrogate like \uD83D without its partner \uDCA9 is invalid per RFC 8259 and causes parse errors in strict parsers. Windows APIs often produce lone surrogates when converting from UTF-16 strings that contain isolated surrogate code points.

All tools run in your browser. Your data never leaves your device. Last updated: 2026-05-06.