Description
Version
No response
Platform
**Summary:** releaseWritingBuf() in lib/internal/streams/fast-utf8-stream.js incorrectly calculates string slice positions when fs.write returns a byte count that splits a multi-byte UTF-8 character, causing silent data corruption (lost characters, lone surrogates in output).
**Description:**
The releaseWritingBuf function (line 896) converts bytes-written to character count using:
n = Buffer.from(writingBuf).subarray(0, n).toString().length;
When n bytes cut through a multi-byte character, .toString() decodes the incomplete UTF-8 sequence as U+FFFD (the replacement character). That replacement character is counted as a fully written UTF-16 code unit even though the original character was only partially written (and, for 4-byte characters, it has a different .length than the surrogate pair it stands in for), so .slice(n) cuts at the wrong position:
- 3-byte characters (CJK, most non-Latin): character silently dropped from output
- 4-byte characters (emoji, supplementary CJK): lone low surrogate left in remaining buffer, producing invalid UTF-8 on next write
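The conversion can be checked in isolation. The following is a minimal sketch that wraps the quoted expression in a helper; the name byteCountToCharCount is hypothetical and used only for illustration, and the inputs are the same strings used in the PoC.

```javascript
// Hypothetical wrapper around the conversion quoted above (not a Node.js API).
function byteCountToCharCount(str, n) {
  return Buffer.from(str).subarray(0, n).toString().length;
}

// 'abc' is 3 bytes, so byte 4 is the first byte of the 3-byte character.
const cjk = byteCountToCharCount('abc中def', 4);
// 'hello' is 5 bytes, so byte 7 is the second byte of the 4-byte emoji.
const emoji = byteCountToCharCount('hello🌍world', 7);

console.log(cjk);   // 4: the truncated sequence is counted as one code unit
console.log(emoji); // 6: one code unit for a character that needs two

// Slicing by these counts skips past data that was never fully written.
console.log(JSON.stringify('abc中def'.slice(cjk)));
console.log(JSON.stringify('hello🌍world'.slice(emoji)));
```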
The file was recently added (January 2026), ported from SonicBoom. It is used as the fast path for streaming UTF-8 output.
## Steps To Reproduce:
1. Save this as poc.js and run with node poc.js:
// Reproduces the releaseWritingBuf logic from lib/internal/streams/fast-utf8-stream.js lines 896-906
function releaseWritingBuf(writingBuf, len, n) {
  if (typeof writingBuf === 'string' && Buffer.byteLength(writingBuf) !== n) {
    n = Buffer.from(writingBuf).subarray(0, n).toString().length;
  }
  len = Math.max(len - n, 0);
  writingBuf = writingBuf.slice(n);
  return { writingBuf, len };
}
// Case 1: 4-byte emoji split at byte 7 — lone surrogate
const r1 = releaseWritingBuf("hello🌍world", 14, 7);
console.log("Case 1 - Emoji split:");
console.log(" Result:", JSON.stringify(r1.writingBuf));
console.log(" Expected:", JSON.stringify("🌍world"));
console.log(" First char code: 0x" + r1.writingBuf.charCodeAt(0).toString(16));
console.log(" Is lone surrogate:", r1.writingBuf.charCodeAt(0) >= 0xDC00 &&
  r1.writingBuf.charCodeAt(0) <= 0xDFFF);
// Case 2: 3-byte CJK char split at byte 4 — character lost
const r2 = releaseWritingBuf("abc中def", 9, 4);
console.log("\nCase 2 - CJK split:");
console.log(" Result:", JSON.stringify(r2.writingBuf));
console.log(" Expected:", JSON.stringify("中def"));
console.log(" Character 中 lost:", !r2.writingBuf.includes("中"));
2. Output shows:
Case 1 - Emoji split:
Result: "\udf0dworld" ← CORRUPTED (lone surrogate)
Expected: "🌍world"
First char code: 0xdf0d
Is lone surrogate: true
Case 2 - CJK split:
Result: "def" ← CHARACTER LOST
Expected: "中def"
Character 中 lost: true
3. The vulnerable code is at:
https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906
Partial fs.write returns are possible when writing to pipes near capacity, under disk I/O pressure, or to Docker log pipes (the exact use case mentioned in the file's comments on lines 69-70).
Additional finding: Line 240 has a typo from the SonicBoom port — this._asyncDrainScheduled should be this.#asyncDrainScheduled. All other 5 references use the private field correctly. The newListener handler is effectively dead code.
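To illustrate why the two spellings are not interchangeable, here is a minimal sketch using a hypothetical class (DrainFlagDemo and its method names are not from the Node.js source). It only shows that an underscore-prefixed property and a private field of the same name are separate storage; it makes no claim about how line 240 actually uses the value.

```javascript
// Hypothetical class for illustration only.
class DrainFlagDemo {
  #asyncDrainScheduled = true;

  resetWithUnderscore() {
    // Creates or updates an ordinary public property; the private field is untouched.
    this._asyncDrainScheduled = false;
  }

  resetWithPrivateField() {
    this.#asyncDrainScheduled = false;
  }

  get scheduled() {
    return this.#asyncDrainScheduled;
  }
}

const demo = new DrainFlagDemo();
demo.resetWithUnderscore();
console.log(demo.scheduled); // true: the private field was not reset
demo.resetWithPrivateField();
console.log(demo.scheduled); // false
```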
## Impact:
Silent data corruption in output files. Applications using Utf8Stream for logging with international characters (CJK, emoji, Cyrillic) can produce corrupted output when partial writes occur. 3-byte characters are silently lost (no error emitted). 4-byte characters produce invalid UTF-8 (lone surrogates). This is especially relevant for the Docker container logging use case the file was designed for.
## Supporting Material/References:
- Vulnerable function: releaseWritingBuf() at https://github.com/nodejs/node/blob/main/lib/internal/streams/fast-utf8-stream.js#L896-L906
- Typo (secondary): line 240, _asyncDrainScheduled vs #asyncDrainScheduled
- File derived from SonicBoom (https://github.com/pinojs/sonic-boom) — the original has a similar issue but uses _ prefix consistently
- The PoC script above is standalone and runs on any Node.js version
Subsystem
No response
What steps will reproduce the bug?
1. Save the PoC script from the description above as poc.js and run it with node poc.js.
2. The script reproduces the exact logic from lib/internal/streams/fast-utf8-stream.js (lines 896–906), specifically the releaseWritingBuf() function.
3. The script simulates partial fs.write() behavior where the number of bytes written splits a multi-byte UTF-8 character.
4. Observe the output:
   - When a 4-byte UTF-8 character (emoji) is split, a lone surrogate remains in the output.
   - When a 3-byte UTF-8 character (CJK) is split, the character is silently dropped.
5. This demonstrates incorrect string slicing caused by converting byte counts to character counts via .toString().length.
How often does it reproduce? Is there a required condition?
It reproduces deterministically whenever fs.write() (or an equivalent
internal write) returns a byte count that splits a multi-byte UTF-8
character.
The issue is not timing-dependent or race-based. The required condition
is a partial write that ends in the middle of a UTF-8 sequence.
This can occur when writing to pipes, sockets, or log streams under
backpressure (e.g. near-capacity pipes, Docker container logs, or heavy
I/O), which is a documented and expected behavior of fs.write().
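The triggering condition can be expressed as a small check. This is a sketch under the assumption that the pending data is a JavaScript string encoded as UTF-8; splitsMultiByteChar is a hypothetical helper name. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```javascript
// Hypothetical helper: does a write of n bytes end inside a multi-byte sequence?
function splitsMultiByteChar(str, n) {
  const bytes = Buffer.from(str, 'utf8');
  // If the first unwritten byte is a continuation byte (10xxxxxx),
  // the write stopped in the middle of a character.
  return n > 0 && n < bytes.length && (bytes[n] & 0xc0) === 0x80;
}

console.log(splitsMultiByteChar('abc中def', 3));     // false: boundary before the character
console.log(splitsMultiByteChar('abc中def', 4));     // true: one of three bytes written
console.log(splitsMultiByteChar('hello🌍world', 7)); // true: two of four bytes written
```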
What is the expected behavior? Why is that the expected behavior?
The output must always preserve valid UTF-8 and must not silently
corrupt data.
When a partial write ends in the middle of a multi-byte UTF-8 character,
the remaining bytes for that character should be preserved and written
in a subsequent write, rather than being dropped or converted into
replacement characters.
This is the expected behavior because:
- fs.write() is documented to return partial byte counts.
- UTF-8 stream handling must be byte-safe across writes.
- Producing lone surrogates or dropping characters violates UTF-8 correctness and results in silent data corruption.
The current behavior breaks UTF-8 invariants and can corrupt log output
in real-world streaming scenarios, such as container logging and
pipe-based streams, which this module explicitly targets.
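One way to get the byte-safe behavior described above is to keep the unwritten tail as bytes rather than converting the byte count back into a string index. The following is a minimal sketch of that idea, not the actual or proposed Node.js fix; remainingAfterPartialWrite is a hypothetical name, and the length bookkeeping done by the real function is omitted.

```javascript
// Hypothetical helper, not part of Node.js: return the unwritten tail as a Buffer.
function remainingAfterPartialWrite(pending, bytesWritten) {
  const bytes = Buffer.isBuffer(pending) ? pending : Buffer.from(pending, 'utf8');
  // Slicing bytes (not UTF-16 code units) keeps the rest of a split character intact.
  return bytes.subarray(bytesWritten);
}

const text = 'hello🌍world';
const full = Buffer.from(text, 'utf8');

// Simulate a partial write that stops after 7 bytes, inside the emoji.
const written = full.subarray(0, 7);
const rest = remainingAfterPartialWrite(text, 7);

// Writing `rest` next yields exactly the original byte sequence.
const roundTrip = Buffer.concat([written, rest]).toString('utf8');
console.log(roundTrip === text);
```

The trade-off in this sketch is that the pending data is encoded to a Buffer after a partial write instead of staying a string, which costs a copy but avoids any byte-to-character conversion.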
What do you see instead?
Instead of preserving valid UTF-8 output, the stream produces corrupted
results when a partial write splits a multi-byte character.
Specifically:
- For 3-byte UTF-8 characters (e.g. CJK), the character is silently dropped from the output with no error.
- For 4-byte UTF-8 characters (e.g. emoji), the remaining buffer starts with a lone UTF-16 surrogate, producing invalid UTF-8 on subsequent writes.
No error or warning is emitted, resulting in silent data corruption in
the output stream.
Additional information
No response