Description
When using Gemini models for structured extraction tasks, we're observing random Chinese characters appearing at string boundaries in generated output. This appears to be a BPE tokenization artifact where partial CJK byte sequences leak into the output when generation terminates mid-token.
Affected Model
- gemini-3-flash-preview (via Vertex AI)
- Possibly affects other Gemini models (untested)
Observed Characters
| Character | Unicode | Frequency |
|---|---|---|
| 转 | U+8F6C | Most common |
| 极 | U+6781 | Second most common |
| 密 | U+5BC6 | Occasional |
| 待遇 | U+5F85 U+9047 | Occasional (2-char sequence) |
| 轉 | U+8F49 | Rare (traditional variant of 转) |
Position Analysis
From a corpus of ~31,000 extracted strings:
- 96% appear at END of strings
- 4% appear at START of strings
- Characters have no semantic relationship to the content
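For reference, the breakdown above was computed with a check along these lines (a simplified sketch; `extracted_strings` is a hypothetical stand-in for our corpus):

```python
import re

# Any character in the CJK Unified Ideographs block (covers 转, 极, 密, 待, 遇, 轉).
CJK = r"[\u4e00-\u9fff]"

def classify_positions(extracted_strings):
    """Count strings with a stray CJK character at the start vs. at the end."""
    leading = sum(1 for s in extracted_strings if re.match(CJK, s))
    trailing = sum(1 for s in extracted_strings if re.search(CJK + r"$", s))
    return leading, trailing
```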
Impact
In a production extraction pipeline processing ~31,000 entities:
- 1,323 strings affected (~4.2% of output)
- Corrupted strings required post-processing cleanup
- 196 entities had to be merged after sanitization (corrupted duplicates of clean entities)
Example Outputs
Input context: "The CEO announced quarterly results"
Expected: "quarterly results"
Actual: "quarterly results转"
Input context: "Environmental policy framework"
Expected: "Environmental Land Management Schemes (ELMs)"
Actual: "Environmental Land Management Schemes (ELMs)转"
Input context: "International organization"
Expected: "International Monetary Fund"
Actual: "International Monetary Fund转"
Workaround
Per #1238, setting thinkingBudget to a non-zero value (even 1) reportedly reduces this issue. We've also implemented client-side sanitization to strip trailing CJK characters as a defensive measure.
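For anyone hitting the same problem, here is roughly what both mitigations look like with google-genai 1.59.0 (a sketch; the regex assumes the stray characters fall in the CJK Unified Ideographs block, as in the table above):

```python
import re
from google.genai import types

# Workaround 1: request a non-zero thinking budget (per #1238), passed as the
# `config` argument to client.models.generate_content(...).
config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=1),
    response_mime_type="application/json",
)

# Workaround 2: defensive sanitization of extracted strings.
_BOUNDARY_CJK = re.compile(r"^[\u4e00-\u9fff]+|[\u4e00-\u9fff]+$")

def strip_boundary_cjk(value: str) -> str:
    """Strip CJK characters sitting at the start or end of a string."""
    return _BOUNDARY_CJK.sub("", value).strip()

assert strip_boundary_cjk("International Monetary Fund转") == "International Monetary Fund"
```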
Environment
- google-genai: 1.59.0
- google-adk: 1.22.1
- Python: 3.12
- Region: Vertex AI (europe-west2)
Related Issues
- Any recent updates for Gemini structured output with special characters? #1238 - Reports similar CJK character corruption; suggests the thinkingBudget workaround
Suggested Investigation
The pattern suggests this is a tokenizer boundary issue where:
- BPE tokens for CJK characters are multi-byte sequences
- When generation stops mid-token, partial bytes are decoded
- These partial sequences happen to form valid CJK codepoints
This would explain why specific characters repeat (they're the "default" decoding of common partial byte patterns).
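As a quick illustration of the multi-byte point (not proof of the root cause), every observed character is a three-byte UTF-8 sequence, so a byte-level/BPE vocabulary can split it mid-character:

```python
# UTF-8 encodings of the observed characters; all are 3-byte sequences,
# so byte-level tokenization can split them across token boundaries.
for ch in "转极密待遇轉":
    print(f"U+{ord(ch):04X} {ch} -> {ch.encode('utf-8').hex(' ')}")

# First line of output:
# U+8F6C 转 -> e8 bd ac
```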