Gemini outputs spurious CJK characters at string boundaries (tokenization artifact) #1969

When using Gemini models for structured extraction tasks, we're observing random Chinese characters appearing at string boundaries in generated output. This appears to be a BPE tokenization artifact where partial CJK byte sequences leak into the output when generation terminates mid-token.

### Affected Model
- `gemini-3-flash-preview` (via Vertex AI)
- Possibly affects other Gemini models (untested)

### Observed Characters

| Character | Unicode | Frequency |
|-----------|---------|-----------|
| 转 | U+8F6C | Most common |
| 极 | U+6781 | Second most common |
| 密 | U+5BC6 | Occasional |
| 待遇 | U+5F85 U+9047 | Occasional (2-char sequence) |
| 轉 | U+8F49 | Rare (traditional variant of 转) |

### Position Analysis

Among the affected strings in a corpus of ~31,000 extracted strings:
- **96%** have the spurious characters at the **end** of the string
- 4% have them at the **start** of the string
- The characters have no semantic relationship to the content
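
For anyone wanting to reproduce the position breakdown on their own output, here is a minimal sketch of how such a count could be computed. It assumes "CJK" means the CJK Unified Ideographs block (U+4E00 to U+9FFF), which covers every character in the table above; `classify_position` and `position_counts` are hypothetical helper names, not part of any SDK.

```python
import re
from collections import Counter

# Assumption: only the CJK Unified Ideographs block (U+4E00-U+9FFF) is checked.
# All characters reported in this issue fall inside that range.
CJK = r"[\u4e00-\u9fff]"
LEADING = re.compile(rf"^{CJK}+")
TRAILING = re.compile(rf"{CJK}+$")


def classify_position(text: str) -> str:
    """Return 'end', 'start', 'both', or 'none' for one extracted string."""
    at_start = bool(LEADING.search(text))
    at_end = bool(TRAILING.search(text))
    if at_start and at_end:
        return "both"
    if at_end:
        return "end"
    if at_start:
        return "start"
    return "none"


def position_counts(strings):
    """Tally the position class of every string in an iterable."""
    return Counter(classify_position(s) for s in strings)
```

Note that this flags any boundary CJK character, so it is only meaningful on a corpus that is expected to contain no CJK text at all.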

### Impact

In a production extraction pipeline processing ~31,000 entities:
- **1,323 strings affected** (~4.2% of output)
- Corrupted strings required post-processing cleanup
- 196 entities had to be merged after sanitization (corrupted duplicates of clean entities)
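
The merge step can be sketched roughly as follows: group the raw entity names by their sanitized form, so that a corrupted name and its clean counterpart land in the same bucket. This is an illustration under assumptions, not the pipeline's actual code; `merge_after_sanitization` is a hypothetical name, and the sanitization here is deliberately simplified to stripping boundary characters in U+4E00 to U+9FFF.

```python
import re
from collections import defaultdict

# Assumption: corruption only ever appears as CJK Unified Ideographs
# (U+4E00-U+9FFF) at the very start or very end of a name.
BOUNDARY_CJK = re.compile(r"^[\u4e00-\u9fff]+|[\u4e00-\u9fff]+$")


def merge_after_sanitization(names):
    """Map each sanitized name to the list of raw names that collapse to it."""
    groups = defaultdict(list)
    for raw in names:
        clean = BOUNDARY_CJK.sub("", raw)
        groups[clean].append(raw)
    return dict(groups)


if __name__ == "__main__":
    raw_names = [
        "International Monetary Fund",
        "International Monetary Fund转",
        "quarterly results极",
    ]
    for clean, members in merge_after_sanitization(raw_names).items():
        print(clean, "<-", members)
```

Any bucket with more than one member is a candidate for merging into a single entity.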

### Example Outputs

```
Input context: "The CEO announced quarterly results"
Expected: "quarterly results"
Actual: "quarterly results转"

Input context: "Environmental policy framework"
Expected: "Environmental Land Management Schemes (ELMs)"
Actual: "Environmental Land Management Schemes (ELMs)转"

Input context: "International organization"
Expected: "International Monetary Fund"
Actual: "International Monetary Fund转"
```

### Workaround

Per #1238, setting `thinkingBudget` to a non-zero value (even `1`) reportedly reduces this issue. We've also implemented client-side sanitization to strip trailing CJK characters as a defensive measure.
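
A minimal sketch of what such client-side sanitization might look like is below. It is not our exact implementation: `strip_boundary_cjk` is a hypothetical name, the character range (U+4E00 to U+9FFF) is an assumption that happens to cover the characters listed above, and the guard for genuine CJK text is a heuristic.

```python
import re

# Assumption: spurious characters are CJK Unified Ideographs (U+4E00-U+9FFF).
_CJK = r"\u4e00-\u9fff"
_BOUNDARY_CJK = re.compile(rf"^[{_CJK}]+|[{_CJK}]+$")
_ANY_CJK = re.compile(rf"[{_CJK}]")


def strip_boundary_cjk(text: str) -> str:
    """Remove runs of CJK characters at the start or end of a string.

    Heuristic guard: if the string is entirely CJK, or CJK still remains in
    the middle after stripping, it is treated as genuine CJK text and
    returned unchanged.
    """
    stripped = _BOUNDARY_CJK.sub("", text)
    if not stripped.strip() or _ANY_CJK.search(stripped):
        return text
    return stripped


if __name__ == "__main__":
    for sample in ["quarterly results转", "International Monetary Fund"]:
        print(repr(strip_boundary_cjk(sample)))
```

The guard keeps the function from damaging strings that are legitimately written in Chinese or Japanese, at the cost of missing corruption in those strings.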

### Environment

- `google-genai`: 1.59.0
- `google-adk`: 1.22.1
- Python: 3.12
- Region: Vertex AI (europe-west2)

### Related Issues

- #1238 - Reports similar CJK character corruption, suggests `thinkingBudget` workaround

### Suggested Investigation

The pattern suggests this is a tokenizer boundary issue where:
1. BPE tokens for CJK characters are multi-byte sequences
2. When generation stops mid-token, partial bytes are decoded
3. These partial sequences happen to form valid CJK codepoints

This would explain why specific characters repeat (they're the "default" decoding of common partial byte patterns).
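
One quick check that may help when evaluating steps 2 and 3: the sketch below prints the UTF-8 bytes of each observed character and tests whether any truncated prefix of those bytes decodes on its own. It says nothing about Gemini's actual tokenizer, which is not public here; `truncated_prefixes_decode` is a hypothetical helper. A proper prefix of a multi-byte UTF-8 sequence is never valid UTF-8 by itself, so a strict decoder rejects it and a lossy decoder emits U+FFFD rather than a different ideograph. If that holds for the serving stack too, the observed characters may be whole tokens rather than byte fragments.

```python
# Characters reported in the table above (待遇 split into its two codepoints).
OBSERVED = ["转", "极", "密", "待", "遇", "轉"]


def truncated_prefixes_decode(ch: str) -> bool:
    """Return True if any proper prefix of ch's UTF-8 bytes decodes strictly."""
    raw = ch.encode("utf-8")
    for cut in range(1, len(raw)):
        try:
            raw[:cut].decode("utf-8")
        except UnicodeDecodeError:
            continue
        return True
    return False


if __name__ == "__main__":
    for ch in OBSERVED:
        raw = ch.encode("utf-8")
        # Lossy decoding of a truncated sequence, as a tolerant decoder might do.
        lossy = raw[:-1].decode("utf-8", errors="replace")
        print(
            f"U+{ord(ch):04X} bytes={raw.hex(' ')} "
            f"prefix_decodes={truncated_prefixes_decode(ch)} lossy={lossy!r}"
        )
```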
