Gemini outputs spurious CJK characters at string boundaries (tokenization artifact) #1969

@lmwilki

Description

When using Gemini models for structured extraction tasks, we're observing random Chinese characters appearing at string boundaries in generated output. This appears to be a BPE tokenization artifact where partial CJK byte sequences leak into the output when generation terminates mid-token.

Affected Model

  • gemini-3-flash-preview (via Vertex AI)
  • Possibly affects other Gemini models (untested)

Observed Characters

Character    Unicode          Frequency
转           U+8F6C           Most common
极           U+6781           Second most common
密           U+5BC6           Occasional
待遇         U+5F85 U+9047    Occasional (2-char sequence)
轉           U+8F49           Rare (traditional variant of 转)

Position Analysis

From a corpus of ~31,000 extracted strings:

  • 96% appear at END of strings
  • 4% appear at START of strings
  • Characters have no semantic relationship to the content
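For anyone wanting to reproduce this breakdown on their own output, here is a minimal sketch of how the positions could be tallied. It is not the script used for the numbers above. It assumes the extracted strings are expected to contain no CJK text at all, and it only checks the CJK Unified Ideographs block (U+4E00 to U+9FFF), which covers every character listed in this report. The names `is_cjk`, `classify` and `position_counts` are illustrative.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """True if ch is in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def classify(text: str) -> str:
    """Label where stray CJK characters sit in a string expected to be non-CJK."""
    if not any(is_cjk(c) for c in text):
        return "clean"
    # A string with CJK at both ends is counted as "end" (the more common case).
    if is_cjk(text[-1]):
        return "end"
    if is_cjk(text[0]):
        return "start"
    return "middle"


def position_counts(strings) -> Counter:
    """Tally the classification of every string in an iterable."""
    return Counter(classify(s) for s in strings)


if __name__ == "__main__":
    sample = ["quarterly results转", "待遇International Monetary Fund", "clean value"]
    print(position_counts(sample))
```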

Impact

In a production extraction pipeline processing ~31,000 entities:

  • 1,323 strings affected (~4.2% of output)
  • Corrupted strings required post-processing cleanup
  • 196 entities had to be merged after sanitization (corrupted duplicates of clean entities)

Example Outputs

Input context: "The CEO announced quarterly results"
Expected: "quarterly results"
Actual: "quarterly results转"

Input context: "Environmental policy framework"  
Expected: "Environmental Land Management Schemes (ELMs)"
Actual: "Environmental Land Management Schemes (ELMs)转"

Input context: "International organization"
Expected: "International Monetary Fund"
Actual: "International Monetary Fund转"

Workaround

Per #1238, setting thinkingBudget to a non-zero value (even 1) reportedly reduces this issue. We've also implemented client-side sanitization to strip trailing CJK characters as a defensive measure.
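A minimal sketch of the kind of client-side sanitization described above, not our exact production code. It assumes the extracted values are never expected to start or end with CJK text, so it is unsuitable for pipelines that legitimately extract Chinese, Japanese or Korean strings. The function name `strip_boundary_cjk` is illustrative, and the character class covers only U+4E00 to U+9FFF.

```python
import re

# CJK Unified Ideographs block; covers every character listed in this report.
_CJK = "\u4e00-\u9fff"

# One or more CJK characters at the very start or the very end of the string.
_BOUNDARY_CJK = re.compile(rf"^[{_CJK}]+|[{_CJK}]+$")


def strip_boundary_cjk(text: str) -> str:
    """Remove leading/trailing runs of CJK characters, then surrounding whitespace.

    CJK characters in the middle of a string are left untouched.
    A string made up entirely of CJK characters becomes empty.
    """
    return _BOUNDARY_CJK.sub("", text).strip()


if __name__ == "__main__":
    for raw in ["quarterly results转", "International Monetary Fund转", "待遇ELMs"]:
        print(repr(raw), "->", repr(strip_boundary_cjk(raw)))
```

Sanitized strings can then be used as the key for merging corrupted duplicates back into their clean counterparts.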

Environment

  • google-genai: 1.59.0
  • google-adk: 1.22.1
  • Python: 3.12
  • Region: Vertex AI (europe-west2)

Related Issues

Suggested Investigation

The pattern suggests this is a tokenizer boundary issue where:

  1. BPE tokens for CJK characters are multi-byte sequences
  2. When generation stops mid-token, partial bytes are decoded
  3. These partial sequences happen to form valid CJK codepoints

This would explain why specific characters repeat (they're the "default" decoding of common partial byte patterns).
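One way to start probing this hypothesis is to look at the UTF-8 bytes of the observed characters and at what a truncated decode of those bytes produces. The sketch below does that in plain Python. It says nothing about how the serving stack decodes tokens, which is an unknown here. In Python, truncated prefixes decode to U+FFFD (the replacement character) rather than to another ideograph, so the byte patterns may be more useful for matching against tokenizer vocabulary entries. The helper names are illustrative.

```python
# Characters reported in this issue.
OBSERVED = ["转", "极", "密", "待", "遇", "轉"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")


def truncated_decodes(ch: str) -> list[str]:
    """Decode every proper prefix of ch's UTF-8 bytes, replacing invalid input."""
    raw = ch.encode("utf-8")
    return [raw[:n].decode("utf-8", errors="replace") for n in range(1, len(raw))]


if __name__ == "__main__":
    for ch in OBSERVED:
        prefixes = [ascii(s) for s in truncated_decodes(ch)]
        print(f"U+{ord(ch):04X}", utf8_hex(ch), prefixes)
```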
