morphoBPE

Subword-based comparative linguistics toolkit for analyzing 242+ languages using BPE tokenization.

Paper

This repository contains the implementation for:

"Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets"

Authors: Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov
Venue: ACL 2025 (submitted)
Lab Journal: aglabx/labjournal

Features

BPE Training: Word-only BPE tokenizer (no space tokens) with position tracking
Tokenization Trees: Hierarchical visualization of subword decomposition
Cross-Language Comparison: Merge graph analysis between language-specific tokenizers
Script-Level Analysis: Combined tokenizers for Latin (205 languages) and Cyrillic (37 languages)
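
To make "word-only BPE" concrete, here is a minimal sketch of BPE training over a word-frequency table. It is not the code in `bpe.py`: the function name `train_bpe`, the `{word: frequency}` input, and the lexicographic tie-breaking are assumptions made for illustration, and position tracking is not shown. Merges never cross word boundaries because each word is split into characters on its own and no space token is added.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn up to num_merges merges from a {word: frequency} mapping (hypothetical helper)."""
    # Each word starts as a tuple of single characters; no space token is added.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair left to right.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        words = merged
    return merges


if __name__ == "__main__":
    toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(train_bpe(toy, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The returned list preserves merge order, which is what the tree and comparison features build on.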

Components

Core BPE

File	Description
`bpe.py`	Python BPE trainer with vocab/min-freq modes
`tokenizer.py`	Tokenizer with merge tree visualization
`bpes/bpe.cpp`	C++ BPE implementation
`bpes/bpe_sa.cpp`	Suffix-array optimized BPE

Comparison Tools

File	Description
`compare_merge_structures.py`	Merge graph analysis with networkx/graphviz
`compare_tokenizers.py`	Cross-language tokenizer comparison

Data Pipeline

File	Description
`1_download_uralic_cc.py`	Download Uralic languages from Common Crawl
`2_convert_arrow.py`	Convert to Arrow format
`3_aggregate_texts.py`	Aggregate texts by language
`from_text_to_tfdf.py`	Extract TF-DF from text

Preprocessing (C++)

File	Description
`tf_df.cpp`	Fast TF-DF extraction
`clean_text.cpp`	Text cleaning and normalization
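
As a rough illustration of what TF-DF extraction computes, the sketch below counts term frequency (total occurrences) and document frequency (number of documents containing the term). It assumes one document per string, lowercasing, and whitespace tokenization; the actual rules in `tf_df.cpp` and `from_text_to_tfdf.py` may differ, and the function name `tf_df` is hypothetical.

```python
from collections import Counter


def tf_df(documents):
    """Return (term_frequency, document_frequency) counters for an iterable of texts."""
    tf, df = Counter(), Counter()
    for doc in documents:
        tokens = doc.lower().split()  # assumed tokenization: lowercase + whitespace split
        tf.update(tokens)             # every occurrence counts toward TF
        df.update(set(tokens))        # each document counts at most once toward DF
    return tf, df


if __name__ == "__main__":
    tf, df = tf_df(["kissa istuu", "kissa kissa juoksee"])
    for word in sorted(tf):
        print(word, tf[word], df[word], sep="\t")
```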

Usage

Train BPE tokenizer

# With vocabulary size limit
python bpe.py input.tsv --vocab-size 4096 --output-file vocab.json

# With minimum frequency threshold
python bpe.py input.tsv --min-freq 2 --output-file vocab.json

Tokenize and visualize

from tokenizer import Tokenizer

tok = Tokenizer('vocab.json')

# Get tokenization tree
trees = tok.get_tokenization_tree("промисловість")
for tree in trees:
    tok.print_token_tree(tree)
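
For readers who want to see how a tokenization tree can arise from a merge list, here is a small standalone sketch. It does not use or reproduce the `Tokenizer` class: `build_merge_trees`, `render`, and the `(token, left, right)` node layout are hypothetical, and the merges are assumed to be an ordered list of symbol pairs.

```python
def build_merge_trees(word, merges):
    """Apply merges in order; return one binary tree per final token."""
    # A node is (token, left_child, right_child); single characters are leaves.
    nodes = [(ch, None, None) for ch in word]
    for a, b in merges:
        out, i = [], 0
        while i < len(nodes):
            if i + 1 < len(nodes) and nodes[i][0] == a and nodes[i + 1][0] == b:
                # The merged node keeps both parts as children, recording the history.
                out.append((a + b, nodes[i], nodes[i + 1]))
                i += 2
            else:
                out.append(nodes[i])
                i += 1
        nodes = out
    return nodes


def render(node, depth=0):
    """Return the tree as indented lines, parent first."""
    token, left, right = node
    lines = ["  " * depth + token]
    if left is not None:
        lines += render(left, depth + 1) + render(right, depth + 1)
    return lines


if __name__ == "__main__":
    for tree in build_merge_trees("lower", [("l", "o"), ("lo", "w")]):
        print("\n".join(render(tree)))
```

With these two merges, "lower" ends up as the tokens low, e, r, and the tree for low shows lo and w beneath it, with l and o beneath lo.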

Compare tokenizers

python compare_merge_structures.py vocab_uk.json vocab_ru.json --output-dir analysis/
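
The script above performs merge graph analysis with networkx/graphviz. As a much simpler stand-in, the sketch below compares two merge lists as sets and reports their Jaccard overlap. The function name `compare_merges` and the in-memory list-of-pairs input are assumptions; the layout of `vocab.json` is not described here, so loading it is left out.

```python
def compare_merges(merges_a, merges_b):
    """Set-based comparison of two merge lists (order is ignored in this sketch)."""
    set_a, set_b = set(merges_a), set(merges_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        # Jaccard index: shared merges over all distinct merges; 0.0 if both lists are empty.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Toy merge lists, not taken from any trained tokenizer.
    uk = [("с", "т"), ("н", "а"), ("і", "с")]
    ru = [("с", "т"), ("н", "а"), ("о", "с")]
    report = compare_merges(uk, ru)
    print(report["shared"], round(report["jaccard"], 2))
```

Ignoring merge order discards information that a graph-based analysis keeps, so treat this only as a quick first look at how much two tokenizers have in common.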

Datasets

Wikipedia dumps (320 languages): dumps.wikimedia.org/kiwix/zim/wikipedia/
Processed glottosets: Hugging Face (link TBD)

License

MIT
