Feature/parakeet onnx #1

Open
chuntwdev wants to merge 8 commits into Cyronlee:main from chuntwdev:feature/parakeet-onnx

@chuntwdev
Collaborator

Summary

This PR delivers full on-device STT support with sherpa-onnx, adds selectable local engines/models, and significantly improves local model download reliability and UX.

Core STT integration

  • Integrates sherpa-onnx into the macOS app via XCFramework + bridging header + project linkage updates.
  • Refactors speech backends behind a shared TranscriptionEngine abstraction:
    • AppleSpeechEngine (existing Apple path)
    • ParakeetSpeechEngine (offline local)
    • NemotronStreamingSpeechEngine (streaming local)
  • Adds local engine/model selection architecture:
    • TranscriptionEngineKind: .apple / .local
    • LocalTranscriptionModelKind: Parakeet + Nemotron
    • AppSettings persistence + backward compatibility for legacy engine key.
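The two-level selection and the legacy-key fallback described above could look roughly like the sketch below. Only the names TranscriptionEngineKind, LocalTranscriptionModelKind, and the .apple / .local cases come from this PR; the raw values, the EngineSelection type, the migrateSelection function, and the assumption that the legacy key stored the string "parakeet" are all hypothetical and chosen for illustration.

```swift
import Foundation

// Case names for TranscriptionEngineKind come from the PR description;
// the String raw values are an assumption made for this sketch.
enum TranscriptionEngineKind: String {
    case apple
    case local
}

// The PR lists Parakeet and Nemotron as local models; the case spelling is assumed.
enum LocalTranscriptionModelKind: String {
    case parakeet
    case nemotron
}

// Hypothetical value type bundling the two persisted choices.
struct EngineSelection: Equatable {
    var engine: TranscriptionEngineKind
    var localModel: LocalTranscriptionModelKind
}

/// Maps persisted raw strings to the current selection.
/// Assumption: an older build stored "parakeet" directly under the engine key.
func migrateSelection(storedEngine: String?, storedModel: String?) -> EngineSelection {
    // Fall back to Parakeet when no valid model was stored.
    let model = storedModel.flatMap(LocalTranscriptionModelKind.init(rawValue:)) ?? .parakeet
    switch storedEngine {
    case TranscriptionEngineKind.local.rawValue:
        return EngineSelection(engine: .local, localModel: model)
    case "parakeet":
        // Assumed legacy value: treat it as "local engine, Parakeet model".
        return EngineSelection(engine: .local, localModel: .parakeet)
    default:
        // Missing or unknown values keep Apple Speech as the default engine.
        return EngineSelection(engine: .apple, localModel: model)
    }
}
```

Keeping the migration a pure function of the stored strings makes it easy to unit-test without touching UserDefaults.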

Local model management + download pipeline

  • Refactors LocalModelManager into a spec-driven, multi-model manager.
  • Replaces byte-stream download path with URLSessionDownloadTask delegate pipeline.
  • Adds robust large-file handling:
    • persisted resume data per model/file role
    • bounded transient retry with exponential backoff
    • staging + validation + atomic install/replace
    • cleanup of stale staging/resume artifacts
  • Improves settings UX for local models:
    • per-model picker and state
    • richer progress feedback (bytes/speed/ETA)
    • resume-aware Download action + Cancel
    • added/updated i18n keys for en / zh-Hans.
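As an illustration of "bounded transient retry with exponential backoff", here is a minimal sketch. It is not the code in LocalModelManager: RetryPolicy, its default values, and the choice of which URLError codes count as transient are assumptions made for this example.

```swift
import Foundation

/// Hypothetical retry policy; the limits below are illustrative defaults,
/// not values taken from the PR.
struct RetryPolicy {
    var maxAttempts: Int = 4
    var baseDelay: TimeInterval = 1.0
    var maxDelay: TimeInterval = 30.0

    /// Delay before retry number `attempt` (1-based), or nil once the
    /// retry budget is exhausted ("bounded").
    func delay(forAttempt attempt: Int) -> TimeInterval? {
        guard attempt >= 1, attempt <= maxAttempts else { return nil }
        // Doubles on each attempt: base, 2x base, 4x base, ... capped at maxDelay.
        let exponential = baseDelay * pow(2.0, Double(attempt - 1))
        return min(exponential, maxDelay)
    }
}

/// Treats a small set of URLError codes as transient. Which codes deserve
/// a retry is a judgment call; this list is an assumption.
func isTransient(_ error: Error) -> Bool {
    guard let urlError = error as? URLError else { return false }
    switch urlError.code {
    case .timedOut, .networkConnectionLost, .notConnectedToInternet, .cannotConnectToHost:
        return true
    default:
        return false
    }
}
```

A download delegate could call isTransient on a failure, ask the policy for the next delay, and restart the task from the persisted resume data once that delay has elapsed; a nil delay means the failure should be surfaced to the user.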

Validation-driven fixes included

  • Path handling fix for app support directories: uses path(percentEncoded: false) so that paths containing spaces (such as "Application Support") are not percent-encoded.
  • Download temp-file ownership fix in delegate callback (prevents delayed move failures).
  • Nemotron online config fix: sets bpe_vocab correctly for BPE models.
  • Language selector UX fix: refresh language options when engine changes.
  • Control bar ProgressView layout warning cleanup.
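The path handling fix in the list above relies on Foundation's URL.path(percentEncoded:). A small sketch of the difference follows; the example path and the fileSystemPath helper are hypothetical, and the fallback to url.path on older systems is an assumption about how one might guard the call, not something the PR states.

```swift
import Foundation

/// Returns a plain file-system path suitable for APIs that expect raw paths
/// (for example a C library that opens files by name).
/// Hypothetical helper, not part of the PR.
func fileSystemPath(for url: URL) -> String {
    if #available(macOS 13.0, iOS 16.0, tvOS 16.0, watchOS 9.0, *) {
        // percentEncoded: false keeps the space in "Application Support" as a space.
        return url.path(percentEncoded: false)
    }
    // Older systems: the legacy property also returns an unencoded path.
    return url.path
}

// Hypothetical model location under Application Support.
let modelURL = URL(
    fileURLWithPath: "/Users/example/Library/Application Support/TransFlow/model.onnx",
    isDirectory: false
)
print(fileSystemPath(for: modelURL))
```

With the default percentEncoded: true, the space would come back as %20, which a file-opening API that takes a plain path would presumably fail to resolve.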

Developer workflow improvements

  • Hardens scripts/build-sherpa-onnx.sh:
    • prerequisite checks
    • configurable flags (--version, --archs, --deployment-target, --jobs, --clean, --reclone, --output)
    • deterministic source/tag sync
    • modern cmake -S/-B, cmake --build, cmake --install
    • static lib presence checks + arch validation
    • atomic XCFramework output
    • improved logging/help output
  • Updates README.md / README_EN.md with local STT developer setup instructions.

Why

  • Enable practical offline/local transcription with user-selectable model backends.
  • Keep app bundle small by downloading model assets on demand.
  • Make model downloads resilient and user-friendly (resume/retry/progress visibility).
  • Preserve Apple Speech as default while expanding to local inference workflows.
  • Improve contributor onboarding for local STT development.

Test Plan

  • Build succeeds:
    • DEVELOPER_DIR="/Applications/Xcode.app/Contents/Developer" xcodebuild -project "TransFlow/TransFlow.xcodeproj" -scheme "TransFlow" -configuration Debug -sdk macosx build
  • Engine switching:
    • Apple engine shows multi-language picker.
    • Local engine constrains language to English as expected.
    • Switching back to Apple refreshes language options immediately.
  • Local model downloads:
    • progress, bytes/speed/ETA update correctly.
    • cancel + resume path works.
    • transient failure retry path is exercised.
  • Model lifecycle:
    • not downloaded -> downloading -> ready transitions.
    • delete resets model status.
  • Runtime engines:
    • Parakeet offline path starts and transcribes.
    • Nemotron streaming path initializes and transcribes.
  • Build script sanity:
    • bash -n scripts/build-sherpa-onnx.sh
    • ./scripts/build-sherpa-onnx.sh --help
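The model lifecycle checks in the test plan (not downloaded -> downloading -> ready, with delete resetting the status) can be expressed as a small state machine. The sketch below is only a way to reason about the expected transitions; ModelStatus, ModelEvent, and nextStatus are hypothetical names, not types from the PR.

```swift
// Hypothetical status type mirroring the three states named in the test plan.
enum ModelStatus: Equatable {
    case notDownloaded
    case downloading(progress: Double)
    case ready
}

// Hypothetical events that drive the transitions.
enum ModelEvent {
    case startDownload
    case progress(Double)
    case installSucceeded
    case cancel
    case delete
}

/// Pure transition function: returns the status that follows `event`.
/// Events that make no sense in the current state leave it unchanged.
func nextStatus(from status: ModelStatus, on event: ModelEvent) -> ModelStatus {
    switch (status, event) {
    case (.notDownloaded, .startDownload):
        return .downloading(progress: 0)
    case (.downloading, .progress(let fraction)):
        // Clamp to 0...1 so a bad byte count cannot produce an invalid fraction.
        return .downloading(progress: min(max(fraction, 0), 1))
    case (.downloading, .installSucceeded):
        return .ready
    case (.downloading, .cancel):
        return .notDownloaded
    case (_, .delete):
        // Delete resets the model status from any state.
        return .notDownloaded
    default:
        return status
    }
}
```

Each bullet under "Model lifecycle" then corresponds to one or two calls to nextStatus whose result can be asserted directly.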

Notes

  • Download integrity verification (e.g., hash checks) is intentionally deferred for a later iteration.
  • Apple speech model flow remains intact and unchanged in default behavior.

chuntwdev and others added 8 commits February 9, 2026 16:24
- Introduced a new transcription engine option for local speech recognition.
- Added model management features including download, validation, and status tracking.
- Updated settings UI to allow engine selection and model management.
- Ensured compatibility with existing Apple Speech backend as default.
- Included localization for new UI elements and model statuses.
- Added support for local Parakeet TDT backend using sherpa-onnx.
- Implemented model download, validation, and status tracking features.
- Updated settings UI to allow selection between Apple Speech and Parakeet engines.
- Included localization for new UI elements and model statuses.
- Introduced a bridging header for integrating C API with Swift.

Co-authored-by: Cursor <cursoragent@cursor.com>
…gine

- Updated ParakeetSpeechEngine to handle errors during recognizer and VAD initialization.
- Enhanced memory management by optimizing sample buffer handling and reducing unnecessary copies.
- Adjusted VAD parameters for improved performance.
- Added functionality to emit detected speech segments more efficiently.
- Updated SettingsView to refresh model statuses based on selected engine.
- Added support for local ASR models, including Nemotron and Parakeet, with corresponding localization.
- Updated AppSettings to manage selected local model and ensure backward compatibility.
- Enhanced LocalModelManager for improved model status tracking and management.
- Refactored TransFlowViewModel and SettingsView to accommodate new local model options and statuses.
- Introduced NemotronStreamingSpeechEngine for real-time speech recognition.
- Improved error handling and user feedback in the settings interface.
- Introduced LocalModelDownloadDetail struct to track download progress, speed, and estimated time.
- Updated LocalModelManager to handle download cancellation and resume functionality.
- Enhanced SettingsView to display detailed download progress and allow users to cancel ongoing downloads.
- Improved localization strings for new UI elements related to model management.
- Adjusted VAD parameters in ParakeetSpeechEngine for better performance.
- Added a new state variable for app settings in MainView.
- Implemented an onChange listener for selectedEngine to trigger loading of supported languages asynchronously.
Improve build script reliability with prerequisite checks, deterministic source sync, configurable flags, and atomic xcframework output. Document local STT developer setup in both Chinese and English READMEs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Cyronlee
Owner

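Thanks for the PR, I'll give it a try 🫡

Also, I'd actually prefer to use the FluidAudio framework. ONNX is too low-level and has to be compiled every time. What do you think?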

@Cyronlee
Owner

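Nice, I tried it and it works, but there are a few small issues:

I've also tried the Parakeet TDT model before, and it doesn't seem suited to real-time transcription. It needs more chunking to process the audio, so without special handling the real-time area shows nothing, as in the screenshot: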
image

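I feel TDT is better suited to post-processing, for example polishing subtitles after the fact.

Also, the Nemotron Streaming 0.6B model is fairly mediocre in quality; my test is shown in the screenshot: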
image

Reference: https://github.com/FluidInference/FluidAudio/blob/main/Documentation/Models.md

@chuntwdev
Collaborator Author

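Agreed. In practice, neither of these two models performs as well as Speech Analyzer, so this PR can be closed.

FluidAudio looks very good. They support Parakeet EOU, which can do streaming, but it doesn't seem to have auto capitalization or punctuation. I'll go try it out first.

BTW, the app is really well made and very useful. Thanks!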

2 participants