Revert AI image description work #19425
base: beta
Pull request overview
This pull request reverts the AI image descriptions feature that was previously introduced across multiple PRs (#18475, #19036, #19024, #19055, #19057, #19178, #19243, #19327, and partially #19342). The revert is motivated by the low quality of captions produced by the three-year-old model, and by technical challenges: the numpy dependency adds RAM and storage overhead, and the ONNX runtime has ARM64 compatibility issues.
Key Changes:
- Removes on-device AI image captioning functionality and the NVDA+g gesture
- Eliminates numpy and onnxruntime dependencies from the codebase
- Removes the _localCaptioner module and all related GUI components
Reviewed changes
Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| uv.lock | Removes numpy, onnxruntime, onnx, coloredlogs, flatbuffers, ml-dtypes, mpmath, sympy and related dependencies; updates remaining packages to newer versions |
| pyproject.toml | Removes numpy, onnxruntime, and onnx from dependencies and system-tests |
| source/setup.py | Moves numpy from packages to excludes list; removes numpy-specific includes |
| source/config/configSpec.py | Removes automatedImageDescriptions config section; fixes indentation inconsistencies |
| source/config/\_\_init\_\_.py | Removes automatedImageDescriptions from profile sections |
| source/NVDAState.py | Removes modelsDir property |
| source/core.py | Removes _localCaptioner initialization and termination |
| source/globalCommands.py | Removes image description scripts and SCRCAT_IMAGE_DESC category |
| source/gui/\_\_init\_\_.py | Removes LocalCaptionerSettingsPanel references |
| source/gui/settingsDialogs.py | Removes LocalCaptionerSettingsPanel class |
| source/gui/blockAction.py | Removes SCREEN_CURTAIN context check |
| source/_localCaptioner/* | Removes entire module including captioner, downloader, and UI components |
| source/gui/_localCaptioner/* | Removes dialog implementations |
| tests/unit/test_localCaptioner/* | Removes unit tests |
| tests/system/robot/automatedImageDescriptions.* | Removes system tests |
| tests/system/nvdaSettingsFiles/standard-doLoadMockModel.ini | Removes test configuration |
| tests/system/libraries/SystemTestSpy/mockModels.py | Removes mock model generator |
| tests/system/libraries/SystemTestSpy/configManager.py | Removes model configuration logic |
| user_docs/en/userGuide.md | Removes Image Captioner section and references |
| user_docs/en/changes.md | Removes feature announcement from changelog |
| .github/workflows/testAndPublish.yml | Removes imageDescriptions test job |
It is worth noting that, in addition to the default model, the current architecture can run Mozilla's distilvit with zero code changes, and distilvit is not a three-year-old model. At the same time, the proposed "better" models (BLIP-2 and GIT-base-COCO) are from roughly the same period as vit-gpt2-image-captioning. Their performance and output quality may still need to be validated, and they may not actually perform as well as expected.

To be honest, running models on consumer-grade CPUs is not a major focus of current industry and research efforts. As a result, only a limited number of transformer-based models can produce results within three seconds on a CPU. That said, we still hope that more capable multimodal models will emerge in the future to address this gap.

Regarding the use of numpy: if having it as a dependency is truly unacceptable, then in the future any offline model-based translation or OCR features would also be unable to rely on numpy and would have to be implemented in C++ instead. This would represent a significant amount of work. Additionally, introducing too many submodules could make the repository increasingly bloated and harder to maintain.
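To illustrate the "zero code changes" point above: if the captioner loads its ONNX graphs from a configurable model directory, swapping vit-gpt2-image-captioning for distilvit amounts to pointing that directory at a different export. The following Python sketch is hypothetical (the helper name, file names, and directory layout are assumptions, not NVDA's actual code); it uses the standard onnxruntime Python API:

```python
import os

import onnxruntime as ort


def loadCaptionerSessions(modelDir: str) -> tuple[ort.InferenceSession, ort.InferenceSession]:
	"""Load the vision encoder and text decoder from modelDir.

	Any encoder-decoder captioning model exported to the same ONNX layout
	(e.g. vit-gpt2-image-captioning or Mozilla's distilvit) could be dropped
	into modelDir without changing this code.
	"""
	# CPU-only execution, matching the on-device design discussed above.
	providers = ["CPUExecutionProvider"]
	encoder = ort.InferenceSession(os.path.join(modelDir, "encoder_model.onnx"), providers=providers)
	decoder = ort.InferenceSession(os.path.join(modelDir, "decoder_model.onnx"), providers=providers)
	return encoder, decoder
```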
Reverts PR
Reverts:
Issues fixed
Fixes #19298
Issues reopened
Reopens #16281
Reason for revert / Can this PR be reimplemented? If so, what is required for the next attempt
The current implementation of AI image descriptions yields low-quality captions from a three-year-old model (see #19298).
The current implementation also requires numpy, which consumes significant RAM, slows initialization, and increases the size of the installer.
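For what it's worth, the RAM cost attributed to numpy can be sanity-checked empirically; a minimal sketch (assuming psutil is available; not part of this PR, and results will vary by platform and version):

```python
import os

import psutil

proc = psutil.Process(os.getpid())
rssBefore = proc.memory_info().rss
import numpy  # noqa: E402,F401  # the import whose cost we are measuring
rssAfter = proc.memory_info().rss
print(f"importing numpy added ~{(rssAfter - rssBefore) / 2**20:.1f} MiB RSS")
```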
An attempt was made to convert this to C++ using WinML and Windows ONNX runtimes as per #18662.
This would have removed numpy, and improved flexibility for using different models in the future.
Unfortunately, this was not found to be feasible, as the ONNX C++ runtime fails to work under 64-bit emulation on ARM (microsoft/onnxruntime#15403).
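For reference, any future attempt would likely need to gate ONNX-based features on the native CPU architecture, since an x64 NVDA process on an ARM64 machine runs under emulation. A minimal, hypothetical sketch (the helper name is an assumption) using the kernel32 IsWow64Process2 API, available from Windows 10 1709:

```python
import ctypes

IMAGE_FILE_MACHINE_ARM64 = 0xAA64


def isNativeArm64() -> bool:
	"""Return True if the native machine is ARM64, even when this
	process itself is running under x64 emulation. Windows-only."""
	processMachine = ctypes.c_ushort()
	nativeMachine = ctypes.c_ushort()
	kernel32 = ctypes.windll.kernel32
	ok = kernel32.IsWow64Process2(
		kernel32.GetCurrentProcess(),
		ctypes.byref(processMachine),
		ctypes.byref(nativeMachine),
	)
	# If the call fails (e.g. pre-1709 Windows), assume a non-ARM64 host.
	return bool(ok) and nativeMachine.value == IMAGE_FILE_MACHINE_ARM64
```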
Given this ARM64 limitation, we have the following options for image descriptions:
All of these options require a significant amount of work.
As such, sadly this feature is not ready for a stable release.
Instead, this code will be moved to a feature branch until ONNX C++ matures, for example by fixing microsoft/onnxruntime#15403.
Additionally, ONNX C++ runtimes are only available through the experimental 2.0 version of the Windows App SDK, and require you to build your own headers from it.
I think this feature will be blocked until microsoft/onnxruntime#15403 is fixed and the 2.0 version of the Windows App SDK becomes stable.
Future re-implementations should also consider using higher quality, more modern models.