
Conversation

@seanbudd (Member) commented Jan 9, 2026

Reverts PR

Reverts:

Issues fixed

Fixes #19298

Issues reopened

Reopens #16281

Reason for revert / Can this PR be reimplemented? If so, what is required for the next attempt

The current implementation of AI image descriptions yields low-quality captions from a three-year-old model (see #19298).
It also requires numpy, which hogs RAM, slows initialization, and increases the size of the installer.
An attempt was made to convert this to C++ using WinML and the Windows ONNX runtimes, as per #18662.
This would have removed numpy and improved flexibility for using different models in the future.
Unfortunately, this was not found to be feasible, as the ONNX C++ runtime fails to work via 64-bit emulation on ARM (microsoft/onnxruntime#15403).
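For context, here is a minimal sketch of the general shape of a Python onnxruntime captioning path like the one being reverted, and why numpy ends up as a hard dependency. This is not the removed NVDA module; the model path, preprocessing, and tensor names are hypothetical.

```python
# Minimal sketch of a Python onnxruntime captioning path (not the removed NVDA code).
import numpy as np
import onnxruntime as ort
from PIL import Image


def caption_image(model_path: str, image_path: str) -> np.ndarray:
	# onnxruntime's Python API exchanges tensors as numpy arrays,
	# which is why numpy is a hard dependency of this approach.
	session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

	# Typical ViT-style preprocessing: resize, scale to [0, 1], NCHW layout.
	image = Image.open(image_path).convert("RGB").resize((224, 224))
	pixels = np.asarray(image, dtype=np.float32) / 255.0
	pixels = pixels.transpose(2, 0, 1)[np.newaxis, ...]  # shape (1, 3, 224, 224)

	input_name = session.get_inputs()[0].name
	outputs = session.run(None, {input_name: pixels})
	return outputs[0]  # e.g. encoder features or generated token IDs
```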

This means we have the following options for image descriptions:

  1. Continue to use the Python onnxruntime and accept the RAM and storage hits, and instead improve the quality of the captioner with better models such as git-base-coco or blip2 (see the sketch after this list).
  2. Wait until MS builds ARM64EC into the C++ ONNX runtime (blocked by "OnnxRuntime for Windows on Arm as Arm64EC variant?", microsoft/onnxruntime#15403).
  3. Attempt to build our own fork of ONNX with ARM64EC.
  4. Build a separate ARM-native installer of NVDA and offer it as an alternative, so that ARM devices can do image descriptions with numpy.
  5. Release the C++ implementation of the feature without support for ARM devices.
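To make option 1 concrete, here is a minimal sketch of how a candidate replacement model such as git-base-coco could be evaluated locally before any ONNX export. It assumes the Hugging Face transformers library and PyTorch are available as an offline evaluation aid; this is not code that would ship in NVDA.

```python
# Offline evaluation sketch for a candidate captioning model (option 1).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/git-base-coco"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# "example.jpg" is a placeholder test image.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```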

All of these options require a significant amount of work.
As such, sadly this feature is not ready for a stable release.

Instead, this code will be moved to a feature branch until ONNX C++ matures, for example by fixing microsoft/onnxruntime#15403.
Additionally, the ONNX C++ runtimes are only available through the experimental 2.0 version of the Windows App SDK, which requires you to build your own headers from it.
I think this feature will be blocked until microsoft/onnxruntime#15403 is implemented and the 2.0 version of the Windows App SDK becomes stable.
Future re-implementations should also consider using higher-quality, more modern models.

Copilot AI review requested due to automatic review settings January 9, 2026 05:28
@seanbudd seanbudd requested review from a team as code owners January 9, 2026 05:28
@seanbudd seanbudd added this to the 2026.1 milestone Jan 9, 2026
@seanbudd seanbudd changed the title from "Revert image description work" to "Revert AI image description work" Jan 9, 2026
Copilot AI left a comment

Pull request overview

This pull request reverts the AI image descriptions feature that was previously introduced across multiple PRs (#18475, #19036, #19024, #19055, #19057, #19178, #19243, #19327, and partial #19342). The revert is motivated by quality concerns with the 3-year-old model producing low-quality captions, and technical challenges with numpy dependencies causing RAM/storage overhead and ARM64 compatibility issues.

Key Changes:

  • Removes on-device AI image captioning functionality and the NVDA+g gesture
  • Eliminates numpy and onnxruntime dependencies from the codebase
  • Removes the _localCaptioner module and all related GUI components

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `uv.lock` | Removes numpy, onnxruntime, onnx, coloredlogs, flatbuffers, ml-dtypes, mpmath, sympy and related dependencies; updates remaining packages to newer versions |
| `pyproject.toml` | Removes numpy, onnxruntime, and onnx from dependencies and system-tests |
| `source/setup.py` | Moves numpy from packages to the excludes list; removes numpy-specific includes |
| `source/config/configSpec.py` | Removes the automatedImageDescriptions config section; fixes indentation inconsistencies |
| `source/config/__init__.py` | Removes automatedImageDescriptions from profile sections |
| `source/NVDAState.py` | Removes the modelsDir property |
| `source/core.py` | Removes _localCaptioner initialization and termination |
| `source/globalCommands.py` | Removes image description scripts and the SCRCAT_IMAGE_DESC category |
| `source/gui/__init__.py` | Removes LocalCaptionerSettingsPanel references |
| `source/gui/settingsDialogs.py` | Removes the LocalCaptionerSettingsPanel class |
| `source/gui/blockAction.py` | Removes the SCREEN_CURTAIN context check |
| `source/_localCaptioner/*` | Removes the entire module, including captioner, downloader, and UI components |
| `source/gui/_localCaptioner/*` | Removes dialog implementations |
| `tests/unit/test_localCaptioner/*` | Removes unit tests |
| `tests/system/robot/automatedImageDescriptions.*` | Removes system tests |
| `tests/system/nvdaSettingsFiles/standard-doLoadMockModel.ini` | Removes test configuration |
| `tests/system/libraries/SystemTestSpy/mockModels.py` | Removes the mock model generator |
| `tests/system/libraries/SystemTestSpy/configManager.py` | Removes model configuration logic |
| `user_docs/en/userGuide.md` | Removes the Image Captioner section and references |
| `user_docs/en/changes.md` | Removes the feature announcement from the changelog |
| `.github/workflows/testAndPublish.yml` | Removes the imageDescriptions test job |


@tianzeshi-study (Contributor) commented:

> The current implementation of AI image descriptions yields low-quality captions from a three-year-old model (see #19298). It also requires numpy, which hogs RAM, slows initialization, and increases the size of the installer. An attempt was made to convert this to C++ using WinML and the Windows ONNX runtimes, as per #18662. This would have removed numpy and improved flexibility for using different models in the future. Unfortunately, this was not found to be feasible, as the ONNX C++ runtime fails to work via 64-bit emulation on ARM (microsoft/onnxruntime#15403).
>
> This means we have the following options for image descriptions:
>
> 1. Continue to use the Python onnxruntime and accept the RAM and storage hits, and instead improve the quality of the captioner with better models such as [git-base-coco](https://huggingface.co/microsoft/git-base-coco) or [blip2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).

It is worth noting that, in addition to the default model, the current architecture can run Mozilla's distilvit, which is not a three-year-old model, with zero code changes.

At the same time, the proposed “better” models (BLIP-2 and GIT-base-COCO) are from roughly the same period as vit-gpt2-image-captioning. Their performance and output quality may still need to be validated, and they may not actually perform as well as expected.

To be honest, running models on consumer-grade CPUs is not a major focus of current industry and research efforts. As a result, there are only a limited number of transformer-based models that can produce results within three seconds on a CPU.
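If candidate models are compared, a rough CPU latency check along these lines could help verify the three-second budget mentioned above. This is only a sketch; the `caption_image` callable and the image path are hypothetical placeholders for whatever captioner is under test.

```python
# Rough CPU latency check for a candidate captioner (sketch only).
import statistics
import time


def measure_caption_latency(caption_image, image_path: str, runs: int = 5) -> float:
	caption_image(image_path)  # warm-up run, excluded from timing
	timings = []
	for _ in range(runs):
		start = time.perf_counter()
		caption_image(image_path)
		timings.append(time.perf_counter() - start)
	return statistics.median(timings)


# Example usage: flag models that exceed the ~3 second CPU budget discussed above.
# median = measure_caption_latency(my_captioner, "example.jpg")
# print(f"median latency: {median:.2f}s", "(too slow)" if median > 3.0 else "(ok)")
```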

That said, we still hope that more capable multimodal models will emerge in the future to address this gap.

Regarding the use of numpy:

If having it as a dependency is truly unacceptable, then in the future any offline model-based translation or OCR features would also be unable to rely on numpy and would have to be implemented in C++ instead. This would represent a significant amount of work. Additionally, introducing too many submodules could make the repository increasingly bloated and harder to maintain.
