Deferred Diffusion is a self-hosted, scalable AI inference stack with a fully typed, testable API. It supports local GPU workers and can route tasks to external AI providers. The system is containerized, automatically downloads all models and dependencies, and is stateless, allowing tasks to run across multiple workers without relying on local file paths. This makes deployments predictable, cross-platform, and easy to scale.
It provides a modular API and worker architecture built with FastAPI and Celery, letting local models and external providers run seamlessly in the same system. Workers can execute:
- Local ML pipelines using the Python ecosystem (e.g., diffusers, PyTorch)
- External inference tasks via provider APIs (currently Replicate and OpenAI)
- Optional advanced workflows using a ComfyUI sidecar for user-driven pipelines (experimental, WIP)
Clients interact with the API through typed REST endpoints with a built-in Swagger UI for inspection and testing.
Example Houdini and Nuke clients are included to demonstrate integration into node-based VFX pipelines.
- No dependency on unverified UIs; all interaction is via the API or official clients.
- Air-gap ready: API server and workers can run in isolated networks, exposing only necessary ports and connections to external AI providers.
- Controlled external access: Only approved providers (Replicate and OpenAI) are called via their APIs. Uploaded data is retained only as long as necessary to complete the inference and is deleted soon after, minimizing exposure.
- Traceable and reproducible: Local models are version-controlled in code; no downloading from random external repositories.
- Clients / Workstations: Need no heavy GPUs and never download models or call provider APIs directly.
- Server / Workers: Do not require access to your main network drives, maintaining strong isolation and clear boundaries.
- ComfyUI Sidecar (Optional): Runs as a sidecar on the worker host and is disabled by default. Because it allows dynamic code execution via custom nodes, it requires explicit enablement, localhost-only access, reduced privileges, and manual one-way syncing. (Can be further hardened in production if required.)
sequenceDiagram
participant Client
participant API
participant Broker
participant Worker as Worker GPU/CPU Compute
Client->>API: POST /images/create
API->>Broker: Queue task
API->>Client: Return task_id (202 Accepted)
Broker->>Worker: Pick up task
Note over Worker: Validate and build context
Note over Worker: Run inference / Call external
Worker->>Broker: Store result
Note over Client: Client polls for completion
Client->>API: GET /images/{task_id}
API<<->>Broker: Retrieve task result
API->>Client: Base64 image
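As a concrete illustration of this flow, a minimal client might look like the sketch below. The API-key header name and the request/response field names (`prompt`, `model`, `task_id`, `status`, `image`) are assumptions for illustration only; the generated clients and the Swagger UI define the real schema.

```python
# Minimal sketch of the submit-then-poll flow shown above.
# Header and field names are illustrative assumptions, not the authoritative schema.
import base64
import os
import time

import requests

API = os.environ.get("DDIFFUSION_API_ADDRESS", "http://127.0.0.1:5000")
HEADERS = {"X-API-Key": os.environ["DDIFFUSION_API_KEY"]}  # assumed header name

# Submit the task; the API returns a task_id immediately (202 Accepted).
resp = requests.post(
    f"{API}/images/create",
    headers=HEADERS,
    json={"prompt": "a lighthouse at dusk", "model": "flux-1"},  # assumed fields
    timeout=30,
)
resp.raise_for_status()
task_id = resp.json()["task_id"]

# Poll until the worker has stored a result.
while True:
    result = requests.get(f"{API}/images/{task_id}", headers=HEADERS, timeout=30)
    result.raise_for_status()
    body = result.json()
    if body.get("status") == "SUCCESS":                    # assumed status field/value
        image_bytes = base64.b64decode(body["image"])      # assumed result field
        with open("result.png", "wb") as f:
            f.write(image_bytes)
        break
    time.sleep(2)
```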
This feature enables the use of a ComfyUI sidecar to execute advanced, user-driven pipelines with support for patching workflows, modifying inputs, and updating files as needed.
We still aim to enforce stateless operations and typing validation as much as possible from the client’s perspective. The sidecar can be deployed on a separate network with limited filesystem access, maintaining strong isolation.
This diagram shows how a Worker interacts with the ComfyUI sidecar. It extends the standard domain/task flow by integrating Comfy workflows while keeping the core API → Broker → Worker logic consistent. All interactions with the ComfyUI sidecar are request scoped and stateless from the core system’s perspective.
sequenceDiagram
participant Client
participant API
participant Broker
participant Worker
participant Sidecar as Comfy (Sidecar)
Client->>API: POST /workflows/create
API->>Broker: Queue task
API->>Client: Return task_id (202 Accepted)
Broker->>Worker: Pick up task
Note over Worker: Validate and build context
Note over Worker: Patch Workflow
Worker->>Sidecar: POST /upload/image
Note over Sidecar: Base64 files now local
Worker->>Sidecar: POST /prompt
Sidecar->>Worker: Return prompt_id (200 OK)
Note over Sidecar: Run inference
Note over Worker: Worker polls for completion
Worker->>Sidecar: GET /history/{prompt_id}
Sidecar->>Worker: Base64 data
Worker->>Broker: Store result
Note over Client: Client polls for completion
Client->>API: GET /workflows/{task_id}
API<<->>Broker: Retrieve task result
API->>Client: Base64 image
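The worker-side half of this exchange is sketched below. It assumes the sidecar's standard HTTP endpoints shown in the diagram and a sidecar reachable on localhost; the port, helper name, and field handling are illustrative.

```python
# Sketch of the Worker <-> ComfyUI sidecar exchange from the diagram.
# Port, function name, and polling details are illustrative assumptions.
import time

import requests

COMFY = "http://127.0.0.1:8188"  # sidecar is localhost-only by design


def run_patched_workflow(workflow: dict, image_name: str, image_bytes: bytes) -> dict:
    # 1. Upload any input files so the patched workflow can reference them locally.
    requests.post(
        f"{COMFY}/upload/image",
        files={"image": (image_name, image_bytes)},
        timeout=60,
    ).raise_for_status()

    # 2. Queue the (already patched) workflow graph.
    resp = requests.post(f"{COMFY}/prompt", json={"prompt": workflow}, timeout=60)
    resp.raise_for_status()
    prompt_id = resp.json()["prompt_id"]

    # 3. Poll the history endpoint until the run appears with its outputs.
    while True:
        history = requests.get(f"{COMFY}/history/{prompt_id}", timeout=60).json()
        if prompt_id in history:
            return history[prompt_id]["outputs"]
        time.sleep(1)
```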
To pull and run the latest release:

```bash
docker compose down
docker compose -f docker-compose.release.yml pull
docker compose -f docker-compose.release.yml up -d --no-build
```

Or run the make command:

```bash
make up-latest-release
```

For production deployment instructions, see DEPLOYMENT.md.
All services run in Docker containers - this ensures consistent environments and avoids duplicating model downloads across different setups. Nothing needs to run directly on the host machine except Docker and the client applications.
To build and run the core API and Workers:

```bash
make all
```

To build and run the optional ComfyUI sidecar (required for workflow tasks):

```bash
make up-comfy
```

Only a minimal local venv is required, providing IntelliSense on the packages, integration-test calls, and client generation.

```bash
./start_venv_setup.bat
```

Or create your own environment and install the requirements. PyTorch is not added directly to the requirements because the container base image provides it, which also means you do not need the CUDA build locally.

```bash
pip install torch torchvision torchaudio
pip install -r api/requirements.txt
pip install -r workers/requirements.txt
```

Pytest is used for integration tests confirming the models run. You can run them via the Makefile:

```bash
make test-worker
make test-it-tests
```

See the Makefile for more info.

A GitHub Action handles releases for any v*.*.* tag. To make a local release, you can also run the make commands:

```bash
make create-release
make tag-and-push
```

- Storage: An NVMe drive with at least 500GB of available space is recommended.
- GPU: NVIDIA GPU with at least 12GB VRAM; 24GB recommended (tested with RTX 3080 Ti, A4000, RTX 3090, RTX 5090)
- RAM: Around 48-64 GB should be plenty for all containers.
- Environment Variables: Ensure all required environment variables are set on the host.
On the server hosting the containers:

```bash
OPENAI_API_KEY=your-openai-key            # For OpenAI services
REPLICATE_API_TOKEN=your-replicate-token  # For Replicate API access
HF_TOKEN=your-huggingface-token           # For Hugging Face model access
DDIFFUSION_ADMIN_KEY=*******              # Admin key for managing API keys
```

Note: You must use the `DDIFFUSION_ADMIN_KEY` to create your first API key via the `/admin/keys` endpoint before you can use the clients or Swagger UI.
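Bootstrapping that first key might look roughly like the sketch below. The HTTP method, header name, payload, and response fields are assumptions; check the Swagger UI exposed by the API server for the authoritative schema.

```python
# Hypothetical sketch: create the first API key using the admin key.
# Header name, payload, and response shape are assumptions -- verify via the Swagger UI.
import os

import requests

api = os.environ["DDIFFUSION_API_ADDRESS"]        # e.g. http://127.0.0.1:5000
admin_key = os.environ["DDIFFUSION_ADMIN_KEY"]

resp = requests.post(
    f"{api}/admin/keys",
    headers={"X-Admin-Key": admin_key},            # assumed header name
    json={"name": "my-first-client"},              # assumed payload
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the returned key becomes DDIFFUSION_API_KEY for the clients
```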
On the client machines where the tool sets are used:
```bash
DDIFFUSION_API_ADDRESS=http://127.0.0.1:5000  # API server address
DDIFFUSION_API_KEY=*******                    # API key for client authentication
```

This project follows a feature-based structure, grouping related components by domain (images, texts, videos). This approach ensures a clear separation of concerns and improves maintainability, scalability, and collaboration.
We try to use plural resource names to adhere to REST best practices.
- All components related to a specific AI task (`images`, `texts`, `videos`) are grouped together.
- They are grouped by the main data type they return, but can take multiple input types.
  - e.g., `images` can accept image and text inputs but always returns image-based data.
- Eliminates the need to navigate across multiple directories to understand a feature.
- New developers can quickly locate relevant code without confusion.
- AI models often require domain-specific logic. Keeping `schemas.py`, `context.py`, and `tasks/` in the same module makes it easier to extend functionality.
- If a new AI domain (`audio`, `3D`, etc.) is introduced, the structure remains consistent: just duplicate the existing pattern.
/api
│── /images # Grouped by results type
│ ├── schemas.py # ✅ Pydantic schemas (data validation)
│ ├── router.py # ✅ API routes (FastAPI) Calls worker tasks
│── /texts
│ ├── ...
│── /videos
│ ├── ...
│── /workflows # Flexible user-driven ComfyUI workflows (experimental, WIP)
│ ├── ...
│── /common # ✅ Shared components
│── /utils # ✅ General-purpose utilities (helpers, formatters, etc.)
│── /tests # ✅ Tests mirror the /api structure
│── main.py # ✅ FastAPI entry point
│── worker.py # ✅ Celery
│── pytest.ini # ✅ Test configuration
/workers
│── /images # Grouped by results type
│ ├── local/ # ✅ Local AI model pipeline tasks (GPU queue)
│ ├── external/ # ✅ External AI model pipeline tasks (CPU queue)
│ ├── schemas.py # ✅ Pydantic schemas (data validation mirrors from API)
│ ├── context.py # ✅ Business logic layer
│ ├── tasks.py # ✅ Celery tasks route to local or external tasks. Name should match module
│── /texts
│ ├── ...
│── /videos
│ ├── ...
│── /workflows # Validates and calls the headless ComfyUI sidecar (experimental, WIP)
│ ├── ...
│── /common # ✅ Shared components
│── /utils # ✅ General-purpose utilities (helpers, formatters, etc.)
│── /tests # ✅ Tests mirror the /workers structure
│── worker.py # ✅ Celery
│── pytest.ini # ✅ Test configuration
/clients
│── /it_tests
│ ├── generated/ # generated client
│ ├── tests/
│── /houdini
│ ├── python/generated/ # generated client
│── /nuke
│ ├── python/generated/ # generated client
│── openapi.json # API spec
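To show how the `/workers` layout above is intended to fit together, here is a hypothetical sketch of a `tasks.py` module whose Celery task name mirrors its module path and which routes to the local (GPU queue) or external (CPU queue) implementation. The helper names (`build_context`, `flux_pipeline`, `replicate_pipeline`, `ImageRequest`) and the "-pro" routing condition are assumptions, not the real implementation.

```python
# Hypothetical workers/images/tasks.py sketch.
# Names and the routing condition are illustrative assumptions.
from celery import shared_task

from .context import build_context          # assumed business-logic layer
from .external import replicate_pipeline    # assumed external (CPU queue) implementation
from .local import flux_pipeline            # assumed local (GPU queue) implementation
from .schemas import ImageRequest           # mirrors the API-side schema


@shared_task(name="workers.images.tasks.create")
def create(payload: dict) -> dict:
    request = ImageRequest(**payload)        # validate against the mirrored schema
    context = build_context(request)

    if request.model.endswith("-pro"):       # assumed: "-pro" variants run externally
        return replicate_pipeline.run(context)
    return flux_pipeline.run(context)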
Example clients for Houdini and Nuke are provided in the /clients directory.
See clients/README.md for detailed setup instructions.
An example agentic layer, still somewhat experimental, demonstrates connecting to the MCP server.
See agentic/README.md for more information.
User-facing model choices are simple names like "flux-1" or "flux-1-pro". The actual model calls and implementations are defined in the worker pipeline. Worker tasks follow these user-driven names but may share common logic for variants.
For example, "flux-1" might internally use:
- "black-forest-labs/FLUX.1-Krea-dev"
- "black-forest-labs/FLUX.1-Kontext-dev"
- "black-forest-labs/FLUX.1-Fill-dev"
Depending on the inputs (e.g., whether an image is provided), we internally route to the most appropriate model variant.
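A hedged sketch of that internal routing, using the FLUX variants listed above; the exact conditions and the `resolve_flux_checkpoint` helper are assumptions for illustration.

```python
# Illustrative sketch: map the user-facing "flux-1" choice to an internal
# checkpoint based on the inputs provided. Conditions and names are assumptions.
from typing import Optional


def resolve_flux_checkpoint(image: Optional[bytes], mask: Optional[bytes]) -> str:
    if image is not None and mask is not None:
        return "black-forest-labs/FLUX.1-Fill-dev"      # inpainting / fill
    if image is not None:
        return "black-forest-labs/FLUX.1-Kontext-dev"   # image-conditioned editing
    return "black-forest-labs/FLUX.1-Krea-dev"          # plain text-to-image
```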
We avoid cluttering user model choices with minor versions (.1, .2, etc.) and instead select the best available minor version. This approach allows us to properly test and verify model behaviors for both external and local models without requiring users to understand implementation details.
The model pipelines themselves serve as the source of truth for what models are actually used. This is especially important given various optimizations and edge cases that may apply.
Model definitions are version-controlled in code, not loaded dynamically from configuration files. We match Celery task names to the module names for clarity.
This design choice ensures:
- Full test coverage and deterministic behavior across releases
- Stable API contracts between `/api` and `/workers`
- Clear traceability between user-facing model identifiers and their actual implementations
Developers who want to extend or modify available models can do so by editing the typed definitions directly in code:
- `api/images/schemas.py`
- `workers/images/tasks.py`
- `workers/images/local/...`
Each new model entry should include:
- A Pydantic schema entry in `ModelName`
- A corresponding task or pipeline implementation
- Updated tests under `tests/images`
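For instance, adding a hypothetical model might touch the typed definitions roughly like this; the `MY_NEW_MODEL` entry, the module name, and the `run` stub are purely illustrative.

```python
# api/images/schemas.py -- add the user-facing identifier to the typed enum.
# The new value below is a hypothetical example, not a shipped model.
from enum import Enum


class ModelName(str, Enum):
    FLUX_1 = "flux-1"
    FLUX_1_PRO = "flux-1-pro"
    MY_NEW_MODEL = "my-new-model"   # hypothetical new entry


# workers/images/local/my_new_model.py -- a matching pipeline stub (illustrative).
def run(context) -> bytes:
    """Load the pipeline, run inference, and return encoded image bytes."""
    raise NotImplementedError
```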
This deliberate coupling between model definitions, pipelines, and tests is what makes deferred-diffusion reliable and reproducible for self-hosted AI inference.