2 changes: 1 addition & 1 deletion docs/architecture/authentication.md
@@ -60,7 +60,7 @@ JWTs are signed with HS256 using a secret key from settings:
```

The token payload contains the username in the `sub` claim and an expiration time. Token lifetime is configured via
`ACCESS_TOKEN_EXPIRE_MINUTES` (default: 30 minutes).
`ACCESS_TOKEN_EXPIRE_MINUTES` (default: 24 hours / 1440 minutes).
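
As a hedged illustration of what issuing such a token might look like with PyJWT (the setting names and values here are assumptions, not the app's actual code):

```python
# Hypothetical sketch, not the project's implementation: build an HS256
# JWT with the username in `sub` and an expiry derived from the setting.
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

SECRET_KEY = "change-me"            # loaded from settings in the real app
ACCESS_TOKEN_EXPIRE_MINUTES = 1440  # 24 hours, per the default above

def create_access_token(username: str) -> str:
    expires = datetime.now(timezone.utc) + timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    return jwt.encode({"sub": username, "exp": expires}, SECRET_KEY, algorithm="HS256")
```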

### CSRF Validation

12 changes: 10 additions & 2 deletions docs/architecture/overview.md
@@ -1,8 +1,16 @@
# Architecture overview

In this file, you can find broad description of main components of project architecture.
Preciser info about peculiarities of separate components (SSE, Kafka topics, DLQ, ..) are in
the [Components](../components/dead-letter-queue.md) section.
For details on specific components, see:

- [SSE Architecture](../components/sse/sse-architecture.md)
- [Dead Letter Queue](../components/dead-letter-queue.md)
- [Kafka Topics](kafka-topic-architecture.md)
- [Workers](../components/workers/pod_monitor.md)

!!! note "Event Streaming"
Kafka event streaming is disabled by default (`ENABLE_EVENT_STREAMING=false`). Set this to `true` in your
environment to enable the full event-driven architecture.

## System overview

16 changes: 12 additions & 4 deletions docs/architecture/rate-limiting.md
@@ -66,6 +66,10 @@ The platform ships with default rate limits organized by endpoint group. Higher
Execution endpoints have the strictest limits since they spawn Kubernetes pods. The catch-all API rule (priority 1)
applies to any endpoint not matching a more specific pattern.

!!! note "WebSocket rule"
The `/api/v1/ws` pattern is reserved for future WebSocket support. The platform currently uses Server-Sent Events
(SSE) for real-time updates via `/api/v1/events/*`.

## Middleware Integration

The `RateLimitMiddleware` intercepts all HTTP requests, extracts the user identifier, and checks against the configured
@@ -135,10 +139,14 @@ Configuration is cached in Redis for 5 minutes to reduce database load while all

Rate limiting is controlled by environment variables:

| Variable | Default | Description |
|---------------------------|--------------|---------------------------------------|
| `RATE_LIMIT_ENABLED` | `true` | Enable/disable rate limiting globally |
| `RATE_LIMIT_REDIS_PREFIX` | `ratelimit:` | Redis key prefix for isolation |
| Variable | Default | Description |
|---------------------------|------------------|------------------------------------------------------|
| `RATE_LIMIT_ENABLED` | `true` | Enable/disable rate limiting globally |
| `RATE_LIMIT_REDIS_PREFIX` | `rate_limit:` | Redis key prefix for isolation |
| `RATE_LIMIT_ALGORITHM` | `sliding_window` | Algorithm to use (`sliding_window` or `token_bucket`)|
| `RATE_LIMIT_DEFAULT_REQUESTS` | `100` | Default request limit |
| `RATE_LIMIT_DEFAULT_WINDOW` | `60` | Default window in seconds |
| `RATE_LIMIT_BURST_MULTIPLIER` | `1.5` | Burst multiplier for token bucket |

The system gracefully degrades when Redis is unavailable—requests are allowed through rather than failing closed.
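
To make the fail-open idea concrete, here is a hedged sketch using a simplified fixed-window counter (the platform defaults to `sliding_window`; function and key names are illustrative):

```python
# Illustrative only: count requests per key in Redis and allow traffic
# through whenever Redis errors out, mirroring the fail-open behavior.
import redis.asyncio as aioredis

async def is_allowed(r: aioredis.Redis, key: str, limit: int, window_s: int) -> bool:
    try:
        count = await r.incr(f"rate_limit:{key}")
        if count == 1:
            await r.expire(f"rate_limit:{key}", window_s)  # start the window
        return count <= limit
    except Exception:
        return True  # fail open: never reject because Redis is unavailable
```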

49 changes: 38 additions & 11 deletions docs/architecture/runtime-registry.md
@@ -77,23 +77,50 @@ The example scripts intentionally use features that may not work on older versio
compatibility. For instance, Python's match statement (3.10+), Node's `Promise.withResolvers()` (22+), and Go's
`clear()` function (1.21+).
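
As a quick illustration of one such feature, Python's structural pattern matching only parses on 3.10+, so running it on an older runtime fails immediately (this snippet is illustrative, not one of the project's example scripts):

```python
# Requires Python 3.10+: `match` is a syntax error on older versions.
def describe(value: object) -> str:
    match value:
        case []:
            return "empty list"
        case [x]:
            return f"one item: {x}"
        case {"name": name}:
            return f"mapping named {name}"
        case _:
            return "something else"

print(describe([42]))  # one item: 42
```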

## API Endpoint
## API Endpoints

The `/api/v1/languages` endpoint returns the available runtimes:
The runtime information is available via two endpoints:

### GET /api/v1/k8s-limits

Returns resource limits and supported runtimes:

```json
{
  "cpu_limit": "1000m",
  "memory_limit": "128Mi",
  "cpu_request": "1000m",
  "memory_request": "128Mi",
  "execution_timeout": 300,
  "supported_runtimes": {
    "python": {"versions": ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"], "file_ext": "py"},
    "node": {"versions": ["18", "20", "22"], "file_ext": "js"},
    "ruby": {"versions": ["3.1", "3.2", "3.3"], "file_ext": "rb"},
    "go": {"versions": ["1.20", "1.21", "1.22"], "file_ext": "go"},
    "bash": {"versions": ["5.1", "5.2", "5.3"], "file_ext": "sh"}
  }
}
```

### GET /api/v1/example-scripts

Returns example scripts for each language:

```json
{
  "python": {"versions": ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"], "file_ext": "py"},
  "node": {"versions": ["18", "20", "22"], "file_ext": "js"},
  "ruby": {"versions": ["3.1", "3.2", "3.3"], "file_ext": "rb"},
  "go": {"versions": ["1.20", "1.21", "1.22"], "file_ext": "go"},
  "bash": {"versions": ["5.1", "5.2", "5.3"], "file_ext": "sh"}
  "scripts": {
    "python": "# Python example script...",
    "node": "// Node.js example script...",
    "ruby": "# Ruby example script...",
    "go": "package main...",
    "bash": "#!/bin/bash..."
  }
}
```
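
A minimal client sketch for both endpoints (httpx, the base URL, and TLS handling are assumptions; only the paths and response shapes come from this page):

```python
# Hypothetical usage sketch, not project code.
import httpx

BASE = "https://localhost"  # assumed deployment URL

limits = httpx.get(f"{BASE}/api/v1/k8s-limits", verify=False).json()
print(limits["supported_runtimes"]["python"]["versions"])

examples = httpx.get(f"{BASE}/api/v1/example-scripts", verify=False).json()
print(examples["scripts"]["python"])
```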

## Key Files

| File | Purpose |
|----------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|
| [`runtime_registry.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/runtime_registry.py) | Language specifications and runtime config generation |
| [`api/routes/languages.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/languages.py) | API endpoint for available languages |
| File | Purpose |
|----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
| [`runtime_registry.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/runtime_registry.py) | Language specifications and runtime config generation |
| [`api/routes/execution.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/execution.py) | API endpoints including k8s-limits and example-scripts |
8 changes: 8 additions & 0 deletions docs/components/dead-letter-queue.md
@@ -88,3 +88,11 @@ If sending to DLQ fails (extremely rare - would mean Kafka is down), the produce

The system is designed to be resilient but not perfect. In catastrophic scenarios, you still have Kafka's built-in durability and the ability to replay topics from the beginning if needed.

## Key files

| File | Purpose |
|--------------------------------------------------------------------------------------------------------------------------|------------------------|
| [`dlq_processor.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/workers/dlq_processor.py) | DLQ processor worker |
| [`manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/dlq/manager.py) | DLQ management logic |
| [`unified_producer.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/events/unified_producer.py) | `send_to_dlq()` method |
| [`dlq.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/api/routes/dlq.py) | Admin API routes |
7 changes: 7 additions & 0 deletions docs/components/saga/resource-allocation.md
@@ -104,3 +104,10 @@ if active_count >= 100: # <- adjust this value
```

Future improvements could make this configurable per-language or dynamically adjustable based on cluster capacity.

## Key files

| File | Purpose |
|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
| [`execution_saga.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/saga/execution_saga.py) | Saga with allocation step |
| [`resource_allocation_repository.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/repositories/resource_allocation_repository.py) | MongoDB operations |
8 changes: 8 additions & 0 deletions docs/components/schema-manager.md
@@ -33,3 +33,11 @@ During API startup, the `lifespan` function in `dishka_lifespan.py` gets the dat
To force a specific MongoDB migration to run again, delete its document from `schema_versions`. To start fresh, point the app at a new database. Migrations are designed to be additive; the system doesn't support automatic rollbacks. If you need to undo a migration in production, you'll have to drop indexes or modify validators manually.
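
As a hedged sketch of that first operation with PyMongo (the database name and the document's key field are assumptions; check the actual document shape before deleting anything):

```python
# Illustrative only: remove one migration record so it re-runs on startup.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["integr8scode"]
db["schema_versions"].delete_one({"version": "003"})  # hypothetical key/value
```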

For Kafka schemas, the registry keeps all versions. If you break compatibility and need to start over, delete the subject from the registry (either via REST API or the registry's UI if available) and let the app re-register on next startup.

## Key files

| File | Purpose |
|--------------------------------------------------------------------------------------------------------------------------------|----------------------------|
| [`schema_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/db/schema/schema_manager.py) | MongoDB migrations |
| [`schema_registry.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/events/schema/schema_registry.py) | Kafka Avro serialization |
| [`dishka_lifespan.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/dishka_lifespan.py) | Startup initialization |
7 changes: 7 additions & 0 deletions docs/components/sse/execution-sse-flow.md
@@ -9,3 +9,10 @@ The SSE router maintains a small pool of Kafka consumers and routes only the eve
Using `result_stored` as the terminal signal removes artificial waiting. Earlier iterations ended the SSE stream on `execution_completed`/`failed`/`timeout` and slept on the server to "give Mongo time" to commit. That pause is unnecessary once the stream ends only after the result processor confirms persistence.

This approach preserves clean attribution and ordering. The coordinator enriches pod creation commands with user information so pods are labeled correctly. The pod monitor converts Kubernetes phases into domain events. Timeout classification is deterministic: any pod finishing with `reason=DeadlineExceeded` results in an `execution_timeout` event. The result processor is the single writer of terminal state, so the UI never races the database — when the browser sees `result_stored`, the result is already present.
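
A hedged client-side sketch of this contract, treating `result_stored` as the terminal event (the exact stream path and payload shape are assumptions; the event name and `/api/v1/events/*` prefix come from these docs):

```python
# Illustrative SSE reader: stream events and return once persistence is
# confirmed, i.e. on `result_stored`.
import json

import httpx

def wait_for_result(execution_id: str) -> dict:
    url = f"https://localhost/api/v1/events/executions/{execution_id}"  # assumed path
    with httpx.stream("GET", url, timeout=None) as resp:
        event = None
        for line in resp.iter_lines():
            if line.startswith("event:"):
                event = line.split(":", 1)[1].strip()
            elif line.startswith("data:") and event == "result_stored":
                return json.loads(line.split(":", 1)[1])
    raise RuntimeError("stream ended before result_stored")
```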

## Related docs

- [SSE Architecture](sse-architecture.md) — overall SSE design and components
- [SSE Partitioned Router](sse-partitioned-architecture.md) — consumer pool and scaling
- [Result Processor](../workers/result_processor.md) — terminal event handling
- [Pod Monitor](../workers/pod_monitor.md) — Kubernetes event translation
9 changes: 9 additions & 0 deletions docs/components/sse/sse-architecture.md
@@ -82,3 +82,12 @@ Key metrics: `sse.connections.active`, `sse.messages.sent.total`, `sse.connectio
## Why not WebSockets?

WebSockets were initially implemented but removed: SSE is sufficient for server-to-client communication, offers simpler connection management, has better proxy compatibility (many corporate proxies block WebSockets), enjoys excellent browser support with automatic reconnection, and works well with HTTP/2 multiplexing.

## Key files

| File | Purpose |
|-----------------------------------------------------------------------------------------------------------------------------------|------------------------|
| [`sse_service.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/sse_service.py) | Client connections |
| [`kafka_redis_bridge.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/kafka_redis_bridge.py) | Kafka-to-Redis routing |
| [`redis_bus.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/redis_bus.py) | Redis pub/sub wrapper |
| [`sse_shutdown_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/sse_shutdown_manager.py) | Graceful shutdown |
7 changes: 7 additions & 0 deletions docs/components/sse/sse-partitioned-architecture.md
@@ -33,3 +33,10 @@ Memory management uses configurable buffer limits: max size, TTL for expiration,
The `SSEShutdownManager` complements the router. The router handles the data plane (Kafka consumers, event routing to Redis). The shutdown manager handles the control plane (tracking connections, coordinating graceful shutdown, notifying clients).

When the server shuts down, SSE clients receive a shutdown event so they can display messages and attempt reconnection. The shutdown manager implements phased shutdown: notify clients, wait for graceful disconnection, force-close remaining connections. SSE connections register with the shutdown manager and monitor a shutdown event while streaming from Redis. When shutdown triggers, connections send shutdown messages and close gracefully.
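
A minimal sketch of the phased sequence described above (class and method names are illustrative, not the actual `SSEShutdownManager` API):

```python
# Illustrative three-phase shutdown: notify, wait for graceful close,
# then force-close the remainder.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Conn:
    closed: asyncio.Event = field(default_factory=asyncio.Event)

    async def notify_shutdown(self) -> None:
        ...  # send an SSE "shutdown" event to this client

async def phased_shutdown(conns: list[Conn], grace_s: float = 5.0) -> None:
    for c in conns:                     # phase 1: notify clients
        await c.notify_shutdown()
    try:                                # phase 2: wait for graceful close
        await asyncio.wait_for(
            asyncio.gather(*(c.closed.wait() for c in conns)), grace_s
        )
    except asyncio.TimeoutError:        # phase 3: force-close stragglers
        for c in conns:
            c.closed.set()
```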

## Key files

| File | Purpose |
|-----------------------------------------------------------------------------------------------------------------------------------|----------------------|
| [`kafka_redis_bridge.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/kafka_redis_bridge.py) | Consumer pool router |
| [`sse_shutdown_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/sse/sse_shutdown_manager.py) | Connection tracking |
74 changes: 74 additions & 0 deletions docs/components/workers/coordinator.md
@@ -0,0 +1,74 @@
# Coordinator

The coordinator owns admission and queuing policy for executions. It decides which executions can proceed based on
available resources and enforces per-user limits to prevent any single user from monopolizing the system.

```mermaid
graph LR
Kafka[(Kafka)] --> Coord[Coordinator]
Coord --> Queue[Priority Queue]
Coord --> Resources[Resource Pool]
Coord --> Accepted[Accepted Events]
Accepted --> Kafka
```

## How it works

When an `ExecutionRequested` event arrives, the coordinator checks:

1. Is the queue full? (max 10,000 pending)
2. Has this user exceeded their limit? (max 100 concurrent)
3. Are there enough CPU and memory resources?

If all checks pass, the coordinator allocates resources and publishes `ExecutionAccepted`. Otherwise, the request
is either queued for later or rejected.
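
A condensed sketch of these checks (field names, and which failures queue versus reject, are assumptions beyond the three rules above; defaults mirror the resource table below):

```python
# Illustrative admission decision, not the actual coordinator code.
from dataclasses import dataclass, field

@dataclass
class AdmissionState:
    queue_len: int = 0
    user_active: dict[str, int] = field(default_factory=dict)
    free_cpu: float = 32 * 1.2          # pool with 20% overcommit
    free_mem_mb: float = 65_536 * 1.2

    def admit(self, user: str, cpu: float, mem_mb: float) -> str:
        if self.queue_len >= 10_000:
            return "rejected"           # queue full
        if self.user_active.get(user, 0) >= 100:
            return "queued"             # per-user limit hit
        if cpu > self.free_cpu or mem_mb > self.free_mem_mb:
            return "queued"             # wait for resources to free up
        self.free_cpu -= cpu
        self.free_mem_mb -= mem_mb
        return "accepted"               # publish ExecutionAccepted
```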

The coordinator runs a background scheduling loop that continuously pulls from the priority queue and attempts to
schedule pending executions as resources become available.

## Priority queue

Executions are processed in priority order. Lower numeric values are processed first:

```python
--8<-- "backend/app/services/coordinator/queue_manager.py:14:19"
```

When resources are unavailable, executions are requeued with a reduced priority value; since lower values are processed first, repeatedly deferred executions move forward rather than starving.
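
A minimal model of this ordering with `heapq` (names are illustrative, not the project's queue manager):

```python
# Min-heap: lower priority values dequeue first; a counter keeps FIFO
# order within the same priority level.
import heapq
import itertools

_counter = itertools.count()
_heap: list[tuple[int, int, str]] = []

def enqueue(priority: int, execution_id: str) -> None:
    heapq.heappush(_heap, (priority, next(_counter), execution_id))

def requeue(priority: int, execution_id: str) -> None:
    enqueue(max(priority - 1, 0), execution_id)  # aging: sorts earlier next time

def dequeue() -> str:
    return heapq.heappop(_heap)[2]

enqueue(5, "exec-a")
enqueue(1, "exec-b")
assert dequeue() == "exec-b"  # lower value first
```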

## Resource management

The coordinator tracks a pool of CPU and memory resources:

| Parameter | Default | Description |
|---------------------------|---------|----------------------------|
| `total_cpu_cores` | 32 | Total CPU pool |
| `total_memory_mb` | 65,536 | Total memory pool (64GB) |
| `overcommit_factor` | 1.2 | Allow 20% overcommit |
| `max_queue_size` | 10,000 | Maximum pending executions |
| `max_executions_per_user` | 100 | Per-user limit |
| `stale_timeout_seconds` | 3,600 | Stale execution timeout |

## Topics

- **Consumes**: `execution_events` (requested, completed, failed, cancelled)
- **Produces**: `execution_events` (accepted)

## Key files

| File | Purpose |
|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| [`run_coordinator.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/workers/run_coordinator.py) | Entry point |
| [`coordinator.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/coordinator.py) | Main coordinator service |
| [`queue_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/queue_manager.py) | Priority queue implementation |
| [`resource_manager.py`](https://github.com/HardMax71/Integr8sCode/blob/main/backend/app/services/coordinator/resource_manager.py) | Resource pool and allocation |

## Deployment

```yaml
coordinator:
build:
dockerfile: workers/Dockerfile.coordinator
```

Usually runs as a single replica. Leader election via Redis is available if scaling is needed.
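
A hedged sketch of what Redis-based leader election can look like (the key name, lease length, and loop cadence are all assumptions):

```python
# Illustrative lease-based election: one instance holds a short-lived
# Redis key and renews it; others idle until the lease lapses.
import asyncio

import redis.asyncio as aioredis

async def coordinator_loop(instance_id: str) -> None:
    r = aioredis.Redis()
    while True:
        if await r.set("coordinator:leader", instance_id, nx=True, ex=15):
            pass                                      # acquired the lease
        elif await r.get("coordinator:leader") == instance_id.encode():
            await r.expire("coordinator:leader", 15)  # renew our lease
        else:
            await asyncio.sleep(5)                    # someone else leads
            continue
        # ... perform one scheduling pass as leader ...
        await asyncio.sleep(5)
```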