🔨 AgentStudio - Pseudo-Lab 11th AI Agent Project
"Bridging the intergenerational knowledge gap with AI and sharing positive influence."
Vision-Language-Action (VLA) Agent for Automated Kiosk Interaction
Kiosk Agent is an AI system that uses Vision-Language Models (VLM) to control Android kiosk applications automatically. It interprets the on-screen interface and executes the required actions on behalf of users who find digital kiosks challenging.
- Gemini-Powered Reasoning: Supports both gemini-3-flash (high-speed) and gemini-3-pro (high-reasoning) models.
- VLA Paradigm: Seamless Vision → Language → Action workflow.
- AG-UI Protocol: Standardized agent-to-UI communication protocol via SSE.
- Multi-Framework Support: Built on LangGraph, with extensions for CrewAI and Google ADK.
- Human-in-the-Loop (HITL): Asks the user for input when subjective choices are required.
- Planning Mode: Decomposes complex requests into steps with real-time To-do tracking.
- Voice Interface: Supports TTS (CosyVoice3) and STT (Google Cloud).
- Real-time Dashboard: Live monitoring of agent status and screen interactions.
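
The AG-UI bullet above mentions SSE as the transport. As a rough illustration of what streaming an agent event over SSE involves, here is a minimal sketch of Server-Sent Events framing. The event name `STEP_STARTED`, the payload fields, and both helper functions are hypothetical placeholders, not the project's actual AG-UI schema.

```python
import json


def format_sse(event_type: str, payload: dict) -> str:
    """Serialize one agent event as a Server-Sent Events frame.

    An SSE frame is a set of "field: value" lines ended by a blank line.
    The event name and payload shape here are assumptions for illustration.
    """
    body = json.dumps({"type": event_type, **payload}, ensure_ascii=False)
    return f"event: {event_type}\ndata: {body}\n\n"


def parse_sse(frame: str) -> dict:
    """Recover the JSON payload from a single-frame SSE string."""
    for line in frame.splitlines():
        if line.startswith("data: "):
            return json.loads(line[len("data: "):])
    raise ValueError("frame has no data line")


if __name__ == "__main__":
    # Hypothetical event: the agent announces that a step has begun.
    print(format_sse("STEP_STARTED", {"step": 1}), end="")
```

A dashboard would typically consume such a stream with the browser's `EventSource` API; the blank line at the end of each frame is what tells the client the event is complete.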
AgentStudio allows you to switch between different Vision-Language Models depending on your needs.
| Provider | Model | Status | Key Advantage |
|---|---|---|---|
| Google | gemini-3-flash | ✅ Supported | Low latency and cost-efficient |
| Google | gemini-3-pro | ✅ Supported | Advanced reasoning for complex UI |
| OpenAI | gpt-4o-mini | ✅ Supported | Robust performance across various tasks |
| Google | gemma-3-27b | 🔜 Roadmap | Optimized for on-device/local privacy |
| Microsoft | Fara-7B | 🔜 Roadmap | On-device computer-use agent model |
To switch models, update your .env file:
MODEL_PROVIDER=gemini
GEMINI_MODEL=gemini-3-flash # Options: gemini-3-flash, gemini-3-pro
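
To make the effect of these two variables concrete, the following is a minimal sketch of how they could be read and validated at startup. It assumes the variables are already present in the process environment (for example, loaded from .env by the application). `ModelConfig`, `load_model_config`, and the fallback defaults are hypothetical and not taken from the project's backend, and the sketch covers only the Gemini variables shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Options listed in the .env comment above.
SUPPORTED_GEMINI_MODELS = ("gemini-3-flash", "gemini-3-pro")


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str


def load_model_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Read MODEL_PROVIDER / GEMINI_MODEL and reject unknown values early."""
    env = os.environ if env is None else env
    # Falling back to gemini / gemini-3-flash is an assumption of this sketch.
    provider = env.get("MODEL_PROVIDER", "gemini").strip().lower()
    if provider != "gemini":
        raise NotImplementedError(
            f"this sketch only covers the Gemini variables, got {provider!r}"
        )
    model = env.get("GEMINI_MODEL", "gemini-3-flash").strip()
    if model not in SUPPORTED_GEMINI_MODELS:
        raise ValueError(
            f"GEMINI_MODEL must be one of {SUPPORTED_GEMINI_MODELS}, got {model!r}"
        )
    return ModelConfig(provider=provider, model=model)


if __name__ == "__main__":
    print(load_model_config())
```

Failing fast on a misspelled model name keeps the error at startup instead of surfacing later as a failed API call in the middle of a kiosk session.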
The VLA paradigm is a continuous cycle where the agent observes, reasons, and executes.
flowchart LR
A[Screen Capture] --> B[VLM Reasoning]
B --> C[Action Decode]
C --> D[Execute ADB]
D --> E{Done?}
E -->|No| A
E -->|FINISH| F[Complete]
E -->|INTERRUPT| G[Human Input]
G --> A
| Phase | Description |
|---|---|
| Screen Capture | Captures Android device screen via ADB |
| VLM Reasoning | Gemini analyzes the screen to decide the next action |
| Action Decode | Parses VLM output into structured executable commands |
| Execute ADB | Controls the device using ADB (tap, swipe, input) |
| INTERRUPT | Triggers HITL when user intervention is required |
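
The cycle described above can be sketched as a plain Python loop. This is an illustration under stated assumptions, not the project's implementation: the four callables (`capture`, `reason`, `execute`, `ask_human`), the action dictionary shape, and the `max_steps` guard are all hypothetical. The sketch also checks FINISH and INTERRUPT before calling `execute`, on the assumption that these control actions send nothing to the device.

```python
from typing import Callable, Optional


def run_vla_loop(
    capture: Callable[[], bytes],
    reason: Callable[[bytes, Optional[str]], dict],
    execute: Callable[[dict], None],
    ask_human: Callable[[str], str],
    max_steps: int = 20,
) -> str:
    """Observe, reason, and act until the model reports FINISH."""
    hint: Optional[str] = None  # latest human answer, passed to the next reasoning step
    for _ in range(max_steps):
        screen = capture()             # Screen Capture
        action = reason(screen, hint)  # VLM Reasoning + Action Decode
        hint = None
        kind = action["action"]
        if kind == "FINISH":
            return "complete"
        if kind == "INTERRUPT":
            # Human-in-the-loop: collect an answer, then observe again.
            hint = ask_human(action["question"])
            continue
        execute(action)                # Execute ADB
    return "step_limit_reached"
```

Injecting the four steps as callables keeps the loop testable without a device or an API key; the step limit is a safety net against a model that never emits FINISH.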
We manage the agent's logic flow using LangGraph for stable state transitions.
flowchart TD
START([Start]) --> VLM[VLM Node]
VLM --> EXEC[Execute Node]
EXEC --> ROUTER{Router}
ROUTER -->|LOOP| VLM
ROUTER -->|INTERRUPT| HUMAN[Human Node]
ROUTER -->|FINISH| END([End])
HUMAN -->|Resume| VLM
HUMAN -->|Abort| END
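
The routing decisions in this graph can be written as small pure functions over the agent state. The sketch below is framework-free so it runs on its own; the state keys (`last_action`, `human_decision`), node names, and function names are assumptions for illustration rather than the project's actual graph definition.

```python
from typing import Literal, TypedDict


class AgentState(TypedDict, total=False):
    last_action: str     # action kind produced by the VLM node (assumed key)
    human_decision: str  # "resume" or "abort" from the Human node (assumed key)


def route_after_execute(state: AgentState) -> Literal["LOOP", "INTERRUPT", "FINISH"]:
    """Router: decide which labelled edge to follow after the Execute node."""
    action = state.get("last_action")
    if action == "FINISH":
        return "FINISH"
    if action == "INTERRUPT":
        return "INTERRUPT"
    return "LOOP"


def route_after_human(state: AgentState) -> Literal["Resume", "Abort"]:
    """Human node: continue the loop unless the user aborted."""
    return "Abort" if state.get("human_decision") == "abort" else "Resume"


# Labelled edges from the flowchart above.
EDGES = {
    ("router", "LOOP"): "vlm",
    ("router", "INTERRUPT"): "human",
    ("router", "FINISH"): "end",
    ("human", "Resume"): "vlm",
    ("human", "Abort"): "end",
}


def next_node(current: str, state: AgentState) -> str:
    """Return the node that follows `current` for the given state."""
    if current == "vlm":
        return "execute"
    if current == "execute":
        return EDGES[("router", route_after_execute(state))]
    if current == "human":
        return EDGES[("human", route_after_human(state))]
    raise ValueError(f"no outgoing edge from {current!r}")
```

In a LangGraph graph, functions shaped like `route_after_execute` are the kind of callable usually passed to `StateGraph.add_conditional_edges`, with a mapping from the returned label to the target node.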
- Python: 3.10+ (3.11 recommended)
- Node.js: 18+ (for Dashboard)
- uv: Latest (Fast Python package manager)
- ADB: Android Debug Bridge installed
git clone https://github.com/Pseudo-Lab/Agent_Studio.git
cd Agent_Studio
# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate
# Install dependencies in editable mode
uv pip install -e backend/
cp .env.example .env
# Edit .env with your GOOGLE_API_KEY
| Action | Parameters | Description |
|---|---|---|
| CLICK | x, y | Tap specific coordinates |
| INPUT | text | Type text into a field |
| SWIPE | x1, y1, x2, y2 | Scroll or navigate |
| INTERRUPT | question | Ask user for guidance (HITL) |
| FINISH | - | Task completed successfully |
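
As an illustration of how the actions in this table might be validated and turned into device commands, here is a minimal sketch. It assumes the VLM returns a JSON object such as `{"action": "CLICK", "x": 540, "y": 1200}`; that wire format and the helper names `decode_action` and `to_adb_argv` are hypothetical. The `adb shell input tap|swipe|text` subcommands are standard ADB.

```python
import json
from typing import List, Optional

# Required parameters per action, taken from the table above.
REQUIRED_PARAMS = {
    "CLICK": ("x", "y"),
    "INPUT": ("text",),
    "SWIPE": ("x1", "y1", "x2", "y2"),
    "INTERRUPT": ("question",),
    "FINISH": (),
}


def decode_action(raw: str) -> dict:
    """Parse model output (assumed to be a JSON object) and check its fields."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    kind = action.get("action")
    if kind not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action: {kind!r}")
    missing = [name for name in REQUIRED_PARAMS[kind] if name not in action]
    if missing:
        raise ValueError(f"{kind} is missing parameters: {missing}")
    return action


def to_adb_argv(action: dict) -> Optional[List[str]]:
    """Build an adb command line; returns None for INTERRUPT and FINISH."""
    kind = action["action"]
    if kind == "CLICK":
        return ["adb", "shell", "input", "tap",
                str(int(action["x"])), str(int(action["y"]))]
    if kind == "SWIPE":
        coords = [str(int(action[k])) for k in ("x1", "y1", "x2", "y2")]
        return ["adb", "shell", "input", "swipe", *coords]
    if kind == "INPUT":
        # `input text` expects spaces encoded as %s. Other shell
        # metacharacters would need extra escaping (not handled here).
        return ["adb", "shell", "input", "text", action["text"].replace(" ", "%s")]
    return None  # control actions send nothing to the device
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine with a connected device. Validating before building the command means a malformed model response fails in Python rather than as an opaque ADB error.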
- LangGraph-based VLA Agent loop.
- Support for Gemini 3 Flash/Pro.
- Planning Mode & HITL system.
- Real-time Dashboard via AG-UI Protocol.
- Gemma Integration: Support for lightweight, on-device local models.
- Microsoft Agent Framework: Semantic Kernel & Azure AI Agent Service integration.
- Google ADK: Native Gemini Agent Framework support.
- CrewAI: Multi-agent collaboration workflows.
| Name | Role | Focus |
|---|---|---|
| Jaehyun Kim | Builder | Frontend (Next.js), Backend (FastAPI) |
| Seunghyeok Kim | Runner | LangGraph, Reasoning, Prompt Engineering |
| Gyumin Lee | Runner | VLA Mechanism, LangGraph Architecture |
| Minjung Jeon | Runner | Voice (TTS/STT), Google ADK |
This project is licensed under the Apache License 2.0.