Technical Architecture

Production-grade full-stack design built for reliability, extensibility, and offline operation. Every layer of Guaardvark — from backend APIs to GPU-accelerated inference — is engineered to run entirely on your hardware without external dependencies.

System Overview

Guaardvark is a full-stack AI application with clear separation between its backend, frontend, and AI/ML layers. Rather than stitching together a collection of loosely coupled scripts and notebooks, Guaardvark is structured as a cohesive production system where each layer has well-defined responsibilities and communicates through stable interfaces. The result is a platform that is straightforward to deploy, predictable to operate, and practical to extend.

The architecture is designed around three guiding principles. First, reliability: every component in the stack has been selected for battle-tested stability in production environments. Flask, PostgreSQL, Redis, React — these are not experimental technologies. They power millions of applications worldwide and have well-understood operational characteristics. Second, extensibility: the modular blueprint architecture on the backend and component-based design on the frontend make it straightforward to add new features without modifying existing code paths. Third, offline operation: Guaardvark is built from the ground up to function without any internet connection. There are no hidden calls to external APIs, no cloud-based authentication services, and no telemetry that phones home. Once installed, the entire platform runs on your local machine or network.

This document walks through each layer of the architecture in detail, from the Python backend that handles API routing and business logic, through the React frontend that provides the user interface, to the AI and ML layer that orchestrates model inference across LLMs, vision models, audio processors, and video generators.

Backend — Python & Flask

The backend is built on Flask, a mature Python web framework that provides the right balance of flexibility and structure for a platform of this scope. Flask handles HTTP request routing, session management, and authentication, and serves as the integration point for every subsystem in the platform. The backend exposes over 70 API endpoints covering user management, document operations, chat sessions, model configuration, agent execution, image generation, video processing, audio transcription, and administrative controls.

Data persistence is managed through the SQLAlchemy ORM, with Alembic handling database migrations against PostgreSQL. SQLAlchemy provides a clean abstraction over raw SQL while still allowing complex queries when needed. Alembic ensures that schema changes are versioned and reversible, making upgrades and rollbacks predictable. The data model covers users, documents, chat conversations, agent sessions, generation history, settings, and system configuration.
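As a rough illustration of the ORM layer, the sketch below defines a hypothetical `Document` model in SQLAlchemy's declarative style. The table, column names, and the in-memory SQLite engine are stand-ins for teaching purposes, not Guaardvark's actual schema or its PostgreSQL configuration.

```python
from sqlalchemy import create_engine, Column, Integer, String, JSON
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Document(Base):
    """Illustrative model only — not the real Guaardvark schema."""
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    meta = Column(JSON)  # e.g. format, dimensions, generation parameters

# An in-memory SQLite database stands in for PostgreSQL in this sketch.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Document(title="architecture.md", meta={"format": "markdown"}))
    session.commit()
    doc = session.query(Document).filter_by(title="architecture.md").one()
```

Alembic would version any change to `Document` as a migration script, so the same upgrade path can be replayed (or reversed) on every deployment.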

Beyond the standard REST API, Guaardvark exposes a GraphQL API powered by Ariadne. GraphQL is used for operations where the frontend needs flexible, nested data retrieval — for instance, fetching a chat session along with its messages, tool calls, and associated documents in a single request. This reduces round-trips and gives the frontend precise control over the shape of the data it receives.
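The core idea GraphQL brings here — the client specifies the exact nested shape it wants and gets it in one round-trip — can be shown with a tiny stdlib-only field selector. This is an illustration of the concept, not Ariadne itself; real resolvers in Guaardvark are schema-first Ariadne bindings.

```python
def execute(selection: dict, root: dict) -> dict:
    """Return only the fields the client asked for — GraphQL's core idea.

    `selection` maps field names to nested selections (or None for scalars).
    """
    out = {}
    for field, sub in selection.items():
        value = root[field]
        if sub and isinstance(value, list):
            out[field] = [execute(sub, item) for item in value]
        elif sub:
            out[field] = execute(sub, value)
        else:
            out[field] = value
    return out

# Hypothetical chat-session record with nested messages and documents.
session = {
    "title": "Design review",
    "messages": [{"text": "Summarise the doc", "tokens": 5}],
    "documents": [{"name": "spec.pdf"}],
}
# The client asks for the title and message text only — nothing else is sent.
query = {"title": None, "messages": {"text": None}}
result = execute(query, session)
```

One request returns the session and its messages together, and fields the client did not select (here, `documents` and `tokens`) never cross the wire.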

Asynchronous workloads — model inference, video generation, batch document processing, and agent execution — are handled by Celery with Redis as the message broker. Celery workers pick up tasks from the queue and execute them in the background, keeping the main Flask process responsive. Task status, progress updates, and results are communicated back to the frontend through WebSocket connections managed by Socket.IO. This enables real-time streaming of LLM output, generation progress bars, and live agent activity feeds without polling.
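The queue-worker-progress pattern described above can be sketched with the standard library alone — a `queue.Queue` standing in for the Redis broker, a thread for a Celery worker, and an event list for the Socket.IO push channel. Task names and progress values are invented for the example.

```python
import queue
import threading

tasks: "queue.Queue" = queue.Queue()  # stands in for the Redis broker
events: list = []                     # stands in for Socket.IO pushes to clients

def worker() -> None:
    """Drain the queue like a Celery worker, emitting progress events."""
    while True:
        task = tasks.get()
        if task is None:              # sentinel: shut the worker down
            break
        for pct in (25, 50, 100):     # simulated progress of a generation job
            events.append({"task": task["name"], "progress": pct})
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
tasks.put({"name": "render_video"})   # enqueued by the Flask request handler
tasks.put(None)
t.join()
```

Because the main process only enqueues and returns, the HTTP handler stays responsive while the worker streams progress back — the same shape, at miniature scale, as Celery plus Socket.IO.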

The codebase follows a modular blueprint architecture. Each major feature area — chat, agents, image generation, video generation, documents, audio, settings — is organized as a Flask blueprint with its own routes, services, and data access layer. This separation of concerns means that adding a new feature does not require touching existing modules, and individual blueprints can be tested and deployed independently. Shared utilities like authentication middleware, error handling, and logging are centralized in common modules that all blueprints import.

Frontend — React & Vite

The frontend is a React single-page application built with Vite for fast development iteration and optimized production builds. Vite's native ES module dev server provides near-instant hot module replacement during development, while its Rollup-based build pipeline produces efficiently chunked bundles for production. The result is a responsive development experience and fast page loads for end users.

The design system is built on Material-UI (MUI), which provides a comprehensive library of pre-built, accessible components that follow consistent design patterns. MUI's theming system allows Guaardvark to maintain a unified visual language across all 28 interface pages while customizing colors, typography, and spacing to match the platform's identity. The component library includes over 100 reusable components built on top of MUI primitives, covering everything from chat message bubbles and agent activity timelines to image generation controls and video preview players.

State management uses Zustand, a lightweight alternative to Redux that provides global state stores without the ceremony of actions, reducers, and middleware chains. Zustand stores manage application-wide state like user sessions, active model configurations, and notification queues. For server-fetched data, Apollo Client handles GraphQL queries with built-in caching, optimistic updates, and automatic refetching when dependencies change.

One of the more distinctive frontend integrations is the Monaco Editor — the same editor engine that powers Visual Studio Code. Monaco is embedded in Guaardvark's code review interface, providing syntax highlighting, diff views, code folding, and IntelliSense-style completions for reviewing AI-generated code changes. This gives users a familiar, professional-grade editing experience directly within the platform rather than requiring them to switch to an external editor.

AI & ML Layer

The AI and ML layer is where Guaardvark's core intelligence capabilities live. Rather than relying on a single model or inference backend, the platform integrates multiple specialized engines that each handle a distinct modality. This multi-engine approach means that the best tool is always used for the job — a dedicated LLM runtime for text, specialized diffusion models for images, purpose-built codecs for audio, and optimized pipelines for video.

LLM Integration

Ollama for local model serving, LlamaIndex for retrieval-augmented generation, and support for multiple model families including Llama, Mistral, Qwen, Gemma, and Phi. Models are loaded on demand and unloaded when idle to conserve GPU memory.
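The load-on-demand, unload-when-idle policy can be sketched as a small pool keyed by last-use time. This is an illustrative model of the policy only — the class name, timeout, and model names are invented, and the real system delegates the actual weight loading to Ollama.

```python
import time
from typing import Optional

class ModelPool:
    """Idle-based unloading sketch (illustrative, not Guaardvark's real code)."""

    def __init__(self, idle_seconds: float) -> None:
        self.idle_seconds = idle_seconds
        self.loaded: dict = {}  # model name -> last-used timestamp

    def use(self, name: str) -> str:
        # In the real system, a miss here is where Ollama loads weights onto
        # the GPU before serving the request.
        self.loaded[name] = time.monotonic()
        return name

    def evict_idle(self, now: Optional[float] = None) -> list:
        """Unload every model that has sat idle past the threshold."""
        now = time.monotonic() if now is None else now
        idle = [m for m, t in self.loaded.items() if now - t > self.idle_seconds]
        for m in idle:
            del self.loaded[m]
        return idle

pool = ModelPool(idle_seconds=300)
pool.use("llama3")
pool.use("mistral")
# Simulate ten minutes passing: both models are now past the idle threshold.
evicted = pool.evict_idle(now=time.monotonic() + 600)
```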

Computer Vision

Stable Diffusion for image generation, CLIP for image-text understanding and semantic search, and Real-ESRGAN for image upscaling and enhancement. Vision pipelines run through a unified interface that abstracts model differences.

Audio Processing

Whisper.cpp for fast speech-to-text transcription, Piper TTS for natural text-to-speech synthesis, and FFmpeg for audio format conversion, normalization, and stream extraction. Audio pipelines support real-time streaming input and output.

Video Generation

ComfyUI for workflow-based generation pipelines, Wan2.2 and CogVideoX for text-to-video and image-to-video synthesis, and RIFE for frame interpolation and smooth slow-motion effects. Video tasks are queued to prevent GPU contention.

All AI and ML operations are coordinated through a central inference manager that tracks which models are currently loaded, how much GPU memory is available, and which tasks are queued or in progress. This coordination layer prevents the kind of out-of-memory crashes and resource conflicts that plague ad-hoc AI setups where multiple models compete for the same GPU.

Data Layer

PostgreSQL serves as the primary persistent data store, holding user accounts, document metadata, chat conversation histories, agent session logs, generation records, and system configuration. PostgreSQL was chosen for its reliability, ACID compliance, and mature support for advanced features like JSON columns, full-text search, and array types that Guaardvark uses extensively. The schema is designed for clean relational modeling with appropriate indexes for the query patterns the application uses most frequently.

Redis fills three distinct roles in the architecture. First, it serves as a fast in-memory cache for frequently accessed data like user sessions, model configuration lookups, and recently generated content thumbnails. Second, it acts as the task broker for Celery, queuing asynchronous jobs and distributing them to worker processes. Third, it provides pub/sub messaging for real-time features — when a Celery worker completes a generation task or an agent produces new output, it publishes an event through Redis that Socket.IO picks up and pushes to connected frontend clients.
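The third role — pub/sub bridging workers to connected clients — reduces to a channel-to-subscribers map. The in-process class below is a stand-in for Redis pub/sub, with invented channel and payload names; in Guaardvark the publisher is a Celery worker and the subscriber is the Socket.IO layer.

```python
from collections import defaultdict

class PubSub:
    """Minimal in-process stand-in for Redis pub/sub (illustrative only)."""

    def __init__(self) -> None:
        self.subscribers = defaultdict(list)

    def subscribe(self, channel: str, callback) -> None:
        self.subscribers[channel].append(callback)

    def publish(self, channel: str, message) -> None:
        for callback in self.subscribers[channel]:
            callback(message)

bus = PubSub()
pushed = []                                # messages Socket.IO would forward
bus.subscribe("task:done", pushed.append)  # the Socket.IO side listens
bus.publish("task:done", {"task_id": 7, "status": "complete"})  # worker side
```

The worker never knows which clients are connected; it publishes once, and every subscriber on the channel receives the event.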

Media assets — generated images, rendered videos, uploaded documents, and audio files — are stored on the local file system in an organized directory structure. File paths are tracked in PostgreSQL alongside their metadata (dimensions, duration, format, generation parameters), providing a clean separation between binary content and structured data. Vector embeddings for RAG search are stored alongside document metadata, enabling semantic similarity queries over your document collection without requiring a separate vector database service.
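Semantic similarity over stored embeddings typically means a cosine-similarity ranking. The sketch below shows that ranking over hand-made three-dimensional vectors — real embeddings have hundreds of dimensions, and the document names and values here are invented for illustration.

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity: the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (real ones come from an embedding model).
docs = {
    "intro.md": [0.9, 0.1, 0.0],
    "api.md":   [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.0]  # embedding of the user's search phrase
best = max(docs, key=lambda name: cosine(query, docs[name]))
```

Storing vectors next to the document metadata means this ranking is a query against the same store, with no separate vector-database service to operate.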

GPU Management

Running multiple AI models on a single GPU requires careful resource management. Guaardvark includes a dedicated GPU management layer that handles model lifecycle, memory budgeting, and task coordination automatically.

At startup, the system performs automatic CUDA GPU detection, identifying available GPUs, their VRAM capacity, compute capability, and current utilization. This information feeds into the VRAM budget system, which maintains a running tally of how much memory is allocated to loaded models and how much headroom remains. Before loading a new model, the system checks whether sufficient VRAM is available. If not, it identifies the least-recently-used model and unloads it to free space. This intelligent loading and unloading prevents the out-of-memory crashes that are the most common failure mode in self-hosted AI setups.
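The budget-then-evict behavior can be modeled as an LRU map of loaded models against a fixed capacity. Model names and sizes below are illustrative, and the real system tracks measured VRAM rather than declared sizes.

```python
from collections import OrderedDict

class VramBudget:
    """LRU model-unloading sketch (illustrative numbers, not measured VRAM)."""

    def __init__(self, capacity_gb: float) -> None:
        self.capacity = capacity_gb
        self.loaded = OrderedDict()  # model name -> size in GB, oldest first

    @property
    def used(self) -> float:
        return sum(self.loaded.values())

    def load(self, name: str, size_gb: float) -> list:
        """Load a model, unloading least-recently-used models to make room."""
        evicted = []
        while self.loaded and self.used + size_gb > self.capacity:
            victim, _ = self.loaded.popitem(last=False)  # oldest entry first
            evicted.append(victim)
        self.loaded[name] = size_gb
        return evicted

    def touch(self, name: str) -> None:
        self.loaded.move_to_end(name)  # mark as most recently used

gpu = VramBudget(capacity_gb=8.0)
gpu.load("llama3-8b", 5.0)
gpu.load("whisper", 2.0)
evicted = gpu.load("sdxl", 6.0)  # over budget: the LRU model is unloaded
```

Checking the budget before every load, instead of letting CUDA fail mid-inference, is what turns an out-of-memory crash into an orderly unload.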

The GPU management layer also coordinates between concurrent tasks. When an LLM inference is running, the system knows how much memory it is consuming and can defer an image generation request until the inference completes and releases its allocation. This coordination is especially important for consumer GPUs with limited VRAM, where running two large models simultaneously would exhaust memory. Tasks are prioritized based on type and user interaction — real-time chat responses take priority over background batch processing, ensuring the interface remains responsive even when the system is under heavy load.
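Prioritizing interactive work over batch work is, at its core, a priority queue with FIFO ordering within each priority level. The sketch below uses `heapq` with a monotonic counter as the tie-breaker; task names and priority values are invented for the example.

```python
import heapq
import itertools

counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
pending: list = []

def submit(priority: int, name: str) -> None:
    """Lower priority number = runs sooner; equal priorities run in order."""
    heapq.heappush(pending, (priority, next(counter), name))

submit(2, "batch_embed_documents")  # background work: low priority
submit(2, "batch_upscale_images")
submit(0, "chat_response")          # interactive work: jumps the queue
order = [heapq.heappop(pending)[2] for _ in range(len(pending))]
```

Even if chat arrives after a pile of batch jobs, it is dispatched first, which is what keeps the interface responsive under load.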

Agent Framework

Guaardvark's agent system implements the ReACT (Reason + Act) loop, a proven architecture for autonomous AI agents that decompose complex tasks into sequences of reasoning steps and tool invocations. The agent receives a task, reasons about what information it needs, selects a tool from the registry, executes it, observes the result, and repeats until the task is complete.
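The ReACT loop itself is compact enough to sketch directly. Below, a scripted function stands in for the LLM and a one-entry dict stands in for the tool registry — both are illustrative stand-ins, not Guaardvark's implementation — but the reason-act-observe cycle is the real shape of the loop.

```python
def react_loop(task, llm, tools, max_steps=5):
    """Reason -> act -> observe until the model emits a final answer."""
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm(context)                # reasoning step over full context
        if decision["action"] == "finish":
            return decision["answer"], context
        tool = tools[decision["action"]]       # act: look up tool in registry
        observation = tool(decision["input"])  # execute and observe
        context.append(f"{decision['action']}({decision['input']}) -> {observation}")
    return None, context

# Scripted stand-in for a real LLM: search first, then answer from the result.
def fake_llm(context):
    if any("search(" in line for line in context):
        return {"action": "finish", "answer": "Flask"}
    return {"action": "search", "input": "Guaardvark backend framework"}

tools = {"search": lambda q: "Guaardvark's backend is built on Flask."}
answer, trace = react_loop("What framework powers the backend?", fake_llm, tools)
```

Every observation is appended to the context before the next reasoning step, which is how the loop accumulates the session memory described below.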

The framework is built around a pluggable tool registry with over 30 registered tools spanning web research, file operations, code analysis, content generation, and system utilities. Each tool is defined by a schema that describes its name, purpose, expected inputs, and output format. Agents consult this registry at each reasoning step to select the most appropriate tool for their current sub-task. Adding new tools is as simple as dropping a tool definition file into the registry directory — the framework discovers and loads new tools automatically at startup.
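A registry like this is commonly built with a decorator that records each tool's schema at definition time. The sketch below shows that pattern; the `word_count` tool and the schema fields are invented examples, and the real framework additionally discovers tool files on disk at startup.

```python
TOOL_REGISTRY: dict = {}

def tool(name: str, description: str):
    """Decorator: register a function as an agent tool with its schema."""
    def wrap(fn):
        TOOL_REGISTRY[name] = {"description": description, "run": fn}
        return fn
    return wrap

@tool("word_count", "Count the words in a piece of text.")
def word_count(text: str) -> int:
    return len(text.split())

# An agent selects a tool by name from the registry and invokes it.
result = TOOL_REGISTRY["word_count"]["run"]("count these four words")
```

Because registration happens as a side effect of defining the function, dropping a new tool module into the registry directory and importing it is all it takes to make the tool available to every agent.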

Agent memory and context management ensure that agents maintain awareness of their full execution history within a session. Every tool call, its arguments, and its results are recorded in structured context that the agent can reference in subsequent reasoning steps. This accumulated context allows agents to build on earlier findings, avoid redundant work, and make increasingly informed decisions as a workflow progresses.

For complex tasks that span multiple domains, task routing enables multi-agent collaboration. A primary agent can decompose a task into sub-tasks and delegate each to a specialized agent configured for that domain — a research agent for information gathering, a code agent for analysis, a writing agent for synthesis. Each sub-agent operates with its own context and tool set, and results flow back to the primary agent for final coordination and assembly.

Deployment Architecture

Guaardvark supports multiple deployment configurations to match different team sizes and infrastructure constraints.

The most common configuration is single-machine deployment, where the entire stack — Flask backend, React frontend, PostgreSQL, Redis, Celery workers, and AI models — runs on a single workstation or server. This is the simplest setup and is designed for individual users or small teams who want a self-contained AI platform on their own hardware. A single machine with a modern NVIDIA GPU (8 GB+ VRAM) can run the full platform comfortably.

For teams that need shared access, Guaardvark supports multi-machine deployment via the Interconnector module. Interconnector allows multiple Guaardvark instances to discover each other on a local network, share model availability information, and route inference requests to the machine with the best available resources. This means a team of five users can each run a Guaardvark frontend on their workstation while sharing a central GPU server for heavy inference tasks.

Docker support is available for users who prefer containerized deployment. Official Docker Compose configurations bundle the full stack into a reproducible environment that can be stood up with a single command. For production deployments on bare metal or VMs, systemd service files are provided to manage the Flask application, Celery workers, and Redis as system services with automatic restart on failure, log rotation, and proper resource limits.

70+ Endpoints
28 Pages
30+ Agent Tools

Explore the Full Stack

Guaardvark is coming soon. Get notified when it launches, or explore the source on GitHub.
