Master your hardware resources. Guaardvark's dynamic GPU management system provides real-time VRAM monitoring, automatic conflict detection, and intelligent workload routing to ensure your local AI stack runs at peak performance without crashing.
Running high-performance AI models locally requires careful coordination of limited hardware resources, particularly Video RAM (VRAM). Unlike cloud environments where resources can be scaled elastically, a local workstation or server has a fixed ceiling. Guaardvark's GPU Resource Coordinator acts as an intelligent traffic controller, ensuring that different parts of the AI stack — like LLMs, image generators, and video pipelines — don't compete for the same memory at the same time.
At the core of the system is Dynamic VRAM Budgeting. Guaardvark continuously monitors your GPU's memory usage through native `nvidia-smi` integration. It understands the memory footprint of different models and operations. When you start a video generation task that requires 14GB of VRAM, the coordinator can automatically signal the LLM (running in Ollama) to offload weights to system RAM or temporarily unload, preventing an Out of Memory (OOM) error that would otherwise crash the entire application.
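The budgeting check described above can be sketched in a few lines of Python. This is an illustrative sketch, not Guaardvark's actual implementation: the 512 MB headroom value and the `fits_in_budget` helper are assumptions, though the `nvidia-smi` query flags shown are the tool's real CSV interface.

```python
import subprocess

def query_vram_mb():
    """Return (used_mb, total_mb) for GPU 0 via nvidia-smi's CSV query."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    used, total = out.splitlines()[0].split(", ")
    return int(used), int(total)

def fits_in_budget(required_mb, used_mb, total_mb, headroom_mb=512):
    """True if a task needing required_mb fits alongside current usage,
    keeping a safety headroom so the driver itself never starves."""
    return used_mb + required_mb + headroom_mb <= total_mb
```

For example, a 14 GB video task on a 24 GB card that is already using 11 GB fails this check, so the coordinator would signal the LLM to offload or unload before the task is allowed to start.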
The Guaardvark dashboard features a live GPU Status Bar that provides instant visibility into your hardware's health. You can see total VRAM capacity, current utilization percentage, and a breakdown of which services are consuming memory. This transparency is crucial for developers and researchers who need to understand how different model configurations impact system performance.
One of the biggest challenges in local AI is "GPU collision" — two different runtimes (like PyTorch and Ollama) trying to claim the same GPU device simultaneously without awareness of each other. Guaardvark solves this with a centralized Plugin Lock System. If you are using ComfyUI for video generation, Guaardvark ensures it has exclusive high-priority access to the GPU, automatically managing the lifecycle of background embedding services and LLM workers to keep the system stable.
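The idea behind a centralized lock of this kind can be sketched as follows. Everything here is a simplified assumption — the `GpuLock` class and the `pause()`/`resume()` hooks on background services are hypothetical names, not Guaardvark's public API — but it shows the core pattern: pause lower-priority workers, grant exclusive access, then restore them.

```python
import threading
from contextlib import contextmanager

class GpuLock:
    """Minimal sketch of a centralized GPU lock with priority preemption."""

    def __init__(self):
        self._lock = threading.Lock()
        self._background = []  # lower-priority services to pause

    def register_background(self, service):
        """Track a background worker (e.g. an embedding service)."""
        self._background.append(service)

    @contextmanager
    def exclusive(self):
        """Pause background services, hold the GPU exclusively, then resume."""
        for svc in self._background:
            svc.pause()
        self._lock.acquire()
        try:
            yield
        finally:
            self._lock.release()
            for svc in self._background:
                svc.resume()
```

A video-generation plugin would wrap its GPU work in `with lock.exclusive():`, guaranteeing that no registered background worker touches the device until the block exits.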
Real-time monitoring of GPU memory allocation across all AI workloads. See exactly what's consuming your VRAM.
Automatic detection when Ollama, ComfyUI, and embedding services compete for GPU resources. Prevents OOM crashes before they happen.
Intelligently route inference tasks to available GPU capacity. Prioritize time-sensitive generation over background indexing.
Detect and manage multiple CUDA GPUs. Distribute workloads across cards for maximum throughput.
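Routing across multiple cards reduces, at its simplest, to sending each new task to the device with the most free VRAM. The sketch below assumes the CSV output of `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits`; the `pick_gpu` helper is an illustrative name, not part of Guaardvark.

```python
def pick_gpu(smi_csv):
    """Given 'index, memory.free' CSV lines from nvidia-smi, return the
    device index with the most free VRAM (in MB)."""
    best_idx, best_free = None, -1
    for line in smi_csv.strip().splitlines():
        idx, free = (int(x) for x in line.split(", "))
        if free > best_free:
            best_idx, best_free = idx, free
    return best_idx

sample = "0, 2048\n1, 18432"
# pick_gpu(sample) selects GPU 1, which has far more free VRAM.
```

A coordinator could then pin the task to that card, for instance by setting `CUDA_VISIBLE_DEVICES` before launching the worker process.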
Intelligent resource management so your hardware works harder, not you.