Feature

VOICE CHAT

Speak to your AI and hear it respond — entirely on your hardware. Guaardvark's voice chat combines Whisper.cpp speech-to-text with Piper TTS for fully private, offline voice interaction with zero cloud dependency.

Private Voice Interaction

Guaardvark's voice chat gives you a complete spoken conversation loop with your local AI — and none of it ever touches the internet. Every syllable you speak is transcribed on your own CPU. Every word the AI speaks back is synthesized on your own hardware. There is no cloud relay, no remote API, no third-party recording service sitting between you and your assistant.

Real-Time Speech-to-Text with Whisper.cpp

At the heart of voice input is Whisper.cpp, the highly optimized C++ port of OpenAI's Whisper model. Unlike cloud-based transcription services that stream your audio to distant servers, Whisper.cpp runs entirely locally. Your microphone audio is captured, chunked, and fed directly into the model running on your machine. Transcription happens in real time with impressive accuracy across dozens of languages, and the resulting text is handed off to the LLM for processing. Because Whisper.cpp is compiled to native code and optimized for CPU inference, it achieves fast transcription speeds even without a dedicated GPU — making it well-suited for laptops, edge devices, and air-gapped workstations alike.
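
A streaming transcriber cannot wait for the whole recording: captured audio is cut into fixed windows before being handed to the model. The sketch below shows one plausible chunking scheme (the function name and window/overlap defaults are illustrative, not Guaardvark's actual code); a small overlap lets the transcriber recover words that straddle a window boundary.

```python
def chunk_pcm(samples, window_s=5.0, overlap_s=0.5, rate=16_000):
    """Split a mono PCM stream into overlapping windows for streaming STT.

    The overlap region is transcribed twice; duplicated text is
    deduplicated downstream. Window and overlap sizes are illustrative.
    """
    window = int(window_s * rate)            # samples per window
    step = window - int(overlap_s * rate)    # hop between window starts
    return [samples[i:i + window] for i in range(0, len(samples), step)]
```

Each returned chunk can then be fed to a Whisper.cpp worker as soon as it is full, so transcription keeps pace with speech instead of trailing the end of the utterance.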

Natural Text-to-Speech with Piper TTS

Once the LLM has generated its response, Piper TTS converts that text into natural-sounding spoken audio. Piper is a fast neural text-to-speech engine that supports multiple voice models, each trained to produce clear and expressive speech at near real-time speeds. You can choose from a variety of voices — male, female, different accents and speaking styles — and switch between them on the fly. Piper runs entirely on your local hardware, so the audio never leaves your device. The result is a voice assistant that sounds genuinely natural without the privacy cost of a cloud-hosted synthesis engine.
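
One common way to keep synthesis feeling responsive is to split the LLM's reply into sentence-sized pieces, so the first sentence can be spoken while later ones are still being generated. A minimal splitter along those lines (the function name is hypothetical, not Guaardvark's API) might look like:

```python
import re

def split_for_tts(text: str) -> list[str]:
    """Break a response into sentences for incremental synthesis.

    Splitting on sentence-ending punctuation followed by whitespace lets
    the TTS engine start speaking the first sentence immediately.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

Each piece is then handed to the synthesis engine in order, keeping time-to-first-audio low even for long answers.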

The Full Conversation Flow

The voice chat pipeline is seamless: you speak into your microphone, Whisper.cpp transcribes your speech to text, the local LLM processes your query and generates a response, and Piper TTS reads that response back to you aloud. The entire loop — speak, transcribe, reason, synthesize — executes on your hardware with no network round-trip. Round-trip overhead is measured in milliseconds rather than the seconds you would spend waiting on a cloud API, and total response time is bounded only by your own hardware. And because every component is local, the system works identically whether you are connected to the internet, working on a plane, or operating inside a classified network with no outbound access whatsoever.
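
Structurally, one conversational turn is just three stages composed in sequence. The sketch below captures that shape with stand-in callables (the function and parameter names are illustrative; `transcribe`, `generate`, and `synthesize` stand in for the Whisper.cpp, LLM, and Piper stages respectively):

```python
def voice_turn(audio, transcribe, generate, synthesize):
    """One turn of the speak -> transcribe -> reason -> synthesize loop.

    Every stage is a local callable, so the whole turn runs without a
    network round-trip.
    """
    text = transcribe(audio)        # Whisper.cpp: audio -> user text
    reply = generate(text)          # local LLM: user text -> response text
    return text, synthesize(reply)  # Piper: response text -> audio
```

Because each stage is an ordinary local function, the same loop runs identically online, offline, or air-gapped.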

Wake Word Detection

Guaardvark supports always-listening wake word detection so you can activate voice chat hands-free. Simply say "Hey Guaardvark" and the system springs to life, ready to accept your spoken query. This is not a cloud-based wake word service that streams ambient audio to remote servers — the wake word listener runs as a lightweight local process that monitors your microphone input for the activation phrase and nothing else.
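
A typical local wake word listener scores each short audio frame for how closely it matches the activation phrase, then fires only when the score stays high for several consecutive frames. The sketch below shows that gating logic under assumed inputs (per-frame confidence scores in [0, 1]; the function name and defaults are illustrative):

```python
def detect_wake(scores, sensitivity=0.5, frames_required=3):
    """Fire when per-frame keyword confidence stays above a threshold.

    Higher sensitivity lowers the threshold (easier to trigger);
    requiring consecutive frames suppresses one-off spikes from noise.
    Returns the frame index at which activation fires, or -1.
    """
    threshold = 1.0 - sensitivity
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score >= threshold else 0
        if run >= frames_required:
            return i
    return -1
```

The `sensitivity` knob here is the same trade-off the configurable-sensitivity setting exposes: a high value triggers effortlessly in a quiet room, a low value demands a deliberate, clearly spoken phrase.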

Hands-Free Operation

Wake word activation is ideal for environments where your hands are occupied — workshops, labs, kitchens, server rooms, or anywhere you need quick access to information without reaching for a keyboard. Ask a question while soldering a circuit board, request a recipe conversion while your hands are covered in flour, or check system status while physically inspecting hardware. The voice interface makes your AI assistant accessible in situations where a traditional keyboard-and-mouse interaction would be impractical or impossible.

Configurable Sensitivity

The wake word listener offers configurable sensitivity to balance responsiveness against false activations. In a quiet home office you can set sensitivity high for effortless activation. In a noisy workshop or shared space, lower the sensitivity so the system only responds to deliberate, clearly spoken wake phrases. The detection engine runs as a low-resource background process that consumes minimal CPU — it will not interfere with your GPU-intensive model inference, image generation, or other Guaardvark workloads running on the same machine.

Voice-Controlled Features

Voice chat is not limited to simple question-and-answer interactions. Guaardvark exposes a range of platform features through voice commands, turning your spoken words into powerful system actions.

Chat & Q&A

Ask questions in natural language and receive spoken answers from your local LLM. Carry on multi-turn conversations, request explanations, brainstorm ideas, or get summaries of complex topics — all through voice alone.

Media Control

Control media playback with your voice. Play, pause, skip tracks, adjust volume, or queue up content without touching a mouse. Ideal for background music while working or reviewing generated audio content.

System Commands

Launch platform features, check GPU status, start image generation jobs, or query system health — all by speaking. Voice commands integrate with Guaardvark's internal API so you can operate the platform entirely hands-free.

Dictation

Use continuous voice-to-text for document creation. Dictate notes, draft emails, write documentation, or compose long-form content. Whisper.cpp transcribes your speech in real time with punctuation and formatting support.

Architecture

The voice chat system is built on a stack of open-source, locally running components carefully integrated for performance and reliability. Each piece of the pipeline has been chosen for its ability to run efficiently on consumer hardware without sacrificing quality.

Whisper.cpp

Whisper.cpp is the efficient C++ port of OpenAI's Whisper automatic speech recognition model. It is optimized for CPU-based inference with support for AVX2, NEON, and other SIMD instruction sets, enabling fast transcription even on machines without a dedicated GPU. Multiple model sizes are available — from the tiny model for ultra-fast transcription on constrained devices, to the large-v3 model for maximum accuracy across languages. Guaardvark handles model selection automatically based on your available hardware resources.
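
Automatic model selection can be as simple as picking the largest model whose memory footprint fits comfortably in available RAM. The sketch below illustrates one such policy; the model sizes are approximate ggml file sizes, and the headroom factor and selection rule are assumptions, not Guaardvark's actual heuristic:

```python
# Approximate on-disk sizes of ggml Whisper models, in MB (illustrative).
WHISPER_MODELS = [
    ("large-v3", 3100),
    ("medium", 1500),
    ("small", 488),
    ("base", 148),
    ("tiny", 78),
]

def pick_model(free_ram_mb: int, headroom: float = 2.0) -> str:
    """Pick the largest model whose size, scaled by a headroom factor
    for activations and buffers, fits in available RAM."""
    for name, size_mb in WHISPER_MODELS:
        if size_mb * headroom <= free_ram_mb:
            return name
    return "tiny"  # always have a working fallback
```

On an 8 GB machine this policy selects large-v3 for maximum accuracy; on a constrained edge device it falls back to smaller, faster models.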

Piper TTS

Piper TTS is a fast, local neural text-to-speech system that produces high-quality speech from text input. It supports dozens of voices across multiple languages, each packaged as a compact ONNX model that runs efficiently on CPU. Voice models range from 15MB to 75MB, making them practical to store and swap even on devices with limited disk space. Synthesis latency is typically under 200 milliseconds for short utterances, keeping the conversational flow natural and responsive.

WebSocket Streaming

Audio data flows between the browser interface and the backend voice services over WebSocket connections, enabling low-latency bidirectional streaming. Microphone audio is streamed to Whisper.cpp in real time, and synthesized speech is streamed back to the browser for immediate playback — no file uploads, no polling, no waiting for complete audio files to transfer.
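
A common convention for this kind of bidirectional stream is to carry raw audio in binary WebSocket frames and control messages (transcripts, voice selection, status) in JSON text frames. The routing sketch below shows that split; the message shapes are illustrative, not Guaardvark's actual wire schema:

```python
import json

def decode_frame(frame):
    """Route an incoming WebSocket frame by its payload type.

    Binary frames carry raw PCM audio for the STT stage; text frames
    carry JSON control messages. Returns a (kind, payload) pair.
    """
    if isinstance(frame, (bytes, bytearray)):
        return ("audio", bytes(frame))
    msg = json.loads(frame)
    return (msg["type"], msg.get("payload"))
```

Keeping audio in binary frames avoids base64 overhead, and keeping control traffic in JSON keeps it easy to extend without touching the audio path.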

FFmpeg Integration

FFmpeg handles audio preprocessing and format conversion behind the scenes. Incoming microphone audio is normalized, resampled to the format Whisper.cpp expects (16kHz mono WAV), and cleaned up for optimal transcription accuracy. Outgoing TTS audio is encoded to the appropriate format for browser playback. FFmpeg's battle-tested reliability ensures consistent audio quality regardless of the input device or browser being used.
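
The conversion step described above maps to a short ffmpeg invocation. A sketch of building that command (the wrapper function is hypothetical; the ffmpeg flags themselves are standard):

```python
def ffmpeg_resample_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command converting any input recording into the
    16 kHz mono 16-bit PCM WAV layout Whisper.cpp expects."""
    return [
        "ffmpeg", "-y",          # overwrite the output without prompting
        "-i", src,               # input: whatever the browser captured
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]
```

For example, `ffmpeg_resample_cmd("mic.webm", "mic.wav")` yields a command that normalizes a browser-captured WebM/Opus recording into transcription-ready WAV.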

Why Local Voice Matters

Voice data is among the most personal information you can produce. Your voice carries biometric signatures, emotional patterns, and intimate conversational content. Sending that data to a cloud service means trusting a third party with something deeply private — and hoping they handle it responsibly. Guaardvark eliminates that trust requirement entirely.

With local voice processing, your audio never leaves your device. There is no remote server receiving your microphone feed. There is no recording stored on someone else's infrastructure. There is no possibility of a data breach exposing your voice recordings, because those recordings simply do not exist outside your own machine. For organizations handling sensitive information — legal conversations, medical discussions, classified briefings — this is not a convenience feature. It is a requirement.

Local voice also means no subscription fees. Cloud-based speech-to-text and TTS services typically charge per minute of audio processed or per character synthesized. Those costs add up quickly for heavy users — researchers transcribing hours of interviews, customer support teams processing thousands of calls, or individuals who simply prefer to interact with their AI by voice throughout the day. Guaardvark's voice features have no per-use cost. Once the models are downloaded, you can transcribe and synthesize as much audio as your hardware can handle.

And crucially, local voice works without internet. Field researchers in remote locations, travelers on flights without Wi-Fi, military personnel in forward operating bases, and enterprises operating in restricted network environments all need voice AI that functions regardless of connectivity. Guaardvark's voice chat operates identically whether your machine is connected to the internet or completely air-gapped. There is no degraded mode, no fallback to text-only, and no apologetic "voice features unavailable offline" message. The full voice pipeline works everywhere your hardware goes.

Finally, there are no usage limits on transcription or synthesis. Cloud providers impose rate limits, daily quotas, and usage caps that throttle your workflow at the worst possible moments. With Guaardvark, the only limit is your hardware's processing capacity. Transcribe an entire day's worth of meetings. Generate hours of spoken content. Run the voice assistant continuously as a background service. The system is yours to use without restriction.

Whisper.cpp Speech Recognition
Piper TTS Voice Synthesis
Zero Cloud Latency

Ready to Talk?

Guaardvark is preparing for public release. Get notified when voice chat goes live, or explore the source on GitHub.

Get Notified · GitHub