Feature · Workspace
Documents
Upload PDFs, Markdown files, code, and transcripts. They're automatically parsed, chunked, embedded, and indexed into a LlamaIndex vector store so the agent can retrieve them semantically in any future conversation.
Multi-Format Parsing
The enhanced_file_processor.py utility handles PDF, DOCX, Excel, CSV, HTML, Markdown, and plain text. Text is extracted from PDFs with the layout preserved. Excel and CSV files are converted to structured text. Code files are indexed with syntax awareness. Whatever format your source material is in, it enters the vector store as clean, retrievable text.
Hybrid Retrieval
hybrid_rag_pipeline.py combines BM25 keyword search with vector semantic search. BM25 excels at finding documents that contain specific technical terms, version numbers, or proper nouns. Vector search excels at conceptual proximity. Combining them produces retrieval that is both precise and semantically aware, and it typically outperforms either approach alone.
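One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: the page does not say which fusion method hybrid_rag_pipeline.py uses, and the function name, the document IDs, and the constant k=60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from Guaardvark.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: BM25 favours the exact version string,
# vector search favours conceptually similar passages.
bm25_hits = ["release_notes_v2.3", "api_reference", "changelog"]
vector_hits = ["migration_guide", "release_notes_v2.3", "api_reference"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they rise above documents that only one retriever found.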
Three Indexing Strategies
Choose from simple indexing (standard chunked vector indexing), entity indexing (extracts named entities and indexes their relationships), or metadata indexing (enriches chunks with document-level metadata). Entity indexing is useful for legal and scientific documents where relationships between named concepts matter. Metadata indexing is useful when you need to filter by author, date, or source.
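To make the metadata strategy concrete, here is a minimal sketch of enriching chunks with document-level metadata and then filtering on it. The Chunk class, both function names, and the metadata keys are hypothetical; this is not Guaardvark's actual data model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def enrich(chunks, doc_metadata):
    """Copy document-level metadata onto every chunk of that document.

    Chunk-level keys win when both levels define the same key.
    """
    return [Chunk(c.text, {**doc_metadata, **c.metadata}) for c in chunks]


def filter_chunks(chunks, author=None, since=None):
    """Keep chunks that match the optional author and minimum date."""
    kept = []
    for c in chunks:
        if author is not None and c.metadata.get("author") != author:
            continue
        # Chunks without a date are treated as older than any cutoff.
        if since is not None and c.metadata.get("date", date.min) < since:
            continue
        kept.append(c)
    return kept


spec = enrich([Chunk("intro"), Chunk("limits", {"page": 2})],
              {"author": "Lee", "date": date(2024, 5, 1)})
memo = enrich([Chunk("old memo")],
              {"author": "Kim", "date": date(2023, 1, 1)})

recent = filter_chunks(spec + memo, since=date(2024, 1, 1))
print([c.text for c in recent])
```

Filtering before similarity scoring narrows the candidate set, which is the practical benefit of carrying author, date, or source on each chunk.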
Desktop-Metaphor UI
The Documents surface uses a desktop metaphor with draggable folder windows (DocumentsDesktop), a right-click context menu, breadcrumb navigation, and a grid of file and folder icons. It behaves like a familiar file manager rather than a list of database records, making it comfortable for users who think in terms of folders and files rather than queries and indices.
What it does
Your knowledge base, automatically maintained
The Documents surface is the primary way to get external knowledge into Guaardvark's RAG pipeline. Upload a file and the indexing pipeline runs automatically in four stages. enhanced_file_processor.py parses the content. enhanced_rag_chunking.py splits it into overlapping chunks sized for the embedding model. embedding_router.py generates embeddings via Ollama (CPU) or the gpu_embedding plugin (GPU). unified_index_manager.py stores the vectors in the LlamaIndex flat-file store at data/cache/. From that point on, the content is live in every chat session that uses RAG mode, with no additional configuration required.
The chunking strategy is designed to maximise retrieval quality. enhanced_rag_chunking.py uses intelligent chunking with configurable overlap to ensure that context at chunk boundaries isn't lost. Chunk size is set based on the embedding model's optimal input length. The entity-based indexing strategy (entity_indexing_service.py) additionally extracts named entities and their relationships, building a graph structure alongside the vector index that enables more precise retrieval for knowledge-dense documents like legal contracts or scientific literature.
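The effect of overlap is easiest to see on a small example. The sketch below splits a list of tokens into fixed-size windows that share a configurable number of tokens with their neighbour. It is a simplified illustration, not the logic in enhanced_rag_chunking.py: the function name is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the termination clause applies after thirty days of written notice".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last token of the previous one, so a phrase that straddles a boundary still appears intact in at least one chunk when the overlap is large enough to cover it.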
The bulk import capability handles large document sets efficiently. The BulkImportDocumentsPage and bulk_import_service accept a folder path or a ZIP archive and process the contents as a Celery job with progress tracking. For teams migrating a large document corpus from another tool, this means the entire migration can be queued as a background task and monitored from the dashboard without the UI locking up. The llx files upload command handles individual uploads from the terminal, including piped content from other tools.
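As a rough illustration of the first step of a ZIP import, the sketch below lists the archive members that look indexable. It is not the bulk_import_service implementation: the function name and the extension allowlist are assumptions, and the allowlist covers only a subset of the formats named on this page.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Illustrative subset of extensions; the real service may accept more.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".csv", ".html", ".md", ".txt"}


def importable_members(archive):
    """Return sorted member names in a ZIP whose extension is supported."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            info.filename
            for info in zf.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix.lower() in SUPPORTED
        )


# Build a small archive in memory so the example runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docs/spec.pdf", b"placeholder")
    zf.writestr("docs/notes.MD", b"placeholder")
    zf.writestr("img/logo.png", b"placeholder")
buffer.seek(0)

print(importable_members(buffer))
```

In a background job, a list like this would give the total used for progress reporting before any file is parsed.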
Under the hood
Ingestion pipeline. A file upload hits upload_api.py, which calls unified_upload_service.py to coordinate the pipeline. enhanced_file_processor.py handles format detection and text extraction. indexing_service.py is the primary entry point for LlamaIndex ingestion: it calls enhanced_rag_chunking.py for chunking, embedding_router.py for embeddings, and unified_index_manager.py for vector store writes. Large files are indexed as a Celery task; small files are indexed synchronously and return immediately. The Document SQLAlchemy model records the file path, index status, chunk count, embedding model used, and indexing strategy, so you can see the indexing health of every document from the UI.
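The split between synchronous and background indexing can be pictured as a size check that also produces the initial document record. The sketch below uses a plain dataclass whose fields mirror the ones listed above. The size cutoff, the status strings, the default model name, and every identifier are hypothetical; the page does not state the real values.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the actual threshold is not documented here.
SYNC_LIMIT_BYTES = 1_000_000


@dataclass
class DocumentRecord:
    file_path: str
    index_status: str
    chunk_count: int
    embedding_model: str
    indexing_strategy: str


def plan_indexing(file_path, size_bytes, strategy="simple",
                  embedding_model="example-embedding-model"):
    """Decide how a file would be indexed and create its initial record."""
    mode = "sync" if size_bytes <= SYNC_LIMIT_BYTES else "background"
    status = "indexing" if mode == "sync" else "queued"
    record = DocumentRecord(file_path, status, 0, embedding_model, strategy)
    return mode, record


mode, record = plan_indexing("data/uploads/technical_spec.pdf", 25_000_000, "entity")
print(mode, record.index_status)
```

Keeping the status on the record is what lets a UI show indexing health without querying the task queue directly.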
Query-time retrieval. When a RAG-mode chat message arrives, hybrid_rag_pipeline.py runs a BM25 keyword search and a vector semantic search in parallel, then merges and re-ranks the results via advanced_retrieval_strategies.py, which supports MMR for diversity, HyDE for hypothetical document expansion, and reranking for precision. query_cache.py checks for a cache hit before the LLM is called. Cache hits return in milliseconds; cold queries complete in 300–800 ms depending on index size and hardware. RAG debug endpoints (GET /api/rag-debug/) are available when GUAARDVARK_RAG_DEBUG=1 is set, and let you inspect retrieval quality, chunk content, and reranking scores for any query.
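Of the strategies named above, MMR (maximal marginal relevance) is the simplest to show in a few lines. The sketch below is a generic pure-Python version over toy two-dimensional vectors, not the code in advanced_retrieval_strategies.py; the function names, the lambda_ parameter, and the vectors are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, candidates, top_k, lambda_=0.5):
    """Select top_k IDs, trading query relevance against redundancy.

    candidates maps document ID -> embedding vector. lambda_=1.0 ranks by
    relevance only; lower values penalise similarity to earlier picks.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def score(doc_id):
            relevance = cosine(query_vec, remaining[doc_id])
            redundancy = max(
                (cosine(remaining[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        del remaining[best]
    return selected


query = [1.0, 0.0]
docs = {
    "a": [1.0, 0.10],   # most relevant
    "b": [1.0, 0.12],   # near-duplicate of "a"
    "c": [0.7, -0.7],   # less relevant, but different content
}
print(mmr(query, docs, top_k=2, lambda_=0.5))
print(mmr(query, docs, top_k=2, lambda_=1.0))
```

With lambda_=0.5 the near-duplicate is skipped in favour of the more distinct passage, while lambda_=1.0 falls back to plain relevance ordering.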
# Upload a document via CLI
llx files upload technical_spec.pdf
# Trigger indexing with entity strategy
curl -X POST http://localhost:5000/api/indexing/index -H "Content-Type: application/json" -d '{"file_id": "doc_abc", "strategy": "entity"}'
# Semantic search across indexed documents
llx search "hybrid retrieval latency characteristics"
# Debug RAG retrieval for a query
# (GUAARDVARK_RAG_DEBUG=1 must be set for the server process; prefixing curl with it only changes curl's own environment)
curl "http://localhost:5000/api/rag-debug/query?q=latency+p95"
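For scripted use, the indexing call in the commands above can be reproduced from Python with the standard library alone. This sketch only builds the request: the endpoint, payload keys, and header come from the curl example, while the function name is hypothetical and the response format is not documented here.

```python
import json
import urllib.request


def build_index_request(file_id, strategy, base_url="http://localhost:5000"):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"file_id": file_id, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/indexing/index",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("doc_abc", "entity")
print(request.get_method(), request.full_url)

# To send it against a running instance:
#     with urllib.request.urlopen(request) as response:
#         print(response.status, response.read().decode())
```

Separating request construction from sending keeps the example runnable without a server and makes the payload easy to inspect.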
Use cases
What Documents enables
Internal knowledge base for teams
Upload SOPs, technical documentation, product specs, and meeting notes. Team members chat with the knowledge base in natural language: "What's our incident response process for database outages?" retrieves the relevant SOP section and answers with a citation. No keyword hunting in a document tree — semantic questions get semantic answers, grounded in the actual documents.
Legal and contract analysis
Upload contracts, NDAs, or regulatory documents. Use entity indexing to capture relationships between parties, obligations, and terms. Ask the agent to find all clauses related to IP ownership, extract all termination conditions, or compare the indemnification language across two contracts. The agent retrieves the exact clauses and cites the source document and page number in its response.
Research literature management
Index a collection of academic PDFs or technical papers. Ask comparative questions that span the collection: "Which papers address the latency-accuracy trade-off in approximate nearest-neighbour search?" The hybrid retrieval pipeline surfaces relevant passages across all indexed papers, the LLM synthesises a cross-paper answer, and citations link back to the source PDFs. Academic research that used to require reading every paper manually becomes a conversational retrieval problem.
Guaardvark Documents vs. cloud RAG services
Cloud RAG services (Notion AI, Microsoft Copilot for Documents, ChatGPT with file upload) upload your documents to their servers. For proprietary business documents, legal filings, health records, or trade secrets, that is often unacceptable. Guaardvark indexes and retrieves entirely on your hardware: documents never leave your network, the vector store lives in data/cache/ on your disk, and the LLM runs locally via Ollama. The hybrid BM25 + vector retrieval, advanced reranking strategies, and debug endpoints give you a level of visibility and control over retrieval quality that cloud services don't expose.
FAQ
Documents — common questions
What file formats are supported?
PDF, DOCX, XLSX, CSV, HTML, Markdown (.md), plain text, and most code file extensions. enhanced_file_processor.py detects the format from file content rather than from the extension, so mislabelled files are handled correctly. Audio and video transcripts (from Whisper.cpp) are treated as plain text after transcription and indexed normally.
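Content-based detection usually starts from the first bytes of a file. The sketch below shows the general idea using well-known signatures (PDF files begin with %PDF-, and DOCX and XLSX files are ZIP containers). It is an assumption-laden illustration rather than the detection logic in enhanced_file_processor.py, and the function name and return labels are hypothetical.

```python
import codecs


def sniff_format(head: bytes) -> str:
    """Guess a coarse format label from the leading bytes of a file."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX and XLSX are both ZIP archives; telling them apart
        # requires inspecting the archive members.
        return "zip-container"
    stripped = head.lstrip().lower()
    if stripped.startswith((b"<!doctype html", b"<html")):
        return "html"
    try:
        # The incremental decoder tolerates a multi-byte character cut off
        # at the end of the sample but still rejects invalid bytes.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "unknown"
    return "text"


# A PDF saved with a misleading .txt extension is still recognised.
print(sniff_format(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3"))
```

Markdown, CSV, and source code all fall under "text" here; a real processor would need further heuristics to separate them.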
How long does indexing take?
Indexing speed depends on document size, embedding model, and hardware. A 10-page PDF with Ollama CPU embeddings indexes in approximately 5–15 seconds. With the gpu_embedding plugin and a CUDA GPU, the same document indexes in under 2 seconds. Large batches (100+ documents) run as Celery background tasks and report progress via the dashboard.
Can I update a document after indexing?
Yes. Re-uploading a file with the same name triggers a re-index: the old embeddings are removed from the vector store and new ones are generated from the updated content. The document record in the database is updated with the new indexing timestamp. In-flight chat sessions that had the old document in their context will use the updated version for subsequent messages.
What happens to documents when I delete a project?
Deleting a project removes all associated document records and their embeddings from the vector store. Physical files in data/uploads/ are also deleted. This is a destructive operation: use the backup system (backup_service.py) to preserve project data before deleting.
Can I use a custom embedding model?
Yes. embedding_router.py routes to Ollama for CPU embedding and to the gpu_embedding plugin for GPU-accelerated embedding. The Ollama embedding model is configurable in Settings → Models, and any Ollama model with embedding support can be used. The gpu_embedding plugin supports HuggingFace sentence-transformers models via PyTorch.
Your documents, indexed and searchable locally
Install Guaardvark and upload your knowledge base — PDFs, notes, transcripts, code — to a local vector store that the agent can retrieve from in every conversation.