{"id":2923,"date":"2025-11-11T01:14:08","date_gmt":"2025-11-11T01:14:08","guid":{"rendered":"https:\/\/blog.samarthya.me\/wps\/?p=2923"},"modified":"2025-11-11T01:14:09","modified_gmt":"2025-11-11T01:14:09","slug":"building-a-practical-retrieval-augmented-personal-assistant-rag","status":"publish","type":"post","link":"https:\/\/blog.samarthya.me\/wps\/2025\/11\/11\/building-a-practical-retrieval-augmented-personal-assistant-rag\/","title":{"rendered":"Building a Practical Retrieval-Augmented Personal Assistant (RAG)"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/rag.png\" alt=\"\" class=\"wp-image-2924\" srcset=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/rag.png 1024w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/rag-150x150@2x.png 300w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/rag-150x150.png 150w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/rag-300x300@2x.png 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Everything discussed here is available in my <a href=\"https:\/\/github.com\/samarthya\/rag-one.git\">GitHub repository<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote has-black-color has-luminous-vivid-amber-background-color has-text-color has-background has-link-color has-medium-font-size wp-elements-ba020744ad3999745424ef7d2371b279\"><blockquote><p>with <code>Ollama<\/code>, <code>LangChain<\/code>, and <code>Chroma<\/code><\/p><\/blockquote><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1-why-i-built-this\">1. Why I Built This<\/h2>\n\n\n\n<p>Large Language Models are powerful, but they hallucinate and forget your private knowledge. 
We set out to build a small, local Retrieval-Augmented Generation (RAG) assistant that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Lets us query our own documents (PDFs, text files, Word, Excel) safely.<\/li>\n\n\n\n<li>Keeps responses grounded in actual source chunks.<\/li>\n\n\n\n<li>Runs locally (LLM + embeddings via Ollama) inside WSL while Ollama itself runs on Windows.<\/li>\n\n\n\n<li>Is easy to iterate: add documents, re-index, ask questions\u2014no model fine-tuning.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2-what-is-rag-in-plain-terms\">2. What Is RAG (In Plain Terms)<\/h2>\n\n\n\n<p>RAG glues together two stages:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Retrieval: Find the&nbsp;<strong>most relevant<\/strong>&nbsp;pieces of text for a user question (using embeddings + vector similarity).<\/li>\n\n\n\n<li>Generation: Ask the LLM to answer using only those retrieved chunks as context.<\/li>\n<\/ol>\n\n\n\n<p>Why? It keeps answers current and factual without stuffing all your data into the model weights.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3-core-objectives-mapped-to-implementation\">3. 
Core Objectives Mapped to Implementation<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Objective<\/th><th>Implementation<\/th><\/tr><\/thead><tbody><tr><td>Multi-format document ingestion<\/td><td>Specialized loaders in&nbsp;<code>src\/document_processor.py<\/code><\/td><\/tr><tr><td>Precise context selection<\/td><td>Chunking (RecursiveCharacterTextSplitter) with overlap<\/td><\/tr><tr><td>Efficient semantic search<\/td><td>Embeddings (<code>nomic-embed-text<\/code>&nbsp;via Ollama) + Chroma vector store<\/td><\/tr><tr><td>Grounded answers + sources<\/td><td>RetrievalQA chain in&nbsp;<code>src\/rag_engine.py<\/code>&nbsp;returns chunks + metadata<\/td><\/tr><tr><td>Local-only operation<\/td><td>WSL Python code calls Windows Ollama via IP (<code>OLLAMA_BASE_URL<\/code>&nbsp;in&nbsp;<code>src\/config.py<\/code>)<\/td><\/tr><tr><td>Configurability<\/td><td>Central knobs (<code>CHUNK_SIZE<\/code>,&nbsp;<code>CHUNK_OVERLAP<\/code>,&nbsp;<code>TOP_K_RESULTS<\/code>, model names) in&nbsp;<code>src\/config.py<\/code><\/td><\/tr><tr><td>Defensive robustness<\/td><td>None guards around vector store &amp; QA chain initialization<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">What is chunking (in RAG)?<\/h3>\n\n\n\n<p>Chunking is the process of splitting larger documents into smaller, manageable text units (chunks) before generating embeddings. Instead of embedding an entire PDF or a huge page, you embed bite\u2011sized segments. At query time you retrieve only the most relevant chunks (not whole documents) and feed them to the LLM. 
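<\/p>\n\n\n\n<p>To make the mechanics concrete, here is a deliberately naive, illustrative chunker. It only shows the fixed-size-with-overlap idea behind <code>CHUNK_SIZE<\/code> \/ <code>CHUNK_OVERLAP<\/code>; the real pipeline uses LangChain\u2019s <code>RecursiveCharacterTextSplitter<\/code>, which additionally prefers natural separator boundaries:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Illustrative only: fixed-size chunking with overlap.\n# The actual pipeline uses RecursiveCharacterTextSplitter instead.\ndef chunk_text(text, chunk_size=1000, chunk_overlap=200):\n    step = chunk_size - chunk_overlap\n    return [text[i:i + chunk_size]\n            for i in range(0, max(len(text) - chunk_overlap, 1), step)]\n\n# Consecutive chunks share chunk_overlap characters, so a sentence\n# straddling a boundary appears whole in at least one chunk.\nchunks = chunk_text(\"x\" * 2500)  # 3 chunks: 0-1000, 800-1800, 1600-2500\n<\/code><\/pre>\n\n\n\n<p>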
<\/p>\n\n\n\n<p>This improves:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Precision: You avoid stuffing irrelevant paragraphs into the prompt.<\/li>\n\n\n\n<li>Recall: Multiple distinct, relevant parts of a long document can surface independently.<\/li>\n\n\n\n<li>Token efficiency: Smaller pieces fit comfortably into the model\u2019s context window.<\/li>\n\n\n\n<li>Grounding quality: Less unrelated filler reduces hallucination risk.<\/li>\n<\/ul>\n\n\n\n<p>Typical trade\u2011offs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Chunks too large \u2192 waste context space; retrieval brings in lots of irrelevant text.<\/li>\n\n\n\n<li>Chunks too small \u2192 you lose semantic coherence (fragmented sentences, missing context).<\/li>\n\n\n\n<li>Overlap smooths boundaries so meaning crossing chunk edges isn\u2019t lost.<\/li>\n<\/ul>\n\n\n\n<p>In my <a href=\"https:\/\/raw.githubusercontent.com\/samarthya\/rag-one\/refs\/heads\/main\/src\/config.py\">config.py<\/a>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>CHUNK_SIZE = 1000<\/code> (characters)<\/li>\n\n\n\n<li><code>CHUNK_OVERLAP = 200<\/code> (characters)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What is RecursiveCharacterTextSplitter?<\/h3>\n\n\n\n<p><code>RecursiveCharacterTextSplitter<\/code>&nbsp;(from LangChain) is a hierarchical splitter. You give it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A list of separator candidates (e.g. <code>[\"\\n\\n\", \"\\n\", \" \", \"\"]<\/code>)<\/li>\n\n\n\n<li>A target <code>chunk_size<\/code> and <code>chunk_overlap<\/code>.<\/li>\n<\/ul>\n\n\n\n<p>It splits on the coarsest separator first (paragraph breaks); any piece still larger than <code>chunk_size<\/code> is re-split recursively with the next, finer separator (line breaks, then spaces, then single characters). This keeps chunks semantically coherent wherever possible while still enforcing the size limit.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"4-high-level-architecture\">4. 
High-Level Architecture<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"811\" height=\"1024\" src=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/image-1-811x1024.png\" alt=\"\" class=\"wp-image-2927\" srcset=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/image-1-811x1024.png 811w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/image-1-238x300.png 238w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/image-1.png 1180w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/11\/image-1-238x300@2x.png 476w\" sizes=\"(max-width: 811px) 100vw, 811px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"5-the-ingestion-pipeline\">5. The Ingestion Pipeline<\/h2>\n\n\n\n<p>The ingestion pipeline is the end\u2011to\u2011end process that transforms raw documents into a searchable semantic index used later for retrieval. In plain terms: it takes files, cleans and segments them, converts each meaningful chunk into an embedding, and stores those embeddings so queries can efficiently find relevant context.<\/p>\n\n\n\n<p>Implemented in&nbsp;<code>src\/document_processor.py<\/code>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Load: Select a loader by file extension (<code>PyPDFLoader<\/code>,&nbsp;<code>TextLoader<\/code>,&nbsp;<code>UnstructuredWordDocumentLoader<\/code>, <code>Excel<\/code> via&nbsp;<code>openpyxl<\/code>).<\/li>\n\n\n\n<li>Normalize: Wrap content into&nbsp;<code>Document<\/code>&nbsp;objects with&nbsp;<code>metadata['source']<\/code>&nbsp;(and sheet names for Excel).<\/li>\n\n\n\n<li>Chunk: Split with&nbsp;<code>RecursiveCharacterTextSplitter<\/code>&nbsp;using:\n<ul class=\"wp-block-list\">\n<li><code>CHUNK_SIZE = 1000<\/code><\/li>\n\n\n\n<li><code>CHUNK_OVERLAP = 200<\/code>; overlap prevents awkward sentence splits.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Embed: Each chunk is converted to a vector via Ollama embeddings 
(<code>nomic-embed-text<\/code>).<\/li>\n\n\n\n<li>Persist: Store vectors in Chroma located at&nbsp;<code>data\/vectorstore\/<\/code>&nbsp;(avoid recomputation next run).<\/li>\n<\/ol>\n\n\n\n<p>Public entry point:&nbsp;<code>process_all_documents()<\/code>&nbsp;returns stats (files processed, chunks created).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"6-retrieval--answering\">6. Retrieval &amp; Answering<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Retrieval: Think of it as asking a librarian to pull the most relevant index cards from your library for your question. We don\u2019t hand the AI the whole book; we hand it just the best snippets.<\/li>\n\n\n\n<li>Answering: The AI reads those snippets and writes an answer that sticks to what\u2019s in them. If the snippets don\u2019t contain the answer, it should say so.<\/li>\n<\/ul>\n\n\n\n<p>Implemented in&nbsp;<code>src\/rag_engine.py<\/code>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Converts Chroma vector store to a retriever (<code>vectorstore.as_retriever(k=TOP_K_RESULTS)<\/code>).<\/li>\n\n\n\n<li>Builds a prompt template combining:\n<ul class=\"wp-block-list\">\n<li>System instructions (<code>SYSTEM_PROMPT<\/code>)<\/li>\n\n\n\n<li>Retrieved chunk text<\/li>\n\n\n\n<li>User question<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Runs a LangChain&nbsp;<code>RetrievalQA<\/code>&nbsp;chain with chain type&nbsp;<code>\"stuff\"<\/code>&nbsp;(simple concatenation).<\/li>\n\n\n\n<li>Returns answer + source metadata (filenames \/ pages \/ sheets) + raw context chunks.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"7-configuration-strategy\">7. 
Configuration Strategy<\/h2>\n\n\n\n<p>Centralized in&nbsp;<code>src\/config.py<\/code>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Paths:&nbsp;<code>DOCUMENTS_DIR<\/code>,&nbsp;<code>VECTORSTORE_DIR<\/code><\/li>\n\n\n\n<li>Models:&nbsp;<code>LLM_MODEL = \"gpt-oss:20b\"<\/code>,&nbsp;<code>EMBEDDING_MODEL = \"nomic-embed-text\"<\/code><\/li>\n\n\n\n<li>Chunking knobs:&nbsp;<code>CHUNK_SIZE<\/code>,&nbsp;<code>CHUNK_OVERLAP<\/code><\/li>\n\n\n\n<li>Retrieval knob:&nbsp;<code>TOP_K_RESULTS = 4<\/code><\/li>\n\n\n\n<li>Optional similarity filtering stub:&nbsp;<code>SIMILARITY_THRESHOLD<\/code>&nbsp;(can be applied in&nbsp;<code>search<\/code>&nbsp;later)<\/li>\n\n\n\n<li>WSL\/Ollama IP bridging:\u00a0<code>WINDOWS_IP<\/code>\u00a0+\u00a0<code>OLLAMA_BASE_URL<\/code><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"8-wsl--ollama-integration\">8. WSL + Ollama Integration<\/h2>\n\n\n\n<p>Running Python in WSL while Ollama hosts models on Windows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Discover the Windows host IP from WSL: <code>ip route show | grep default | awk '{print $3}'<\/code><\/li>\n\n\n\n<li>Set&nbsp;<code>WINDOWS_IP<\/code>&nbsp;in&nbsp;<code>src\/config.py<\/code>.<\/li>\n\n\n\n<li>All embed &amp; generate calls go through&nbsp;<code>http:\/\/&lt;WINDOWS_IP&gt;:11434<\/code>.<\/li>\n<\/ul>\n\n\n\n<p>This avoids Docker complexity and keeps local iteration fast.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"9-defensive-coding-patterns\">9. Defensive Coding Patterns<\/h2>\n\n\n\n<p>To prevent subtle runtime errors:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Annotated&nbsp;<code>self.vectorstore: Optional[Chroma]<\/code>&nbsp;and guarded it before use.<\/li>\n\n\n\n<li>QA chain creation sits inside try\/except; a failure doesn\u2019t crash initialization.<\/li>\n\n\n\n<li>Graceful fallback answers when the knowledge base is empty.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"10-running-the-system\">10. 
Running the System<\/h2>\n\n\n\n<p>With documents in <code>DOCUMENTS_DIR<\/code> and the index built by the ingestion pipeline (<code>process_all_documents()<\/code>), launch the UI:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><em># Start the app (if you have a Streamlit UI defined in src\/app.py)<\/em>\nstreamlit run src\/app.py\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"11-concepts-along-the-way\">11. Concepts Along the Way<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Concept<\/th><th>Why It Matters Here<\/th><\/tr><\/thead><tbody><tr><td>Embeddings<\/td><td>Enable semantic similarity search instead of brittle keyword matching<\/td><\/tr><tr><td>Chunking<\/td><td>Provides granularity; improves recall &amp; reduces prompt bloat<\/td><\/tr><tr><td>Overlap<\/td><td>Maintains continuity across boundaries; prevents dropped context<\/td><\/tr><tr><td>Vector Store (Chroma)<\/td><td>Persistent, efficient similarity search layer<\/td><\/tr><tr><td>Prompt Template<\/td><td>Ensures consistent grounding and honesty (don\u2019t fabricate)<\/td><\/tr><tr><td>RetrievalQA Chain<\/td><td>Orchestrates retrieval + prompt assembly seamlessly<\/td><\/tr><tr><td>System Prompt<\/td><td>Establishes tone and factual discipline for answers<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"12-common-pitfalls--mitigations\">12. 
Common Pitfalls &amp; Mitigations<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Pitfall<\/th><th>Mitigation<\/th><\/tr><\/thead><tbody><tr><td>Empty results due to improper chunking<\/td><td>Adjust&nbsp;<code>CHUNK_SIZE<\/code>&nbsp;\/&nbsp;<code>CHUNK_OVERLAP<\/code>; re-index<\/td><\/tr><tr><td>Slow indexing for large docs<\/td><td>Run once; persists to disk; process incremental additions<\/td><\/tr><tr><td>Irrelevant retrieval<\/td><td>Reduce chunk size or increase&nbsp;<code>TOP_K_RESULTS<\/code>&nbsp;then filter low-similarity<\/td><\/tr><tr><td>Hallucinations<\/td><td>System prompt enforces \u201cIf not in documents, say so.\u201d<\/td><\/tr><tr><td>Excel ingestion failing<\/td><td>Install&nbsp;<code>openpyxl<\/code>&nbsp;before indexing spreadsheets<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"14-summary\">Summary<\/h2>\n\n\n\n<p>We built a self-contained RAG assistant:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Local, privacy-preserving.<\/li>\n\n\n\n<li>Structured for clarity: ingestion vs query-time logic.<\/li>\n\n\n\n<li>Configurable and well-documented.<\/li>\n\n\n\n<li>Guarded against uninitialized components.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The elements of information discussed here are all present in my github repository. with Ollama, LangChain, and Chroma 1. Why I Built This Large Language Models are powerful, but they hallucinate and forget your private knowledge. We set out to build a small, local Retrieval-Augmented Generation (RAG) assistant that: 2. 
What Is RAG (In Plain [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[34],"tags":[],"class_list":["post-2923","post","type-post","status-publish","format-standard","hentry","category-technical"],"_links":{"self":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2923","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/comments?post=2923"}],"version-history":[{"count":2,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2923\/revisions"}],"predecessor-version":[{"id":2930,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2923\/revisions\/2930"}],"wp:attachment":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media?parent=2923"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/categories?post=2923"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/tags?post=2923"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}