
Building NRIGPT: A Production RAG System for NRI Taxation

How we indexed 10,000+ documents and built a production RAG system for a YC W22 startup.

Proxie Team · 9 min read

NRI taxation is complicated. India has tax treaties with 90+ countries, rules change frequently, and bad advice costs real money. NRIGPT is a Retrieval-Augmented Generation system we built to make NRI tax guidance accessible, accurate, and cited — serving thousands of users at a YC W22 startup.

Why RAG?

Pure LLMs hallucinate tax advice. That's not 'good enough' — it's dangerous. RAG solves this by grounding every answer in actual source documents. The model doesn't guess; it retrieves the relevant tax law, treaty provision, or CBDT circular, then generates an answer with citations.

  • Tax law changes frequently — RAG lets us update the knowledge base without retraining
  • Accuracy is non-negotiable — citations let users verify every claim
  • Domain expertise is scarce — the system makes expert-level knowledge accessible

Architecture

Document Ingestion Pipeline

The corpus spans 10,000+ documents: Income Tax Act sections, DTAA treaties, CBDT circulars, ITAT rulings, and expert commentary. Each document type requires different parsing (a simplified chunker sketch follows the list):

  • Legal documents: Section-aware chunking (don't split mid-clause)
  • Treaties: Article-level chunking with country metadata
  • Circulars: Date-aware chunking (newer supersedes older)
  • Rulings: Holding-first chunking (conclusion, then reasoning)
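
For example, the treaty chunker can split on "Article N" headings and attach country metadata to every chunk. The sketch below is a simplified illustration of that idea, not our production parser; the `Chunk` dataclass and the regex are stand-ins.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict  # e.g. {"doc_type": "treaty", "country": "USA", "article": "15"}

def chunk_treaty(text: str, country: str) -> list[Chunk]:
    """Article-level chunking for a DTAA treaty (illustrative, not production code)."""
    # Split on "Article N" headings so each chunk holds exactly one treaty article.
    parts = [p.strip() for p in re.split(r"(?=Article\s+\d+)", text) if p.strip()]
    chunks = []
    for part in parts:
        match = re.match(r"Article\s+(\d+)", part)
        chunks.append(Chunk(
            text=part,
            metadata={
                "doc_type": "treaty",
                "country": country,
                "article": match.group(1) if match else None,
            },
        ))
    return chunks
```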

Embedding Strategy

We tested 4 embedding models. Key finding: domain-specific fine-tuning on tax vocabulary improved retrieval precision by 23% over general-purpose embeddings. The production system uses a fine-tuned model with 768-dimensional embeddings.
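
As a rough illustration of the embedding step, assuming a sentence-transformers-style model (the model path below is a placeholder for the fine-tuned tax-domain model):

```python
from sentence_transformers import SentenceTransformer

# Placeholder path; in production this points at the 768-dim tax-domain fine-tune.
model = SentenceTransformer("path/to/tax-finetuned-embedder")

def embed_chunks(texts: list[str]):
    # Normalized vectors let cosine similarity reduce to a plain dot product later.
    return model.encode(texts, normalize_embeddings=True, batch_size=64)
```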

Chunking strategy: 512-token chunks with 50-token overlap. We experimented with larger chunks (1024 tokens) but found that smaller chunks with metadata produced better retrieval precision — the model needs the right paragraph, not the right page.
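
A minimal sketch of the sliding-window logic, using a stand-in Hugging Face tokenizer in place of the one matched to our embedding model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer

def sliding_window_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into 512-token windows with 50 tokens of overlap."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks, step = [], size - overlap
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + size]
        chunks.append(tokenizer.decode(window))
        if start + size >= len(token_ids):  # last window reached the end of the text
            break
    return chunks
```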

Retrieval Optimization

Naive vector similarity retrieves plausible-looking but wrong documents. Our pipeline layers four techniques (a simplified sketch follows the list):

  • Hybrid search: BM25 (keyword) + vector similarity; BM25 catches exact legal terms, vectors catch semantic paraphrases
  • Reranking: Cross-encoder reranker scores top-50 candidates → top-5 for generation
  • Metadata filtering: Country, tax year, document type filters narrow the search space before embedding search
  • Query expansion: Tax-specific synonyms (e.g., 'DTAA' ↔ 'Double Taxation Avoidance Agreement' ↔ 'tax treaty')
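
A simplified sketch of how the hybrid-search and reranking stages compose. The `embed` callable, the blending weight `alpha`, and the MS MARCO cross-encoder are stand-ins for the production components; metadata filtering and query expansion are elided here:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in reranker

def hybrid_search(query: str, chunks: list[str], chunk_vectors: np.ndarray,
                  embed, alpha: float = 0.5, top_k: int = 50) -> list[int]:
    """Blend BM25 keyword scores with cosine similarity over normalized vectors."""
    bm25 = BM25Okapi([c.split() for c in chunks])
    kw_scores = np.array(bm25.get_scores(query.split()))
    sem_scores = chunk_vectors @ embed(query)  # assumes unit-normalized vectors

    def norm(x):  # min-max normalize so the two score scales can be blended
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    blended = alpha * norm(kw_scores) + (1 - alpha) * norm(sem_scores)
    return np.argsort(blended)[::-1][:top_k].tolist()

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Cross-encoder rescoring of the top-50 hybrid candidates down to top-5."""
    scores = reranker.predict([(query, c) for c in candidates])
    order = np.argsort(scores)[::-1][:top_n]
    return [candidates[i] for i in order]
```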

Answer Generation

```python
def generate_answer(query: str, user_context: dict) -> Answer:
    # Retrieve relevant documents
    candidates = hybrid_search(query, filters=user_context)
    ranked = reranker.score(query, candidates)[:5]

    # Generate with citations
    prompt = build_prompt(
        query=query,
        context=ranked,
        instructions="Cite every claim. Use [1], [2] notation. "
                     "If unsure, say so. Never fabricate tax advice."
    )
    response = llm.generate(prompt)

    # Verify citations actually exist in sources
    verified = citation_checker.validate(response, ranked)
    return verified
```
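
The `build_prompt` helper referenced above can be as simple as numbering the retrieved passages so the model can cite them by index. A rough sketch, assuming each retrieved item exposes a `text` attribute:

```python
def build_prompt(query: str, context: list, instructions: str) -> str:
    """Number the source passages so the model can cite them as [1], [2], ..."""
    numbered = "\n\n".join(f"[{i + 1}] {doc.text}" for i, doc in enumerate(context))
    return (
        f"{instructions}\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {query}\n"
        f"Answer (with citations):"
    )
```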

Challenges We Solved

Contradictory Sources

Tax law is full of contradictions — a CBDT circular might override an earlier ruling, or a treaty might supersede domestic law. We built a precedence engine: treaties > statutes > circulars > rulings > commentary. The system surfaces the most authoritative source and flags conflicts.
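
In code, the precedence engine reduces to an ordering over document types plus a conflict check. The sketch below is illustrative; the `conflicts` predicate and the dict-based source representation stand in for our actual implementation:

```python
# Lower rank = higher authority: treaties > statutes > circulars > rulings > commentary.
PRECEDENCE = {"treaty": 0, "statute": 1, "circular": 2, "ruling": 3, "commentary": 4}

def resolve_sources(sources: list[dict], conflicts) -> tuple[dict, list[dict]]:
    """Pick the most authoritative source and flag lower-ranked sources that disagree.

    `sources` are dicts with a "doc_type" key; `conflicts(a, b)` is a caller-supplied
    predicate (hypothetical here) deciding whether two sources contradict each other.
    """
    ordered = sorted(sources, key=lambda s: PRECEDENCE.get(s["doc_type"], len(PRECEDENCE)))
    primary, rest = ordered[0], ordered[1:]
    return primary, [s for s in rest if conflicts(primary, s)]
```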

Citation Accuracy

The model sometimes generates plausible-looking but incorrect citations (Section 90A instead of Section 90). Our citation checker validates every reference against the source index. If a citation can't be verified, it's flagged — not shipped.
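
A stripped-down version of one such check, flagging section references in the answer that never appear in the retrieved sources (the production checker covers more reference types than this):

```python
import re

def unverified_sections(answer: str, sources: list[str]) -> list[str]:
    """Return section references cited in the answer but absent from every source."""
    cited = set(re.findall(r"Section\s+\d+[A-Z]*", answer))
    source_text = " ".join(sources)
    return sorted(s for s in cited if s not in source_text)
```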

Hallucination Management

Three layers of hallucination prevention:

  • Grounding: Every claim must trace to a retrieved document
  • Confidence scoring: Low-confidence answers get a 'consult a CA' disclaimer
  • Adversarial testing: We maintain a test suite of 500+ tricky queries designed to trigger hallucinations
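
As an illustration of the confidence-scoring layer, the gate can be as simple as thresholding a confidence proxy; using the top retrieval score as that proxy and the 0.6 threshold are assumptions made for this sketch:

```python
CA_DISCLAIMER = ("This answer is based on limited sources. "
                 "Please consult a chartered accountant before acting on it.")

def apply_confidence_gate(answer: str, retrieval_scores: list[float],
                          threshold: float = 0.6) -> str:
    """Append a disclaimer when retrieval scores suggest weak grounding."""
    confidence = max(retrieval_scores, default=0.0)  # crude proxy for answer confidence
    return answer if confidence >= threshold else f"{answer}\n\n{CA_DISCLAIMER}"
```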

Production Numbers

| Metric | Value |
| --- | --- |
| Documents indexed | 10,000+ |
| Average query latency | < 3 seconds |
| Citation accuracy | 97%+ (verified against source) |
| User queries served | Thousands monthly |
| Knowledge base updates | Weekly (new circulars, rulings) |

Key Learnings

  • Chunking is everything. Bad chunks = bad retrieval = bad answers. We spent more time on chunking strategy than on model selection.
  • Hybrid search is non-negotiable. Vector-only misses exact legal terms. BM25-only misses semantic intent. You need both.
  • Evaluation is the hard part. Building the RAG pipeline took 3 weeks. Building the evaluation framework took 4 weeks. You can't improve what you can't measure.

Need a production RAG system? We've built one that serves thousands of users with 97%+ citation accuracy. Let's talk about your use case — reach out to our team.
