NRI (Non-Resident Indian) taxation is complicated. India has tax treaties with 90+ countries, rules change frequently, and bad advice costs real money. NRIGPT is a Retrieval-Augmented Generation system we built to make NRI tax guidance accessible, accurate, and cited, serving thousands of users at a YC W22 startup.
Why RAG?
Pure LLMs hallucinate tax advice. That's not 'good enough' — it's dangerous. RAG solves this by grounding every answer in actual source documents. The model doesn't guess; it retrieves the relevant tax law, treaty provision, or CBDT circular, then generates an answer with citations.
- Tax law changes frequently — RAG lets us update the knowledge base without retraining
- Accuracy is non-negotiable — citations let users verify every claim
- Domain expertise is scarce — the system makes expert-level knowledge accessible
Architecture
Document Ingestion Pipeline
The knowledge base contains 10,000+ documents: Income Tax Act sections, DTAA treaties, CBDT circulars, ITAT rulings, and expert commentary. Each document type requires a different chunking strategy (a routing sketch follows the list):
- Legal documents: Section-aware chunking (don't split mid-clause)
- Treaties: Article-level chunking with country metadata
- Circulars: Date-aware chunking (newer supersedes older)
- Rulings: Holding-first chunking (conclusion, then reasoning)
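The ingestion code itself isn't shown in this post; the sketch below illustrates how type-aware chunking can be routed, with regex-based splitters standing in for the real parsers. The function names, the `doc_type`/`country` fields, and the routing table are all illustrative assumptions.

```python
import re
from typing import Callable

def chunk_statute(text: str) -> list[dict]:
    """Section-aware chunking: split only at 'Section N' headings so a
    chunk never ends mid-clause."""
    parts = re.split(r"(?=^Section\s+\d+[A-Z]*\b)", text, flags=re.MULTILINE)
    return [{"text": p.strip(), "doc_type": "statute"} for p in parts if p.strip()]

def chunk_treaty(text: str, country: str) -> list[dict]:
    """Article-level chunking with country metadata attached to every chunk."""
    parts = re.split(r"(?=^Article\s+\d+\b)", text, flags=re.MULTILINE)
    return [{"text": p.strip(), "doc_type": "treaty", "country": country}
            for p in parts if p.strip()]

# Routing table: document type -> chunker. Circulars (date-aware) and rulings
# (holding-first) would get their own chunkers in the same style.
CHUNKERS: dict[str, Callable[..., list[dict]]] = {
    "statute": chunk_statute,
    "treaty": chunk_treaty,
}
```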
Embedding Strategy
We tested 4 embedding models. Key finding: domain-specific fine-tuning on tax vocabulary improved retrieval precision by 23% over general-purpose embeddings. The production system uses a fine-tuned model with 768-dimensional embeddings.
Chunking strategy: 512-token chunks with 50-token overlap. We experimented with larger chunks (1024 tokens) but found that smaller chunks with metadata produced better retrieval precision — the model needs the right paragraph, not the right page.
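The post doesn't name the tokenizer or splitter; this minimal sketch shows just the windowing arithmetic for 512-token chunks with a 50-token overlap, applied to an already tokenized document.

```python
def sliding_window_chunks(tokens: list[str],
                          chunk_size: int = 512,
                          overlap: int = 50) -> list[list[str]]:
    """Fixed-size windows with overlap, so context that straddles a chunk
    boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```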
Retrieval Optimization
Naive vector similarity retrieves plausible-looking but wrong documents. Our pipeline:
- Hybrid search: BM25 (keyword) plus vector similarity; BM25 catches exact legal terms, vectors catch semantic matches (see the fusion sketch after this list)
- Reranking: a cross-encoder reranker scores the top-50 candidates and passes the top 5 to generation
- Metadata filtering: country, tax year, and document type filters narrow the search space before embedding search
- Query expansion: tax-specific synonyms (e.g., 'DTAA' ↔ 'Double Taxation Avoidance Agreement' ↔ 'tax treaty')
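The post doesn't say how the two ranked lists are merged; reciprocal rank fusion is one common choice and is what this sketch assumes (the production system may combine scores differently, and the doc IDs below are made up).

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs (e.g. BM25 results and vector
    results) into one ranking without calibrating their score scales."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative doc IDs only: one keyword ranking, one semantic ranking.
bm25_hits = ["sec_90", "sec_90A", "dtaa_us_art_15"]
vector_hits = ["dtaa_us_art_15", "sec_90", "circular_04_2023"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits])[:3])
```

The fused list then goes to the cross-encoder reranker, which keeps the top 5 for generation.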
Answer Generation
```python
def generate_answer(query: str, user_context: dict) -> Answer:
    # Retrieve relevant documents
    candidates = hybrid_search(query, filters=user_context)
    ranked = reranker.score(query, candidates)[:5]

    # Generate with citations
    prompt = build_prompt(
        query=query,
        context=ranked,
        instructions="Cite every claim. Use [1], [2] notation. "
                     "If unsure, say so. Never fabricate tax advice.",
    )
    response = llm.generate(prompt)

    # Verify citations actually exist in sources
    verified = citation_checker.validate(response, ranked)
    return verified
```
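`hybrid_search`, `reranker`, `build_prompt`, `llm`, `citation_checker`, and the `Answer` type are the system's own components and are not defined in the post; the hypothetical call below only illustrates the intended contract, and the `Answer` fields it reads are assumptions.

```python
answer = generate_answer(
    query="Is my US salary taxable in India if I return mid-year?",
    user_context={"country": "USA", "tax_year": "2023-24"},
)
print(answer.text)       # grounded answer with [1], [2] markers (assumed field)
print(answer.citations)  # only citations that passed verification (assumed field)
```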
Challenges We Solved
Contradictory Sources
Tax law is full of contradictions — a CBDT circular might override an earlier ruling, or a treaty might supersede domestic law. We built a precedence engine: treaties > statutes > circulars > rulings > commentary. The system surfaces the most authoritative source and flags conflicts.
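The precedence engine itself isn't shown; this sketch applies the stated ordering to a set of conflicting chunks (the numeric ranks and the `doc_type` field are assumptions).

```python
# Lower rank = more authoritative: treaties > statutes > circulars > rulings > commentary.
PRECEDENCE = {"treaty": 0, "statute": 1, "circular": 2, "ruling": 3, "commentary": 4}

def resolve(conflicting_chunks: list[dict]) -> tuple[dict, list[dict]]:
    """Return the most authoritative chunk plus the lower-ranked ones to flag."""
    ordered = sorted(conflicting_chunks, key=lambda c: PRECEDENCE[c["doc_type"]])
    return ordered[0], ordered[1:]
```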
Citation Accuracy
The model sometimes generates plausible-looking but incorrect citations (Section 90A instead of Section 90). Our citation checker validates every reference against the source index. If a citation can't be verified, it's flagged — not shipped.
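The checker isn't shown; this sketch captures one part of the idea, confirming that every section reference in the answer actually appears in the retrieved source text. The regex and data shapes are assumptions, and the real checker validates references against the full source index.

```python
import re

SECTION_RE = re.compile(r"Section\s+\d+[A-Z]*", re.IGNORECASE)

def unverified_sections(answer_text: str, sources: list[str]) -> set[str]:
    """Section references cited in the answer but absent from every retrieved
    source; anything returned here is flagged, not shipped."""
    cited = {m.title() for m in SECTION_RE.findall(answer_text)}
    grounded = {m.title() for src in sources for m in SECTION_RE.findall(src)}
    return cited - grounded
```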
Hallucination Management
Three layers of hallucination prevention:
- Grounding: Every claim must trace to a retrieved document
- Confidence scoring: Low-confidence answers get a 'consult a CA' disclaimer (sketched after this list)
- Adversarial testing: We maintain a test suite of 500+ tricky queries designed to trigger hallucinations
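The post doesn't describe how confidence is computed; this sketch shows only the gating step, assuming a score in [0, 1] derived from retrieval and reranker signals (the 0.7 threshold is illustrative).

```python
CA_DISCLAIMER = (
    "This answer is low-confidence. Please consult a chartered accountant "
    "before acting on it."
)

def gate(answer_text: str, confidence: float, threshold: float = 0.7) -> str:
    """Append the 'consult a CA' disclaimer when confidence falls below the threshold."""
    if confidence < threshold:
        return f"{answer_text}\n\n{CA_DISCLAIMER}"
    return answer_text
```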
Production Numbers
| Metric | Value |
|---|---|
| Documents indexed | 10,000+ |
| Average query latency | <3 seconds |
| Citation accuracy | 97%+ (verified against source) |
| User queries served | Thousands monthly |
| Knowledge base updates | Weekly (new circulars, rulings) |
Key Learnings
- Chunking is everything. Bad chunks = bad retrieval = bad answers. We spent more time on chunking strategy than on model selection.
- Hybrid search is non-negotiable. Vector-only misses exact legal terms. BM25-only misses semantic intent. You need both.
- Evaluation is the hard part. Building the RAG pipeline took 3 weeks. Building the evaluation framework took 4 weeks. You can't improve what you can't measure.
Need a production RAG system? We've built one that serves thousands of users with 97%+ citation accuracy. Let's talk about your use case — reach out to our team.