Local LLMs for Public Safety: Offline RAG System for Police SOPs

Problem Context

Police officers in India need to consult complex Standard Operating Procedures (SOPs) for cybercrime reporting. These procedures vary by jurisdiction and incident type, and are documented in large PDF manuals. The challenge: build an intelligent query system for these documents under strict data security constraints—no queries can leave the local network.

Requirements:

  • No cloud API calls (no OpenAI, Anthropic, or Google)
  • Must run on standard Windows hardware found in police stations
  • Hallucinations must be minimized (wrong answers are worse than no answer)
  • Response time should be under 5 seconds for typical queries

Initial Approach

The first prototype used a simple PDF-to-text conversion and basic keyword search. This worked for exact phrase matches but failed for:

  • Queries phrased differently than the SOP text
  • Questions requiring context from multiple sections
  • Semantic understanding ("What do I do if the victim is a minor?")

The system needed semantic search, which meant embeddings and vector storage.

Design Decisions & Trade-offs

Local LLM: Ollama with Llama 3

Why local inference:

  • Government data residency requirements prohibited external API calls
  • Response time was acceptable (3-4 seconds on mid-range hardware)
  • Model could be fine-tuned later if needed

Trade-offs:

  • Smaller models (7-8B parameters) vs cloud-scale models (GPT-4)
  • Limited context window (4096 tokens) required careful chunk management
  • VRAM constraints meant choosing between Llama 3 (better quality) and Mistral (faster)

We chose Llama 3 8B because accuracy mattered more than speed in this use case.
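To make the generation step concrete, here is a minimal sketch of calling the local Llama 3 model through Ollama's Python client. The model tag, prompt wording, and helper name are illustrative rather than the exact production values.

import ollama

def answer_query(question: str, retrieved_context: str) -> str:
    # Ask the local Llama 3 model to answer strictly from retrieved SOP text.
    response = ollama.chat(
        model="llama3:8b",  # assumed model tag for the local Llama 3 8B build
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the provided SOP context. If the context "
                    "does not contain the answer, respond with "
                    "'I don't have information on that.'"
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}",
            },
        ],
    )
    return response["message"]["content"]

Everything here runs against the local Ollama server, which in its default localhost-only configuration means no data leaves the machine.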

Vector Store: ChromaDB

ChromaDB was selected over alternatives (Pinecone, Weaviate) because:

  • Fully local: No network calls, data stored on disk
  • Simple Python API: Easy integration with existing codebase
  • Persistence: Survives process restarts without re-embedding

Schema decision: Each document chunk is stored with metadata:

metadata = {
    "source_file": "cybercrime_sop.pdf",
    "page_number": 42,
    "section": "Financial Fraud",
    "jurisdiction": "Karnataka"
}

This allows filtering by jurisdiction before semantic search, reducing irrelevant results.
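The sketch below shows how a chunk and its metadata go into a local ChromaDB collection and how the jurisdiction filter is applied at query time. The collection name, path, and ID are illustrative; the production pipeline embeds queries with nomic-embed-text and passes them as query_embeddings, while this sketch uses query_texts (Chroma's built-in default embedder) to stay short.

import chromadb

# Persistent on-disk store; the path is illustrative.
client = chromadb.PersistentClient(path="./sop_index")
collection = client.get_or_create_collection(name="sop_chunks")

# Store one chunk with the metadata schema shown above.
collection.add(
    ids=["cybercrime_sop.pdf-p42-0"],
    documents=["<chunk text>"],
    metadatas=[{
        "source_file": "cybercrime_sop.pdf",
        "page_number": 42,
        "section": "Financial Fraud",
        "jurisdiction": "Karnataka",
    }],
)

# Filter by jurisdiction first, then rank semantically within that subset.
results = collection.query(
    query_texts=["What do I do if the victim is a minor?"],
    n_results=4,
    where={"jurisdiction": "Karnataka"},
)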

Embedding Model: nomic-embed-text

This local embedding model was chosen because:

  • Runs on CPU (no GPU required)
  • Optimized for retrieval tasks
  • Small footprint (~200MB)

Trade-off: Embedding 500 pages took ~3 minutes, so we pre-process all documents at deployment time rather than embedding on demand.
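That pre-processing step looks roughly like this sketch, assuming the embedding model is served through Ollama; the function name is illustrative.

import ollama

def embed_chunks(chunks):
    # Pre-compute one embedding per chunk at deployment time.
    vectors = []
    for chunk in chunks:
        result = ollama.embeddings(model="nomic-embed-text", prompt=chunk)
        vectors.append(result["embedding"])
    return vectors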

Implementation Notes

PDF Parsing Challenges

The SOPs were scanned PDFs, not digital text. OCR was required.

What worked:

  • Tesseract OCR with language pack for English and Hindi
  • Post-processing to remove headers/footers and page numbers

What didn't work:

  • Tables were mangled by OCR (column alignment was lost)
  • Multi-column layouts confused paragraph order
  • Handwritten annotations were captured as garbage text

Solution: Manual review of parsed text for critical sections, then flagging problematic pages for human consultation.
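A stripped-down version of the OCR pass is sketched below, assuming pdf2image (which needs Poppler installed) and pytesseract with the English and Hindi language packs; the header/footer cleanup is omitted here.

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    # Render each scanned page as an image, then run Tesseract on it.
    pages = convert_from_path(pdf_path, dpi=300)
    text_by_page = []
    for image in pages:
        # "eng+hin" requires both Tesseract language packs to be installed.
        text_by_page.append(pytesseract.image_to_string(image, lang="eng+hin"))
    return text_by_page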

Chunk Size Trade-offs

Vector search requires chunking long documents. The question: how big should each chunk be?

Options considered:

  • 256 tokens: High precision, but loses context
  • 1024 tokens: Better context, but less specific retrieval
  • 512 tokens: Middle ground

We chose 512 tokens with 50-token overlap. This preserved sentence boundaries while maintaining context.

How we validated: Ran test queries and measured retrieval accuracy. 512 tokens had the best balance of precision and recall.
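Conceptually the chunker is a sliding window. The sketch below uses whitespace tokens as a stand-in for model tokens, so the counts are approximate; the real splitter also respects sentence boundaries.

def chunk_text(text, chunk_size=512, overlap=50):
    # Sliding window: each chunk shares `overlap` tokens with the previous one.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks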

Hallucination Prevention

The system must not invent procedures. Three safeguards:

1. Retrieval confidence threshold: If the top similarity score is below 0.7, return "I don't have information on that" rather than generating an answer (see the sketch at the end of this section).

2. Source citation: Every response includes the source page number and section title, so officers can verify.

3. Prompt engineering: The system prompt explicitly instructs:

"If the retrieved context does not contain the answer, respond with 'I don't have information on that' and suggest consulting the manual directly."

This reduced hallucinations from ~15% (initial tests) to <2%.
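A minimal sketch of the confidence check (safeguard 1), assuming the collection was created with cosine distance so that similarity is 1 - distance; answer_query is the generation helper sketched earlier, and the exact fallback wording is illustrative.

FALLBACK = "I don't have information on that. Please consult the manual directly."

def answer_with_threshold(collection, question, threshold=0.7):
    # Retrieve the closest chunks and refuse to answer if confidence is low.
    results = collection.query(query_texts=[question], n_results=4)
    distances = results["distances"][0]
    if not distances or (1 - distances[0]) < threshold:
        return FALLBACK
    context = "\n\n".join(results["documents"][0])
    return answer_query(question, context)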

Results & Impact

The system was deployed in a pilot police station in December 2025. After two months:

  • Officers query it 10-15 times per day
  • 92% of queries are answered correctly (validated by supervisors)
  • Average response time: 3.2 seconds

What stayed hard:

  • Queries about procedures not in the SOP still confuse the system
  • Multi-step procedures ("First do X, then Y") sometimes get merged incorrectly
  • OCR errors in critical sections require manual correction

Deployment Constraints

Government environments have unique challenges:

  • Windows-only (no Linux servers)
  • Limited admin access (can't install system dependencies easily)
  • Network restrictions (firewall blocks most package repositories)

Solution: Packaged the entire system as a portable Python environment with all dependencies bundled. Installation was a single ZIP extraction.

Takeaways

Building for government use taught me that reliability and explainability matter more than cutting-edge performance. A system that says "I don't know" when uncertain is more valuable than one that guesses confidently.

The offline constraint forced careful design choices that wouldn't have been necessary with cloud APIs. Those constraints led to a more robust system.

For anyone building RAG systems: spend time on chunking strategy and retrieval confidence thresholds. The embedding model matters less than getting these fundamentals right.
