Case Study
Completed Spec

GRC Agentic AI System

Autonomous Compliance Reasoning Platform

Python · LangChain · Hybrid Retrieval (Semantic + BM25) · Vector DB · Node.js API gateway

The Problem

Organisations subject to GDPR, ISO 27001, and SOC 2 must continuously verify that business decisions, data flows, and vendor relationships comply with overlapping regulatory obligations. Today this work is largely manual: compliance officers search policy libraries, cross-reference clause numbers, and produce written assessments—a process that takes days, is error-prone, and does not scale.

The core failure mode is citation reliability. A human analyst may recall that a proposed data transfer violates "some GDPR article" without being able to cite the exact clause ID on demand. An AI system that merely summarises policies without citing hard identifiers is equally, or more, dangerous—because it provides false confidence.

Most compliance AI tools fail in one of two ways:

  1. Hallucinated citations: The LLM generates plausible-sounding but fabricated clause numbers (e.g., "GDPR Article 47" when the real constraint is Article 46).
  2. Uncited summaries: The system provides general guidance without traceable references, making it unusable for audit purposes.

The gap is not access to regulations—GDPR, ISO 27001, and SOC 2 texts are publicly available. The gap is architectural discipline: building a system that cannot lie about clause IDs, even under pressure to produce an answer.

That discipline gap is what the GRC Agentic AI System addresses.

[Figure: GRC Agentic AI System Overview Architecture]

The Human Anchor

The system was designed around a specific observation from compliance practice: legal accountability requires hard identifiers. When a regulator asks "which clause does this violate?", the answer cannot be "approximately Article 9" or "something in the data protection section." It must be "GDPR Article 9(2)(b)" with the exact text cited.

This human anchor drove every architectural decision. The system is designed so that citation extraction happens before the LLM reads any content—the model receives clause IDs as structured metadata, not as text it might misremember or fabricate. This is not an implementation detail; it is the central safety constraint.

The system explicitly does not replace human legal judgment. It augments it by reducing the mechanical burden of clause lookup while maintaining absolute traceability. Every finding can be audited back to the exact document version from which it was retrieved.

Architecture Principle: Truth Before Format

The most critical design decision emerged from a fundamental conflict: instruction alignment versus factuality.

In early prototyping, I observed that when an LLM is told "return exactly 3 compliance findings" but only 2 high-confidence matches exist in the knowledge base, the model faces two choices:

  1. Break the format: Return 2 items, risking downstream parsing errors in the frontend.
  2. Break the truth: Manufacture a highly plausible 3rd finding.

Models trained to be helpful tend to choose option 2. This is the forced hallucination trap.

The GRC system addresses this through explicit architectural constraints:

  • Query flexibility: The system prompt uses "UP TO THREE findings" instead of "EXACTLY THREE." The JSON schema accepts arrays of 0 to 3 items, not fixed-length arrays.
  • Graceful degradation: The frontend must handle 0, 1, 2, or 3 results without layout failure. If your UI breaks when the AI returns fewer results than expected, your UI is the bug.
  • Confidence thresholds: Findings with confidence scores below 0.6 are excluded from the automated report body and listed separately under "Requires Human Review." The system would rather return an incomplete answer than a confident wrong answer.
  • Escalation gates: All findings with a risk score ≥0.85 are auto-routed to human review before any action is taken. The system never auto-resolves legal conflicts between retrieved clauses—both are presented with an explicit conflict note.
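
The constraints above can be sketched as a small triage step. This is a minimal illustration, assuming a hypothetical `Finding` record and `triage` function; only the thresholds (0.6 and 0.85) and the 0-to-3 limit come from the design, and the field names are placeholders.

```python
from dataclasses import dataclass

MAX_FINDINGS = 3         # "UP TO THREE findings", never exactly three
CONFIDENCE_FLOOR = 0.6   # below this, a finding leaves the report body
ESCALATION_RISK = 0.85   # at or above this, a human reviews before any action


@dataclass(frozen=True)
class Finding:
    # Hypothetical field names, chosen for illustration only.
    clause_id: str
    summary: str
    confidence: float
    risk_score: float


def triage(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Split 0..3 findings into report body, human-review list, and escalations."""
    if len(findings) > MAX_FINDINGS:
        raise ValueError(f"expected at most {MAX_FINDINGS} findings, got {len(findings)}")
    return {
        "report_body": [f for f in findings if f.confidence >= CONFIDENCE_FLOOR],
        "requires_human_review": [f for f in findings if f.confidence < CONFIDENCE_FLOOR],
        # Escalation depends on risk only, so a finding can be in the body and escalated.
        "escalated": [f for f in findings if f.risk_score >= ESCALATION_RISK],
    }
```

An empty input is a valid result here: `triage([])` returns three empty lists rather than raising, which is the behaviour the frontend is expected to tolerate.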

The Anti-Hallucination Retrieval Flow

The retrieval architecture is designed around a single principle: clause identifiers are extracted from document metadata before the LLM reads any content. This prevents the LLM from fabricating clause numbers.

The flow operates in six steps:

  1. Query formulation: The Compliance Tool receives a user query (e.g., "Does transferring employee data to US servers violate GDPR?") plus metadata filters (jurisdiction: EU, topic: data transfer).
  2. Hybrid search: The system performs parallel semantic search (top-k=10 from vector DB) and BM25 keyword search (top-k=10 from full-text index). Results are merged and deduplicated. Regulatory queries frequently contain exact clause numbers ("Article 9", "Annex A.5"), and BM25 outperforms embedding similarity for these exact-match lookups.
  3. Reranking: A cross-encoder reranker scores all candidates; top-5 are retained.
  4. Clause extraction (critical): A Rule/Clause Extractor applies regex and header parsing to pull hard identifiers ("GDPR Article 9", "ISO 27001 Annex A.5") from chunk metadata, before the LLM sees the text. The clause ID is now a structured field, not text the LLM must remember.
  5. Constrained prompt: The LLM receives: "Here is the text of [Clause ID]. Does the proposal violate this clause? Answer ONLY using this clause and cite [Clause ID]." The prompt template is server-side fixed; retrieved content is treated as data, never as instruction (preventing prompt injection).
  6. Cited output: The Explainability Engine formats: "VIOLATION: Proposal transfers data to US servers without adequate safeguards. Source: GDPR Article 46. Confidence: 0.91. Recommendation: Implement standard contractual clauses before proceeding."
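
The merge-and-deduplicate part of the hybrid search step could look like the sketch below. The design only states that results are "merged and deduplicated"; interleaving the two ranked lists is an assumption made here for illustration, and `merge_candidates` and the `chunk_id` key on each hit are hypothetical names.

```python
from itertools import zip_longest


def merge_candidates(semantic: list[dict], keyword: list[dict]) -> list[dict]:
    """Interleave two ranked hit lists, keeping the first occurrence of each chunk_id."""
    merged: list[dict] = []
    seen: set[str] = set()
    # zip_longest pads the shorter list with None so no hit is dropped.
    for pair in zip_longest(semantic, keyword):
        for hit in pair:
            if hit is not None and hit["chunk_id"] not in seen:
                seen.add(hit["chunk_id"])
                merged.append(hit)
    return merged
```

With top-k=10 from each source, this yields between 10 and 20 candidates for the cross-encoder reranker to score.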

This architecture makes hallucination structurally difficult. The LLM is only ever given clause IDs that were extracted from the retrieved chunk metadata, and output validation checks that every cited ID exists in that metadata before the response reaches the user. A response citing any other ID is rejected.
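
A minimal sketch of clause extraction and output validation follows. The regular expressions are hypothetical and cover only the two identifier styles quoted in this case study; a real extractor would need patterns per framework and citation style. `extract_clause_id` and `validate_citations` are illustrative names.

```python
import re
from typing import Optional

# Hypothetical patterns for the two identifier styles mentioned in the text.
CLAUSE_PATTERNS = [
    re.compile(r"\bGDPR\s+Article\s+\d+(?:\(\d+\))?(?:\([a-z]\))?"),
    re.compile(r"\bISO\s+27001\s+Annex\s+A\.\d+(?:\.\d+)*"),
]


def extract_clause_id(header: str) -> Optional[str]:
    """Return the first hard identifier found in a section header, or None."""
    for pattern in CLAUSE_PATTERNS:
        match = pattern.search(header)
        if match:
            # Collapse runs of whitespace so IDs compare equal downstream.
            return re.sub(r"\s+", " ", match.group(0))
    return None


def validate_citations(cited_ids: list[str], chunks: list[dict]) -> list[str]:
    """Return the cited IDs that do NOT appear in the retrieved chunk metadata."""
    allowed = {chunk["clause_id"] for chunk in chunks}
    return [cid for cid in cited_ids if cid not in allowed]
```

A non-empty return value from `validate_citations` means the response cites something that was never retrieved, so it should be blocked rather than shown.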

[Figure: GRC Agentic AI Hybrid Search & Ingestion Sequence Diagram]

The Agentic Reasoning Loop

The system uses a single-agent architecture with specialized tools rather than a multi-agent design. One Agent Orchestrator coordinates all reasoning through three tools: Compliance Tool, Risk Engine, and Explainability Engine.

Why single agent? Compliance reasoning is sequential, not parallel. You must identify applicable regulations before retrieving clauses, retrieve clauses before assessing violations, assess violations before scoring risk, and score risk before generating reports. A multi-agent design would add complexity (agent-to-agent protocols, distributed logs) without benefit for this workload.

The single-agent design provides one decision log, one reasoning chain—critical for auditability. When a regulator asks "why did the system flag this?", you can reconstruct the entire reasoning chain from one log.

The control loop follows a Plan-Act-Observe-Revise pattern:

  • Plan: On receiving a goal, the Orchestrator uses LLM-based chain-of-thought to decompose it into sub-tasks.
  • Act: The loop executes tool calls with structured queries and metadata filters.
  • Observe: Each tool result is checked for completeness and confidence.
  • Revise: If results are incomplete or below confidence threshold, the Orchestrator revises the plan (broadens the retrieval query, tries alternative filters) before concluding or escalating.

Failure handling:

  • No retrieval results: Orchestrator widens query, retries once. If still empty, returns "No relevant policy found, human review recommended."
  • Low-confidence output (score <0.6): Finding excluded from report body; listed under "Uncertain items requiring review."
  • Tool timeout (>30s): Orchestrator falls back to degraded keyword-search mode and logs the timeout.
  • Conflicting clauses: Both clauses returned with explicit conflict note; always escalated. The system never auto-resolves legal conflicts.
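
The "widen once, then give up" rule for empty retrievals can be sketched as follows. The `search` and `widen` callables are hypothetical stand-ins for the Compliance Tool and the Orchestrator's query-broadening logic; only the single retry and the fallback message come from the design.

```python
from typing import Callable

NO_POLICY_MESSAGE = "No relevant policy found, human review recommended."


def retrieve_with_retry(
    query: str,
    filters: dict,
    search: Callable[[str, dict], list[dict]],
    widen: Callable[[dict], dict],
) -> dict:
    """Run retrieval; on empty results widen the filters once, then stop."""
    results = search(query, filters)
    if not results:
        # Exactly one retry with broader filters (how to widen is an assumption).
        results = search(query, widen(filters))
    if not results:
        return {"status": "no_results", "message": NO_POLICY_MESSAGE, "results": []}
    return {"status": "ok", "message": "", "results": results}
```

One plausible `widen` drops the topic filter while keeping jurisdiction, for example `lambda f: {k: v for k, v in f.items() if k != "topic"}`.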

Memory and Data Architecture

The system operates across two execution modes: online (query processing) and offline (document ingestion).

Long-term knowledge (Vector DB + Full-Text Index): The vector database stores 1536-dimensional embeddings alongside metadata per chunk: chunk_id, clause_id, jurisdiction, topic, content, source_document, version, created_at, and confidence_floor. The full-text index (BM25) mirrors the schema without embeddings and is optimized for exact clause-number lookups. Raw source files are stored in object storage with versioning, enabling any finding to be traced to the exact document version from which its clause was retrieved.
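
The per-chunk metadata listed above maps naturally onto a record type. In this sketch the field names and the 1536-dimension embedding come from the description; the Python types and the `to_fulltext_doc` helper are assumptions.

```python
from dataclasses import asdict, dataclass
from datetime import datetime

EMBEDDING_DIM = 1536


@dataclass
class ChunkRecord:
    chunk_id: str
    clause_id: str
    jurisdiction: str
    topic: str
    content: str
    source_document: str
    version: str
    created_at: datetime
    confidence_floor: float
    embedding: list[float]

    def __post_init__(self) -> None:
        # Reject vectors that do not match the index dimensionality.
        if len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(f"embedding must have {EMBEDDING_DIM} dimensions")

    def to_fulltext_doc(self) -> dict:
        """Same schema without the embedding, as mirrored in the BM25 index."""
        doc = asdict(self)
        doc.pop("embedding")
        return doc
```

Keeping `clause_id`, `source_document`, and `version` on every record is what lets a finding be traced back to an exact document version.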

Short-term/working memory (Session Store): Session memory is scoped to a single user interaction and held in Redis with a 24-hour TTL. It stores conversation turns, the current Plan-Act-Observe-Revise (PAOR) loop state, intermediate tool results, and session metadata. The Orchestrator reads session state at the start of each loop iteration. Each tool call is written before dispatch and its result is written once it returns, so that a retried iteration can detect work already done (idempotency).

Ingestion pipeline (offline): The ingestion pipeline runs asynchronously and is fully decoupled from the online query path, so new documents can be indexed without affecting query latency. The Smart Chunker splits documents at section headers, producing one chunk per regulatory article or annex entry. Any section exceeding 512 tokens is split further into chunks that overlap by 50 tokens.
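
A rough sketch of the chunking rule, under two loud assumptions: tokens are approximated by whitespace-separated words (a real pipeline would use the embedding model's tokenizer), and section headers are lines beginning with "Article <n>" or "Annex <id>". `HEADER_RE`, `split_sections`, and `window` are hypothetical names.

```python
import re

MAX_TOKENS = 512
OVERLAP = 50
# Hypothetical header pattern: a line starting with "Article 46" or "Annex A.5".
HEADER_RE = re.compile(r"^(?:Article\s+\d+|Annex\s+[A-Z](?:\.\d+)*)\b.*$", re.MULTILINE)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (header, body) pairs, one per matched section header."""
    matches = list(HEADER_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(0).strip(), text[m.end():end].strip()))
    return sections


def window(tokens: list[str], size: int = MAX_TOKENS, overlap: int = OVERLAP) -> list[list[str]]:
    """Return tokens whole if they fit, else windows sharing `overlap` tokens."""
    if len(tokens) <= size:
        return [tokens]
    step = size - overlap
    # Stop early enough that no window consists only of already-seen tokens.
    return [tokens[i:i + size] for i in range(0, len(tokens) - overlap, step)]


def chunk_document(text: str) -> list[dict]:
    """One chunk per section, with long sections split into overlapping windows."""
    return [
        {"header": header, "content": " ".join(part)}
        for header, body in split_sections(text)
        for part in window(body.split())
    ]
```

Because every window of a long section carries the same header, the clause ID extracted from that header stays attached to each resulting chunk.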

[Figure: GRC Memory and Data Pipeline Architecture]

Infrastructure and Trust Boundaries

The architecture separates the Node.js Backend API (handles validation, security, routing) from the Python Agentic AI System (handles all reasoning). Each tier scales independently and is secured by distinct trust boundaries.

  • Public DMZ: The API Gateway (Node.js, containerized) handles OAuth 2.0/API key authentication, PII masking, and request validation. Public internet traffic terminates here over HTTPS.
  • Private Compute Zone: All agent and RAG services run on a private subnet with mTLS enforced for inter-service communication. The Agent Orchestrator and RAG Retrieval Service scale horizontally with concurrency limits to prevent runaway LLM costs.
  • LLM Isolation: LLM API calls are proxied through the Retrieval Service — agents never call the LLM provider directly. This creates a controlled egress point for monitoring and audit.
  • Audit Logs (WORM): All agent decisions are written to an append-only audit store (Write-Once-Read-Many). Every tool call, retrieval query, LLM prompt template, LLM response, confidence score, and final output is permanently recorded. User identifiers are hashed; raw PII is never written to logs. Retention is 7 years minimum, exportable in JSON and CSV for external review.
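
One way an audit record could be assembled is sketched below. The design says only that user identifiers are hashed; salted SHA-256 is an assumption, as are the record field names and the `build_audit_record` helper. Write-once guarantees must come from the storage layer and are not shown here.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def hash_user_id(user_id: str, salt: str) -> str:
    """Salted SHA-256 of the user identifier (hashing scheme is an assumption)."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def build_audit_record(
    user_id: str,
    salt: str,
    event_type: str,
    payload: dict,
    confidence: Optional[float] = None,
) -> str:
    """Serialise one agent action as a JSON line; the raw user ID is never included."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_hash": hash_user_id(user_id, salt),
        "event_type": event_type,  # e.g. "tool_call" or "llm_response"
        "payload": payload,        # caller must mask PII before passing this in
        "confidence": confidence,
    }
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON object per line keeps the store append-friendly and makes the JSON export mentioned above a straightforward copy.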

The Responsible AI Considerations

Compliance AI systems carry unique risks that extend beyond typical hallucination concerns:

  1. Fabricated citations: An AI that invents clause numbers could cause an organization to cite non-existent regulations in legal filings, destroying credibility.
    Mitigation: The Clause Extractor pulls hard IDs from metadata before the LLM reads the chunk. Constrained prompts restrict the LLM to the supplied clause. Output validation checks cited IDs.
  2. Over-reliance: Users might treat AI output as final legal advice rather than augmented research.
    Mitigation: All reports carry a mandatory disclaimer. Findings with a risk score ≥0.85 are auto-escalated to human review. Confidence scores are always displayed.
  3. Data leakage: Proprietary documents or PII could be exposed via LLM API calls.
    Mitigation: PII masking in Guardrails before any external call. Self-hosted LLM option for highest-sensitivity deployments.
  4. Outdated knowledge: Superseded regulations in the knowledge base could cause incorrect findings.
    Mitigation: Document versioning with expiry dates. Admin alerts when documents are more than 180 days old. Report output includes version and last-updated date.
  5. Prompt injection: Malicious content in uploaded documents could attempt to override system prompts.
    Mitigation: Input sanitization in Guardrails. System prompts are server-side fixed templates; retrieved content is treated as data, never as instruction.
  6. Bias: The LLM might favor certain regulatory interpretations based on training data.
    Mitigation: Hybrid search reduces dependence on LLM priors. Mandatory human review for high-stakes findings.

Design Trade-offs

  • Single Agent vs. Multi-Agent: Single agent chosen for sequential compliance reasoning with bounded scope. Provides one decision log, straightforward auditability, and simpler orchestration. Multi-agent would add complexity (agent protocols, distributed logs) without immediate benefit.
  • Centralized vs. Per-Agent RAG: Centralized RAG subsystem (one shared retrieval service, one shared vector DB) chosen to reduce infrastructure cost, simplify knowledge base management, and enable cross-query result caching.
  • Hybrid Search vs. Semantic-Only: Hybrid search (semantic embedding + BM25 keyword) chosen. Regulatory queries frequently contain exact clause numbers ("Article 9"), for which BM25 outperforms embedding similarity. The BM25 index also provides a low-latency fallback when the vector DB is offline.
  • Hosted vs. Self-Hosted LLM: Hosted LLM API is the default, trading data sovereignty for operational simplicity. The architecture is LLM-agnostic, supporting self-hosted open-weight models on private GPU infrastructure for strict data residency requirements.

Outcomes

  • Complete architecture design document produced for MSc assessment (RAI-8003: Computing Architectures for AI).
  • Query latency: target of p95 under 90 seconds end-to-end (to be achieved through async tool execution, cached embeddings, and pre-warmed agent containers).
  • Citation accuracy: target of >90% correct clause IDs on a held-out test set (supported by clause extraction before LLM processing).
  • Traceability: by design, 100% of findings cite a specific clause ID from the source document.
  • Safety: All findings with risk score ≥0.85 escalated to human review (hard-coded escalation gate).
  • Audit trail specification: 100% of agent actions logged with timestamp, input hash, output, and confidence.

What I Would Do Differently

I would build a functional prototype of the retrieval pipeline before finalizing the architecture document. Testing the hybrid search (semantic + BM25) against real GDPR/ISO queries would reveal whether the cross-encoder reranker provides sufficient precision or whether additional filtering logic is needed.

I would also define the exact schema for the "conflict note" earlier in the design process to ensure downstream systems (human review queue, audit export) can parse conflicts reliably.

Lastly, rather than relying on the >90% target as an assumption, I would create a held-out test set of known violations with ground-truth clause IDs before finalizing the design to enable quantitative, continuous measurement.

What's Next

  1. Implement the ingestion pipeline first: Validate that the Smart Chunker correctly extracts clause IDs from GDPR, ISO 27001, and SOC 2 documents. If clause extraction fails, the entire safety architecture collapses.
  2. Build the Ghost Town test harness: Create automated tests in CI/CD that query the system for non-existent regulations (e.g., "GDPR Article 999") and verify that the system returns empty results rather than hallucinating.
  3. Design the human review interface: Define the UX for human reviewers to approve, reject, or request additional research.
  4. Plan for multi-jurisdiction complexity: Extend the system to support cross-jurisdictional analysis (e.g., GDPR vs. CCPA vs. PIPEDA) and to surface conflicts between jurisdictions for human review.
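
The Ghost Town harness from item 2 could start as small as the sketch below. `ask` is a hypothetical callable wrapping the system's query entry point and returning a list of findings; the second query is an illustrative addition alongside the "GDPR Article 999" example.

```python
from typing import Callable

# Queries about clauses that do not exist; any finding returned is a hallucination.
GHOST_QUERIES = [
    "What does GDPR Article 999 require?",
    "Does this proposal violate ISO 27001 Annex A.99?",
]


def ghost_town_failures(
    ask: Callable[[str], list[dict]],
    queries: list[str] = GHOST_QUERIES,
) -> list[str]:
    """Return the ghost queries for which the system wrongly produced findings."""
    return [q for q in queries if len(ask(q)) > 0]
```

In CI this would reduce to a single assertion, `assert ghost_town_failures(ask) == []`, run against a staging deployment.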