PII Privacy Gateway: Engineering Data Trust
Production-grade security middleware that sanitizes unstructured legal discovery data. A high-performance Python engine redacts sensitive information (SSNs, financial data, AWS keys) in real time, before it reaches LLM providers or cloud logs.
Technical Architecture Breakdown
1. The Gollog Engine (Core Library)
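A standalone Python library that handles the heavy lifting of PII identification. It uses a thread-local storage (TLS) guard to prevent infinite recursion during log interception: if the redaction code itself emits a log message, that message passes back through the interceptor, which would call the redaction code again. This is a common pitfall when building security shims.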
2. The Privacy Gateway (API Layer)
A FastAPI implementation providing an asynchronous entry point. It uses Pydantic for strict schema validation, so only JSON payloads that match the declared schema are processed, which reduces the attack surface for malicious injection.
3. Component Logic
- Rule Manager: Parses YAML-based definitions for "hot-swappable" security rules.
- Sensitive Data Processor: Orchestrates multi-pass redaction logic.
- Log Interceptor: Traps outgoing log messages to ensure PII never reaches stdout or CloudWatch.
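The way these components fit together can be sketched as follows, reading "multi-pass" as one substitution pass per rule. Everything here is an assumption made for illustration: the `Rule` fields, the rule names, and the patterns are hypothetical, and the rules are written as Python data where the real Rule Manager would load them from YAML.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One redaction rule; in the real system these would come from YAML."""
    name: str
    pattern: str
    replacement: str


# Hypothetical rule set. The AWS pattern is loosened to {12,16} so that it
# also matches the shortened sample key used in this README.
RULES = [
    Rule("aws_key_id", r"\bAKIA[0-9A-Z]{12,16}\b", "[REDACTED-AWS-ID]"),
    Rule("ssn", r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED-SSN]"),
    Rule("email", r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", "[REDACTED-EMAIL]"),
]


def compile_rules(rules):
    """Compile once per rule set so regex compilation is not paid per message."""
    return [(re.compile(r.pattern), r.replacement) for r in rules]


def redact(text, compiled):
    """Apply each rule in order; each pass sees the previous pass's output."""
    for regex, replacement in compiled:
        text = regex.sub(replacement, text)
    return text
```

Usage: `redact("Login attempt: AKIAIMNOSTEST123", compile_rules(RULES))`. Keeping rules as plain data and compiling them into a fresh list is one way to support hot-swapping: a reload builds a new compiled list and replaces the old reference in a single assignment.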
Log Interceptor: Zero-Leak Evidence
The table below demonstrates how the RedactingFormatter catches sensitive data at the application boundary before it reaches the console or cloud storage sinks.
| Log Source | Raw Output (Unprotected) | Intercepted (Protected) |
|---|---|---|
| API Request Log | Received payload: {"ssn": "000-12-3456"} | Received payload: {"ssn": "[REDACTED-SSN]"} |
| Auth Debugger | Login attempt: AKIAIMNOSTEST123 | Login attempt: [REDACTED-AWS-ID] |
| Validation Error | Failed card: 4111111111111111 | Failed card: ****-****-****-1111 |
| System Exception | Timeout: j.doe@example.com | Timeout: [REDACTED-EMAIL] |
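A formatter that produces the rows above could look roughly like the sketch below. Only the class name `RedactingFormatter` comes from this README; the patterns are illustrative guesses sized to the sample data, not the project's actual rules.

```python
import logging
import re

# Illustrative patterns only; ordered so each sample row hits exactly one rule.
_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\bAKIA[0-9A-Z]{12,16}\b"), "[REDACTED-AWS-ID]"),
    # Keep the last four digits of a 16-digit card number, mask the rest.
    (re.compile(r"\b\d{12}(\d{4})\b"), r"****-****-****-\1"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED-EMAIL]"),
]


class RedactingFormatter(logging.Formatter):
    """Formats the record normally, then scrubs the final string."""

    def format(self, record: logging.LogRecord) -> str:
        # super().format() merges args and appends any traceback text,
        # so redaction runs over everything the handler would write.
        rendered = super().format(record)
        for regex, replacement in _PATTERNS:
            rendered = regex.sub(replacement, rendered)
        return rendered
```

To use it, attach the formatter to each handler that writes to a sink, for example `handler.setFormatter(RedactingFormatter("%(message)s"))`. Redacting in the formatter only protects handlers that use it, so any handler configured with a plain `logging.Formatter` would still emit raw text.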
Security & Compliance Analysis
For a legal-tech platform, maintaining SOC 2 Type II compliance hinges on applying the principle of least privilege to data access. By implementing redaction at the gateway level, we ensure that highly sensitive PII, such as claimant SSNs or medical record identifiers, never reaches downstream analytics or LLM providers.
The Log Interception architecture is particularly critical; it mitigates the risk of "log leakage," where PII is inadvertently persisted in plain text within observability stacks (like ELK or Datadog). This proactive sanitization reduces the blast radius of potential leaks by enforcing data privacy at the infrastructure layer.
Future Roadmap
Presidio Integration
Evaluating Microsoft Presidio for ML-based entity recognition alongside the current regex engine.
Dockerization
Creating a lightweight Alpine-based image for k8s sidecar deployments.
