Shadow Warden AI — Gateway v7.0 — Explore the API Reference

Document Intelligence

FE-50 · v5.4

Convert any file — PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP — to Markdown via Microsoft MarkItDown, then pipe the extracted text through the full 9-layer Warden security pipeline before it touches an LLM or a corporate data store.

MarkItDown · SecretRedactor · SemanticGuard · Redis Cache · Prometheus · SOVA Tool #50

Processing Pipeline

1. Upload: PDF / DOCX / PPTX / XLSX / HTML / image / audio / ZIP
2. Convert: Microsoft MarkItDown → clean Markdown
3. Redact: SecretRedactor (15 patterns + entropy scan)
4. Classify: data class PHI / PII / FINANCIAL / CLASSIFIED / GENERAL
5. Filter: 9-layer pipeline (Topology → Brain → Causal → Decision)
6. Verdict: ALLOW · MEDIUM · HIGH · BLOCK, plus a STIX audit entry

Three Ways to Use It

Portal — No-code UI

  • Visit /doc-scanner/ in the tenant portal.
  • Drag and drop any file.
  • Get a verdict, data class, secrets found, and extracted Markdown — instantly.

REST API

  • POST /document-intel/convert-and-scan
  • POST /document-intel/convert
  • POST /document-intel/convert-batch
  • GET /document-intel/health
  • GET /document-intel/formats
  • GET /document-intel/stats
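
As a rough illustration of calling the scan endpoint from Python, the sketch below builds the request with the standard library only. The base URL is a placeholder, and the JSON body shape (reusing the file_base64 and file_filename fields from the filter hook) is an assumption; the real endpoint may expect a multipart upload or authentication headers, so check the API docs before relying on it.

```python
import base64
import json
import urllib.request

# Hypothetical gateway address; replace with your deployment's base URL.
BASE_URL = "https://warden.example.com"


def build_scan_request(file_bytes: bytes, filename: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /document-intel/convert-and-scan.

    Assumption: the endpoint accepts a JSON body with file_base64 and
    file_filename, mirroring the /filter hook. This is not confirmed by
    the page above.
    """
    body = json.dumps({
        "file_base64": base64.b64encode(file_bytes).decode("ascii"),
        "file_filename": filename,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/document-intel/convert-and-scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scan_request(b"%PDF-1.7 example bytes", "report.pdf")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=60)
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect.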

Filter Hook

  • Add file_base64 + file_filename to any POST /filter request.
  • The gateway converts the file and replaces content before the pipeline.
  • Fail-open: conversion errors fall back to original content.
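
The fail-open behaviour described above can be sketched as follows. This is not the gateway's actual code: apply_file_hook and the convert callable are hypothetical names, and the request is modelled as a plain dict with the content, file_base64, and file_filename fields named on this page.

```python
import base64
from typing import Callable


def apply_file_hook(request: dict, convert: Callable[[bytes, str], str]) -> dict:
    """Return a copy of a /filter request with content replaced by converted text.

    Fail-open: if decoding or conversion raises, the original content is kept
    and the request continues into the pipeline unchanged.
    """
    result = dict(request)
    if "file_base64" not in request or "file_filename" not in request:
        return result  # no file attached, nothing to do
    try:
        raw = base64.b64decode(request["file_base64"], validate=True)
        result["content"] = convert(raw, request["file_filename"])
    except Exception:
        # Assumption: any error falls back to the original content.
        pass
    return result


def fake_convert(data: bytes, name: str) -> str:
    """Stand-in for the real converter, used only for this demo."""
    return f"# {name}\n\n{data.decode('utf-8')}"


def broken_convert(data: bytes, name: str) -> str:
    raise RuntimeError("conversion failed")


request = {
    "content": "original prompt",
    "file_base64": base64.b64encode(b"hello").decode("ascii"),
    "file_filename": "note.txt",
}
print(apply_file_hook(request, fake_convert)["content"])
print(apply_file_hook(request, broken_convert)["content"])
```

Fail-open keeps the filter available when a file cannot be converted, at the cost that the pipeline then scans only the original content and not the attachment.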

File-Type Cache TTLs

File Types            | Cache TTL           | Reason
PDF, DOCX, PPTX, XLSX | 24 h                | Office documents rarely change
MP3, WAV, FLAC, M4A   | 7 days              | Transcription is expensive
JPG, PNG, WEBP, GIF   | 1 h                 | Images update more frequently
All others            | DOC_INTEL_CACHE_TTL | Configurable fallback

Cache key: doc_intel:md:{sha256_of_file_bytes} — identical files are never converted twice, regardless of filename.
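
A minimal sketch of the key and TTL logic, assuming the hash is the hex-encoded SHA-256 digest and that the file type is taken from the filename extension. The function names are illustrative, not the gateway's.

```python
import hashlib
from pathlib import PurePath

HOUR = 3600
DAY = 24 * HOUR

# TTLs from the table above, in seconds, keyed by lowercase extension.
TTL_BY_EXT = {
    **dict.fromkeys(("pdf", "docx", "pptx", "xlsx"), 24 * HOUR),
    **dict.fromkeys(("mp3", "wav", "flac", "m4a"), 7 * DAY),
    **dict.fromkeys(("jpg", "png", "webp", "gif"), 1 * HOUR),
}


def cache_key(file_bytes: bytes) -> str:
    """Key depends only on file content, so renamed copies share one entry."""
    return "doc_intel:md:" + hashlib.sha256(file_bytes).hexdigest()


def cache_ttl(filename: str, fallback_ttl: int = 3600) -> int:
    """Pick the TTL by extension; unknown types use the configured fallback."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    return TTL_BY_EXT.get(ext, fallback_ttl)


data = b"same bytes, different names"
print(cache_key(data))
print(cache_ttl("Q3-report.PDF"), cache_ttl("call.wav"), cache_ttl("page.html"))
```

Only the extensions listed in the table are mapped here; anything else (for example .jpeg) would take the fallback TTL in this sketch.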

Configuration

Variable            | Default   | Description
DOC_INTEL_MAX_BYTES | 52428800  | Maximum file size in bytes before rejection (50 MB)
DOC_INTEL_TIMEOUT_S | 30        | Per-conversion thread timeout in seconds
DOC_INTEL_CACHE_TTL | 3600      | Fallback Redis cache TTL in seconds (overridden by file type)
REDIS_URL           | redis://… | Cache and stats store; set to memory:// for tests
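
One way these variables might be read and applied is sketched below. The DocIntelConfig class and the helper names are hypothetical. REDIS_URL is left out because its default is elided above, and treating a file of exactly DOC_INTEL_MAX_BYTES as accepted is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DocIntelConfig:
    max_bytes: int
    timeout_s: float
    cache_ttl: int


def load_config(env: Mapping[str, str] = os.environ) -> DocIntelConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return DocIntelConfig(
        max_bytes=int(env.get("DOC_INTEL_MAX_BYTES", "52428800")),
        timeout_s=float(env.get("DOC_INTEL_TIMEOUT_S", "30")),
        cache_ttl=int(env.get("DOC_INTEL_CACHE_TTL", "3600")),
    )


def within_size_limit(cfg: DocIntelConfig, n_bytes: int) -> bool:
    """Size gate: assumed to reject only files strictly larger than max_bytes."""
    return n_bytes <= cfg.max_bytes


cfg = load_config({"DOC_INTEL_TIMEOUT_S": "10"})
print(cfg)
print(within_size_limit(cfg, 10 * 1024 * 1024))
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.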

Observability

warden_doc_intel_convert_total {ext, data_class}

Total conversions by file type and inferred data class

warden_doc_intel_convert_errors_total {ext, error}

Conversion errors — use for SLO alerting on error rate

warden_doc_intel_cache_hits_total

Redis cache hits — use to track cache efficiency

Stats also available at GET /document-intel/stats — returns total, cache_hits, errors, sensitive, secrets_found from Redis.
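
The stats fields can be turned into the ratios an SLO would watch. The sketch below assumes that total counts every scan (including cache hits and errors) and that values may arrive as strings, as Redis hash fields typically do. The sample numbers are made up for illustration.

```python
from typing import Mapping


def summarize_stats(stats: Mapping[str, object]) -> dict:
    """Derive simple ratios from the /document-intel/stats fields."""
    total = int(stats.get("total", 0))

    def ratio(key: str) -> float:
        # Guard against division by zero on a fresh deployment.
        return int(stats.get(key, 0)) / total if total else 0.0

    return {
        "cache_hit_ratio": ratio("cache_hits"),
        "error_rate": ratio("errors"),
        "sensitive_ratio": ratio("sensitive"),
    }


# Illustrative values only, not real measurements.
sample = {"total": "200", "cache_hits": "50", "errors": "4",
          "sensitive": "10", "secrets_found": "7"}
summary = summarize_stats(sample)
print(summary)
```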

Shipped Features

FE-50-0

Filter Pipeline — file_base64 Hook

✅ Shipped

POST /filter accepts file_base64 + file_filename fields. Before the 9-layer pipeline runs, the file is converted to Markdown via MarkItDown and replaces content. Fail-open: conversion errors fall back to original content. Supports PDF, DOCX, PPTX, XLSX, HTML, images, audio, ZIP, EPUB.

All tiers · v5.4

FE-50-1

MarkItDown Converter

✅ Shipped

Microsoft MarkItDown converts any office or media file to clean Markdown. File-type-aware cache TTLs: PDF/DOCX/PPTX/XLSX 24 h, audio 7 days, images 1 h. 50 MB size gate (DOC_INTEL_MAX_BYTES). 30 s thread-pool timeout (DOC_INTEL_TIMEOUT_S). Redis cache keyed by the SHA-256 hash of the file bytes.

Community+ · v5.4
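
A thread-pool timeout of the kind described for the converter could look roughly like this. It is a sketch under stated assumptions: convert_with_timeout is a hypothetical helper, and the converter is passed in as a plain callable (in production it would presumably wrap MarkItDown) so the example runs without that dependency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared pool; the worker count is an arbitrary choice for this sketch.
_POOL = ThreadPoolExecutor(max_workers=4)


def convert_with_timeout(convert: Callable[[bytes], str], data: bytes,
                         timeout_s: float = 30.0) -> str:
    """Run a conversion in the pool and give up after timeout_s seconds."""
    future = _POOL.submit(convert, data)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # thread cannot be killed and will finish in the background.
        future.cancel()
        raise TimeoutError(f"conversion exceeded {timeout_s}s") from None


def slow_stub(data: bytes) -> str:
    """Stand-in for a conversion that takes too long."""
    time.sleep(0.3)
    return "late"


print(convert_with_timeout(lambda d: d.decode("utf-8"), b"# Title", timeout_s=1.0))
try:
    convert_with_timeout(slow_stub, b"", timeout_s=0.05)
except TimeoutError as exc:
    print("timed out:", exc)
```

The timeout bounds how long the caller waits, not how long the worker thread runs, which is one reason to pair it with the size gate.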
FE-50-2

Prometheus Metrics

✅ Shipped

Three Grafana-ready Prometheus counters: warden_doc_intel_convert_total{ext,data_class}, warden_doc_intel_convert_errors_total{ext,error}, warden_doc_intel_cache_hits_total. Suitable for SLO alerting on conversion error rate and for tracking cache efficiency.

All tiers · v5.4

FE-50-2

Document Intel API — 6 Endpoints

✅ Shipped

Six endpoints under /document-intel: POST /convert, POST /convert-and-scan (SecretRedactor + SemanticGuard), POST /convert-batch, GET /health, GET /formats, GET /stats. Gated at Community+. The stats endpoint reads a Redis hash (total, cache_hits, errors, sensitive, secrets_found).

Community+ · v5.4

FE-50-4

SOVA Tool #50 — scan_document

✅ Shipped

SOVA agent can scan base64-encoded files through the full pipeline. Converts PDF/DOCX/PPTX to Markdown via MarkItDown, then runs SecretRedactor + SemanticGuard + HyperbolicBrain + CausalArbiter. Returns full FilterResponse: allowed, risk_level, secrets_found, semantic_flags.

Pro+ · v5.4
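
To show how a caller might consume the scan_document result, the sketch below models FilterResponse with the four fields named above. The field types and the forwarding policy are assumptions; the page does not say whether secrets_found is a list or a count, so the code accepts either.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class FilterResponse:
    allowed: bool
    risk_level: str
    secrets_found: Any = None   # list or count; type not specified by the docs
    semantic_flags: Any = None


def parse_filter_response(data: Mapping[str, Any]) -> FilterResponse:
    """Build a FilterResponse from a decoded JSON object."""
    return FilterResponse(
        allowed=bool(data["allowed"]),
        risk_level=str(data["risk_level"]),
        secrets_found=data.get("secrets_found"),
        semantic_flags=data.get("semantic_flags"),
    )


def safe_to_forward(resp: FilterResponse) -> bool:
    """Hypothetical policy: forward only allowed documents with no secrets."""
    return resp.allowed and not resp.secrets_found


# Illustrative payload, not captured from a real gateway.
resp = parse_filter_response(
    {"allowed": True, "risk_level": "MEDIUM", "secrets_found": [], "semantic_flags": []}
)
print(resp.risk_level, safe_to_forward(resp))
```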
FE-50-5

SOC Dashboard — Document Scans Widget

✅ Shipped

5-metric row on the SOC overview: Total Scanned, Cache Hits, Sensitive Docs, Secrets Found, Errors. Queries GET /document-intel/stats via DocScanStats type. Widget hidden gracefully when endpoint is unreachable.

All tiers · v5.4

Ready to scan your first document?

No API key needed — the portal handles authentication and proxies securely.
