Document redaction

Redact personally identifiable information (PII) from documents (PDF, PNG, JPG), Word files (DOCX), or tabular data (XLSX/CSV/Parquet). Please see the User Guide for a full walkthrough of all the features and settings.

To start, upload a document below (or click on an example), then click 'Extract text and redact document'. Then review and modify the suggested redactions on the 'Review redactions' tab.

NOTE: The app is not 100% accurate and will miss some personal information. A human must review all outputs before they are used.

Test out the different OCR methods available. Click on an example below and then the 'Extract text and redact document' button:

Examples
Choose text extraction method
Choose a local OCR model. "tesseract" is the default and works well for documents with clear typed text. "paddle" is more accurate where the text is unclear or poorly formatted, but it does not natively support word-level extraction, so word bounding boxes will be inaccurate. "vlm" calls the chosen vision language model (VLM) to return structured JSON output that is then parsed into word-level bounding boxes. "hybrid-paddle-vlm" combines PaddleOCR with the chosen VLM.
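
To illustrate why word boxes can be inaccurate when an OCR engine only returns line-level results, here is a minimal sketch of one plausible way to approximate them: split the line box in proportion to character counts. This is an assumption for illustration, not the app's actual method. The function name `estimate_word_boxes` and the `(left, top, right, bottom)` box layout are hypothetical.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def estimate_word_boxes(line_text: str, line_box: Box) -> List[Tuple[str, Box]]:
    """Approximate per-word boxes by assuming every character has equal width."""
    left, top, right, bottom = line_box
    if not line_text:
        return []
    char_width = (right - left) / len(line_text)

    results: List[Tuple[str, Box]] = []
    cursor = 0  # character offset of the current word within the line
    for word in line_text.split(" "):
        if word:
            start = left + cursor * char_width
            end = start + len(word) * char_width
            results.append((word, (start, top, end, bottom)))
        cursor += len(word) + 1  # +1 for the space that followed the word
    return results


boxes = estimate_word_boxes("ab cde", (0, 0, 60, 10))
```

Because real fonts are rarely fixed-width, boxes estimated this way drift from the true word positions, which is why redactions based on them need closer review.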
Choose redaction method

Find and redact whole pages that contain duplicate text. See the 'Identify duplicate pages' tab for all settings and duplicate sentence/passage redaction.
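
As a rough illustration of duplicate-page detection, the sketch below flags pages whose text exactly matches an earlier page after normalising whitespace and case. It is a simplification under stated assumptions: the app's own settings may use similarity thresholds or passage-level matching rather than exact matches, and `find_duplicate_pages` is a hypothetical name.

```python
import hashlib
import re
from typing import Dict, List


def normalise(text: str) -> str:
    """Collapse runs of whitespace and lowercase so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicate_pages(pages: List[str]) -> List[int]:
    """Return 1-based page numbers whose normalised text repeats an earlier page."""
    first_seen: Dict[str, int] = {}
    duplicates: List[int] = []
    for number, text in enumerate(pages, start=1):
        cleaned = normalise(text)
        if not cleaned:
            continue  # skip blank pages rather than treating them as duplicates
        key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if key in first_seen:
            duplicates.append(number)
        else:
            first_seen[key] = number
    return duplicates


duplicate_pages = find_duplicate_pages(["Hello  world", "Other", "hello world\n", ""])
```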

Choose a personal information detection model. Note that AWS Comprehend, if shown, has a cost of around £0.0075 ($0.01) per 10,000 characters.
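
For budgeting, the quoted rate can be turned into a quick estimate. The sketch below uses only the approximate figure stated above ($0.01 per 10,000 characters) and ignores any minimum charges or rounding that AWS may apply. The function name is hypothetical.

```python
# Approximate rate quoted in the text above; check current AWS pricing before relying on it.
USD_PER_10K_CHARS = 0.01


def estimate_comprehend_cost_usd(n_chars: int) -> float:
    """Estimate the AWS Comprehend cost in USD for a given number of characters."""
    return n_chars / 10_000 * USD_PER_10K_CHARS


# Example: a document set containing 250,000 characters of extracted text.
cost = estimate_comprehend_cost_usd(250_000)
```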
Local PII identification model (click empty space in box for full list)
TITLES
PERSON
PHONE_NUMBER
EMAIL_ADDRESS
STREETNAME
UKPOSTCODE
CUSTOM
Allow list (never redact these words)
Deny list (always redact these words)
Fully redact these pages
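
The allow and deny lists can be thought of as a filter applied on top of the detected entities. The following is a minimal sketch, assuming detections arrive as `(start, end)` character spans. The functions `spans_to_redact` and `redact` are hypothetical and do not reflect the app's internal API.

```python
import re
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def spans_to_redact(
    text: str,
    detected: List[Span],
    allow_list: Iterable[str],
    deny_list: Iterable[str],
) -> List[Span]:
    """Drop detections that match the allow list, then add every deny-list match."""
    allow = {word.lower() for word in allow_list}
    kept = [(s, e) for s, e in detected if text[s:e].lower() not in allow]

    for term in deny_list:
        if not term:
            continue
        # Whole-word, case-insensitive match; assumes terms start and end with word characters.
        pattern = r"\b" + re.escape(term) + r"\b"
        kept.extend(m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE))
    return sorted(set(kept))


def redact(text: str, spans: List[Span], mask: str = "#") -> str:
    """Replace every character inside the given spans with the mask character."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = mask
    return "".join(chars)


sample = "Dr Smith met Acme staff in London."
detected_spans = [(3, 8), (27, 33)]  # pretend a model flagged "Smith" and "London"
final_spans = spans_to_redact(sample, detected_spans, ["London"], ["Acme"])
redacted = redact(sample, final_spans)
```

Here "London" is kept because it is on the allow list, while "Acme" is redacted because it is on the deny list even though no model flagged it.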