Document Redaction App

Test out the different OCR methods available. Click on an example below and then the 'Extract text and redact document' button:

Examples

Choose a PDF document or image file (PDF, JPG, PNG)

Choose text extraction method. Local options are lower quality but cost nothing; they may be worth a try if you are willing to spend some time reviewing outputs. If shown, AWS Textract is charged per 1,000 pages: £1.14 ($1.50) without signature detection (default), or £2.66 ($3.50) with signature detection. Change this in the tab below (AWS Textract signature detection).
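As a rough illustration of the pricing above, the following sketch estimates the AWS Textract cost of a job. It assumes both quoted rates apply per 1,000 pages and are billed pro rata; the function and constant names are hypothetical and are not part of this app.

```python
# USD rates per 1,000 pages, copied from the help text above.
# Keyed by whether signature detection is enabled.
RATE_PER_1000_PAGES_USD = {False: 1.50, True: 3.50}


def estimate_textract_cost_usd(pages: int, signature_detection: bool = False) -> float:
    """Estimate the Textract cost in USD for a given page count.

    Assumption: cost scales linearly with pages (no free tier, no minimum).
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * RATE_PER_1000_PAGES_USD[signature_detection]


if __name__ == "__main__":
    for n in (40, 1000):
        print(n, "pages:", estimate_textract_cost_usd(n), "USD (default)")
```

Actual charges depend on the AWS account and region, so treat the result as an estimate only.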

Local model - selectable text
Local OCR model - PDFs without selectable text

Choose a local OCR model. "tesseract" is the default and will work for documents with clear typed text. "paddle" is more accurate for text extraction where the text is not clear or well-formatted, but it does not natively support word-level extraction, so word bounding boxes will be inaccurate. "vlm" will call the chosen vision language model (VLM) to return structured JSON output that is then parsed into word-level bounding boxes. "hybrid-paddle-vlm" is a combination of PaddleOCR with the chosen VLM.
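To illustrate why word boxes derived from line-level OCR output can be inaccurate, here is a minimal sketch that splits one line box into word boxes in proportion to character counts. This is an assumed approach shown for illustration only, not necessarily how this app handles "paddle" results; the function name and box layout are hypothetical.

```python
from typing import List, Tuple

# Box layout assumed here: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def split_line_box(text: str, box: Box) -> List[Tuple[str, Box]]:
    """Approximate word boxes by giving every character equal width."""
    words = text.split()
    if not words:
        return []
    x0, y0, x1, y1 = box
    # Count one space between each pair of adjacent words.
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = (x1 - x0) / total_chars
    result = []
    cursor = x0
    for word in words:
        word_x1 = cursor + len(word) * char_width
        result.append((word, (cursor, y0, word_x1, y1)))
        cursor = word_x1 + char_width  # step over the following space
    return result


if __name__ == "__main__":
    print(split_line_box("ab cde", (0, 0, 60, 10)))
```

Because real fonts are rarely fixed-width, boxes estimated this way drift from the true word positions, which is why the outputs are worth reviewing.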

tesseract
paddle
vlm
hybrid-paddle-vlm

Choose personal information detection method. The local model is lower quality but costs nothing - it may be worth a try if you are willing to spend some time reviewing outputs, or if you are only interested in searching for custom search terms (see Redaction settings - custom deny list). If shown, AWS Comprehend has a cost of around £0.0075 ($0.01) per 10,000 characters.

Only extract text (no redaction)
Local
Local transformers LLM

If you only want to redact certain pages, or certain entities (e.g. just email addresses, or a custom list of terms), please go to the Redaction Settings tab.

Output summary

Output files

Upload the original PDF or the '..._for_review.pdf' file to begin the review process.

Upload review files here to review suggested redactions: the 'review_file' csv. The 'ocr_results with words' file can also be provided for searching text and making new redactions.

Search for duplicate pages/subdocuments in your ocr_output files. By default, this function will search for duplicate text across multiple pages, and then join consecutive matching pages together into matched 'subdocuments'. The results can be reviewed below, false positives removed, and then the verified results applied to a document you have loaded in on the 'Review redactions' tab.
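The step of joining consecutive matching pages into 'subdocuments' can be sketched as follows. This is a simplified illustration that assumes page matches have already been found and are given as (page_a, page_b) pairs; the function name and the output layout are hypothetical and do not reflect this app's actual implementation.

```python
from typing import Dict, Iterable, List, Tuple


def merge_consecutive_matches(
    pairs: Iterable[Tuple[int, int]],
) -> List[Tuple[int, int, int]]:
    """Join matched page pairs into runs of (start_a, start_b, length).

    A pair (a, b) extends a run when (a - 1, b - 1) was also a match,
    i.e. both pages advance by one together.
    """
    runs: List[List[int]] = []
    # Maps the last pair of each open run to that run's index in `runs`.
    run_ending_at: Dict[Tuple[int, int], int] = {}
    # Sorting guarantees (a - 1, b - 1) is processed before (a, b).
    for a, b in sorted(set(pairs)):
        idx = run_ending_at.pop((a - 1, b - 1), None)
        if idx is None:
            runs.append([a, b, 1])
            idx = len(runs) - 1
        else:
            runs[idx][2] += 1
        run_ending_at[(a, b)] = idx
    return [(r[0], r[1], r[2]) for r in runs]


if __name__ == "__main__":
    matches = [(1, 10), (2, 11), (3, 12), (5, 20)]
    print(merge_consecutive_matches(matches))
```

In this sketch, pages 1 to 3 matching pages 10 to 12 are reported as one three-page subdocument, while the isolated match on page 5 stays a single-page result.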

Upload one or multiple 'ocr_output.csv' files to find duplicate pages and subdocuments

Analysis summary

Click on a row to select it for preview or exclusion.

Click a row in the table, then click this button to remove it from the results and update the downloadable files.

Full Text Preview of Selected Match

Downloadable Files

Download analysis summary and redaction lists (.csv)

Choose a Word or tabular data file (xlsx or csv) to redact. Note that when redacting complex Word files (e.g. files containing images), some content and formatting will be removed, and headers may not be redacted. You may prefer to convert the doc file to PDF in Word (Print to PDF in print settings), and then run it through the first tab of this app. Alternatively, an xlsx file output is provided when redacting docx files directly, so that outputs can be copied and pasted back into the original document if preferred.

Choose Excel or csv files

Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.

Local
Local transformers LLM

Output result

Output files

Import allow list file - a csv table with a single column, with one word/phrase per row (case insensitive). Terms in this file will not be redacted.

Custom allow list load status

Import custom deny list - a csv table with a single column, with one word/phrase per row (case insensitive). Terms in this file will always be redacted.
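A single-column list file of the kind described above could be produced with a short script like this one. It is a sketch under stated assumptions: the text does not say whether a header row is expected, so none is written here, and the function name and example file name are hypothetical.

```python
import csv
from typing import Iterable


def write_term_list(path: str, terms: Iterable[str]) -> int:
    """Write one term per row to a single-column csv; return the row count.

    Blank entries are skipped, and duplicates are dropped ignoring case,
    since matching is described as case insensitive.
    """
    seen = set()
    rows = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            rows.append([cleaned])
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)


if __name__ == "__main__":
    count = write_term_list("deny_list.csv", ["Jane Doe", "jane doe", "Project X"])
    print(count, "terms written")
```

The same layout would serve for an allow list; only the file's purpose on upload differs.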

Custom deny list load status

Import fully redacted pages list - a csv table with a single column, with one page number per row. Pages listed in this file will be fully redacted.

Fully redacted page list load status

Document redaction

Matching Strategy

Duplicate Analysis Results

Remove Duplicate Rows