ASUCICREPO · TenthEdict · Jan 27, 2026
diff --git a/README.md b/README.md
@@ -1,11 +1,12 @@
 # PDF Accessibility Solutions
 
-This repository provides two complementary solutions for PDF accessibility:
+This repository provides multiple complementary solutions for PDF accessibility:
 
 1. **PDF-to-PDF Remediation**: Processes PDFs and maintains the PDF format while improving accessibility.
 2. **PDF-to-HTML Remediation**: Converts PDFs to accessible HTML format.
+3. **Local Batch Processor**: Offline batch processing for OCR and PDF/UA-1 preparation (see [local_batch_processor/](local_batch_processor/README.md)).
 
-Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards.
+The AWS-based solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards. The local batch processor provides offline processing capabilities for pre-processing, testing, or environments without AWS access.
 
 ## Table of Contents
 
@@ -16,6 +17,7 @@ Both solutions leverage AWS services and generative AI to improve content access
 | [Testing Your PDF Accessibility Solution](#testing-your-pdf-accessibility-solution) | User guide for the working solution                     |
 | [PDF-to-PDF Remediation Solution](#pdf-to-pdf-remediation-solution)                 | PDF format preservation solution details                |
 | [PDF-to-HTML Remediation Solution](#pdf-to-html-remediation-solution)               | HTML conversion solution details                        |
+| [Local Batch Processor](#local-batch-processor)                                     | Offline batch processing for OCR and PDF/UA preparation |
 | [Monitoring](#monitoring)                                                           | System monitoring and observability                     |
 | [Troubleshooting](#troubleshooting)                                                 | Common issues and solutions                             |
 | [Contributing](#contributing)                                                       | How to contribute to the project                        |
@@ -184,6 +186,41 @@ This solution converts PDF documents to accessible HTML format while preserving
 - **ECR Repository**: Hosts the Docker image for Lambda
 - **Bedrock Data Automation**: Provides PDF parsing and extraction capabilities
 
+## Local Batch Processor
+
+### Overview
+
+The local batch processor provides offline batch processing capabilities for PDF accessibility enhancement. It's designed to complement the AWS-based solutions by enabling:
+
+- **Offline processing** without AWS infrastructure
+- **Pre-processing** before cloud upload
+- **Development/testing** workflows
+- **High-volume batch jobs** with folder structure preservation
+
+### Features
+
+- **OCR Enhancement**: Adds invisible searchable text layers using Tesseract
+- **PDF/UA-1 Preparation**: Adds compliance metadata and markers
+- **Batch Processing**: Process directory trees with structure preservation
+- **Parallel Processing**: Multi-threaded for faster throughput
+- **Progress Tracking**: Visual progress bar and JSON summary reports
+
+### Quick Start
+
+```bash
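+# Install dependencies (from the repository root, so that
+# `python -m local_batch_processor.cli` below can resolve the package)
+pip install -r local_batch_processor/requirements.txt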
+
+# Process a single file
+python -m local_batch_processor.cli process input.pdf output.pdf
+
+# Batch process a directory (4 parallel workers)
+python -m local_batch_processor.cli batch input_folder/ output_folder/ --workers 4
+```
+
+For detailed documentation, see [local_batch_processor/README.md](local_batch_processor/README.md).
+
 ## Monitoring
 
 ### PDF-to-PDF Solution

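Note on "folder structure preservation": the README section above states that batch jobs keep the input directory layout in the output. A minimal sketch of how such a mapping could be computed is shown here. The helper names `mirror_output_path` and `plan_batch` are hypothetical and are not taken from this PR; the actual `BatchProcessor` may do this differently.

```python
from pathlib import Path


def mirror_output_path(input_file: Path, input_root: Path, output_root: Path) -> Path:
    """Map a file under input_root to the same relative location under output_root.

    Hypothetical helper for illustration; not part of local_batch_processor.
    """
    # relative_to raises ValueError if input_file is not inside input_root.
    relative = input_file.relative_to(input_root)
    return output_root / relative


def plan_batch(input_root: Path, output_root: Path) -> dict:
    """Pair every PDF found recursively under input_root with its mirrored output path."""
    return {
        pdf: mirror_output_path(pdf, input_root, output_root)
        for pdf in sorted(input_root.rglob("*.pdf"))
    }
```

For example, `mirror_output_path(Path("in/a/b.pdf"), Path("in"), Path("out"))` yields `Path("out/a/b.pdf")`. Computing the relative path once and re-rooting it keeps nested subfolders intact without any string manipulation.
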
diff --git a/local_batch_processor/README.md b/local_batch_processor/README.md
@@ -0,0 +1,202 @@
+# Local Batch Processor for PDF Accessibility
+
+A local/offline batch processing tool for PDF accessibility enhancement. This module complements the AWS-based PDF accessibility solution by enabling:
+
+- **Offline processing** without AWS infrastructure
+- **Pre-processing** before cloud upload
+- **Development/testing** workflows
+- **High-volume batch processing** with folder structure preservation
+
+## Features
+
+- **OCR Enhancement**: Adds invisible searchable text layers using Tesseract (via ocrmypdf)
+- **PDF/UA-1 Preparation**: Adds compliance metadata and markers for accessibility
+- **Batch Processing**: Process entire directory trees with folder structure preservation
+- **Progress Tracking**: Visual progress bar with tqdm
+- **Parallel Processing**: Multi-threaded processing for faster throughput
+- **Summary Reports**: JSON reports with processing statistics
+
+## Installation
+
+### Prerequisites
+
+1. **Python 3.8+**
+2. **Tesseract OCR** (system dependency)
+
+   ```bash
+   # macOS
+   brew install tesseract
+
+   # Ubuntu/Debian
+   sudo apt-get install tesseract-ocr
+
+   # Windows
+   # Download from: https://github.com/UB-Mannheim/tesseract/wiki
+   ```
+
+3. **Ghostscript** (required by ocrmypdf)
+
+   ```bash
+   # macOS
+   brew install ghostscript
+
+   # Ubuntu/Debian
+   sudo apt-get install ghostscript
+   ```
+
+### Python Dependencies
+
+```bash
+# Run from the repository root so the `python -m` commands under Usage can resolve the package
+pip install -r local_batch_processor/requirements.txt
+```
+
+## Usage
+
+### Command Line Interface
+
+**Process a single PDF:**
+
+```bash
+python -m local_batch_processor.cli process input.pdf output.pdf
+```
+
+**Batch process a directory:**
+
+```bash
+python -m local_batch_processor.cli batch input_folder/ output_folder/
+```
+
+**With options:**
+
+```bash
+# Process with 4 parallel workers
+python -m local_batch_processor.cli batch input/ output/ --workers 4
+
+# Skip OCR (only apply PDF/UA metadata)
+python -m local_batch_processor.cli batch input/ output/ --skip-ocr
+
+# Force OCR even if text exists
+python -m local_batch_processor.cli batch input/ output/ --force-ocr
+
+# Set custom DPI for OCR
+python -m local_batch_processor.cli batch input/ output/ --dpi 400
+
+# Use different OCR language
+python -m local_batch_processor.cli batch input/ output/ --ocr-lang deu
+```
+
+### Python API
+
+```python
+from local_batch_processor import BatchProcessor, EnhancementService
+
+# Single file processing
+service = EnhancementService(text_threshold=100, dpi=300)
+success = service.enhance_document(
+    input_path="input.pdf",
+    output_path="output.pdf",
+    title="My Document",
+    author="Author Name",
+    language="en-US"
+)
+
+# Batch processing
+processor = BatchProcessor(text_threshold=100, dpi=300)
+summary = processor.process_batch(
+    input_dir="./pdfs",
+    output_dir="./enhanced",
+    workers=4,
+    recursive=True
+)
+
+print(f"Processed: {summary['processed']}/{summary['total_files']}")
+print(f"Failed: {summary['failed']}")
+```
+
+## Processing Pipeline
+
+1. **OCR Enhancement** (if needed)
+   - Analyzes PDF text content
+   - Applies OCR using the sandwich renderer (invisible text layer behind the visible content)
+   - Normalizes non-standard page boxes for accurate text positioning
+
+2. **PDF/UA-1 Preparation**
+   - Strips orphan tags that interfere with accessibility tools
+   - Adds PDF/UA-1 compliance metadata
+   - Sets document properties (title, author, language)
+   - Marks document for manual tagging workflow
+
+## Output Structure
+
+```
+output_folder/
+├── subfolder1/
+│   ├── document1.pdf
+│   └── document2.pdf
+├── subfolder2/
+│   └── document3.pdf
+└── batch_processing_summary.json
+```
+
+The folder structure from the input directory is preserved in the output.
+
+## Summary Report
+
+After batch processing, a `batch_processing_summary.json` file is created:
+
+```json
+{
+  "success": true,
+  "total_files": 100,
+  "processed": 98,
+  "failed": 2,
+  "total_duration": 1234.5,
+  "avg_duration_per_file": 12.3,
+  "successful_files": ["file1.pdf", "file2.pdf", ...],
+  "failed_files": [
+    {"file": "bad.pdf", "error": "Encrypted PDF"},
+    {"file": "corrupt.pdf", "error": "Invalid PDF structure"}
+  ],
+  "timestamp": "2024-01-15T10:30:00"
+}
+```
+
+## Integration with AWS Solution
+
+This local batch processor can be used alongside the AWS-based solution:
+
+1. **Pre-processing**: Process PDFs locally before uploading to S3
+2. **Testing**: Verify accessibility enhancements locally before cloud deployment
+3. **Offline workflow**: Process PDFs when AWS infrastructure is not available
+4. **High-volume batch jobs**: Process large collections locally with parallel workers
+
+## Troubleshooting
+
+### "ocrmypdf is not installed"
+
+Install Tesseract OCR (see Prerequisites for other platforms) and the `ocrmypdf` Python package:
+
+```bash
+# Install Tesseract (system)
+brew install tesseract  # macOS
+
+# Install Python package
+pip install ocrmypdf
+```
+
+### "Cannot process encrypted PDF"
+
+The processor cannot handle password-protected PDFs. Remove protection before processing.
+
+### "OCR text positioning is incorrect"
+
+Use the `--force-ocr` flag to regenerate the text layer with corrected positioning:
+
+```bash
+python -m local_batch_processor.cli process input.pdf output.pdf --force-ocr
+```
+
+## License
+
+This module is part of the PDF Accessibility Solutions project. See the main repository LICENSE for details.
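
Note on the summary report: the JSON layout documented in the README above is easy to consume from a script, for example to list failures after a large run. The sketch below assumes only the keys shown in that sample (`total_files`, `processed`, `failed`, `failed_files`). The function names `load_summary` and `describe_summary` are hypothetical and not part of this PR.

```python
import json
from pathlib import Path


def load_summary(output_dir):
    """Read batch_processing_summary.json from a batch output directory."""
    report_path = Path(output_dir) / "batch_processing_summary.json"
    with report_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_summary(summary):
    """Return readable lines: one headline plus one line per failed file."""
    total = summary["total_files"]
    processed = summary["processed"]
    # Guard against an empty batch to avoid dividing by zero.
    rate = (processed / total * 100) if total else 0.0
    lines = [f"{processed}/{total} processed ({rate:.1f}%), {summary['failed']} failed"]
    for entry in summary.get("failed_files", []):
        lines.append(f"  {entry['file']}: {entry['error']}")
    return lines
```

A typical use would be `print("\n".join(describe_summary(load_summary("output_folder"))))` once a batch run has finished.
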
diff --git a/local_batch_processor/__init__.py b/local_batch_processor/__init__.py
@@ -0,0 +1,26 @@
+"""
+Local Batch Processor for PDF Accessibility Enhancement.
+
+This module provides local/offline batch processing capabilities
+complementing the AWS-based PDF accessibility solution.
+
+Features:
+- Recursive directory processing with folder structure preservation
+- OCR enhancement with Tesseract (via ocrmypdf)
+- PDF/UA-1 compliance preparation
+- Progress tracking and parallel processing
+- JSON summary reports
+"""
+
+from .batch_processor import BatchProcessor
+from .enhancement_service import EnhancementService
+from .ocr_enhancer import OCREnhancer
+from .pdfua_enhancer import PDFUAEnhancer
+
+__version__ = "1.0.0"
+__all__ = [
+    "BatchProcessor",
+    "EnhancementService",
+    "OCREnhancer",
+    "PDFUAEnhancer",
+]
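
Note on parallel processing and summary reports: the package docstring above lists both features, and the README describes the processing as multi-threaded. As a rough illustration of that pattern, the sketch below runs a per-file callable in a thread pool and collects results under the key names from the README's sample report. `run_batch` and its `process_one` parameter are hypothetical; how `BatchProcessor.process_batch` is actually implemented is not shown in this section, and the README's `success` key is omitted here because its meaning is not specified.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime


def run_batch(files, process_one, workers=4):
    """Run process_one(name) for each file in a thread pool and build a summary dict.

    Hypothetical sketch; not the implementation in local_batch_processor.
    """
    files = list(files)
    started = time.monotonic()
    successful, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, name): name for name in files}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                successful.append(name)
            except Exception as exc:
                # Record the failure and continue: one bad file should not abort the batch.
                failed.append({"file": name, "error": str(exc)})
    duration = time.monotonic() - started
    return {
        "total_files": len(files),
        "processed": len(successful),
        "failed": len(failed),
        "total_duration": duration,
        "avg_duration_per_file": duration / len(files) if files else 0.0,
        "successful_files": successful,
        "failed_files": failed,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
```

Because results are gathered with `as_completed`, the order of `successful_files` reflects completion order rather than input order; sort the list if a stable order matters.
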