diff --git a/README.md b/README.md index d5fa19f..4b8dd2d 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,12 @@ # PDF Accessibility Solutions -This repository provides two complementary solutions for PDF accessibility: +This repository provides multiple complementary solutions for PDF accessibility: 1. **PDF-to-PDF Remediation**: Processes PDFs and maintains the PDF format while improving accessibility. 2. **PDF-to-HTML Remediation**: Converts PDFs to accessible HTML format. +3. **Local Batch Processor**: Offline batch processing for OCR and PDF/UA-1 preparation (see [local_batch_processor/](local_batch_processor/README.md)). -Both solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards. +The AWS-based solutions leverage AWS services and generative AI to improve content accessibility according to WCAG 2.1 Level AA standards. The local batch processor provides offline processing capabilities for pre-processing, testing, or environments without AWS access. ## Table of Contents @@ -16,6 +17,7 @@ Both solutions leverage AWS services and generative AI to improve content access | [Testing Your PDF Accessibility Solution](#testing-your-pdf-accessibility-solution) | User guide for the working solution | | [PDF-to-PDF Remediation Solution](#pdf-to-pdf-remediation-solution) | PDF format preservation solution details | | [PDF-to-HTML Remediation Solution](#pdf-to-html-remediation-solution) | HTML conversion solution details | +| [Local Batch Processor](#local-batch-processor) | Offline batch processing for OCR and PDF/UA preparation | | [Monitoring](#monitoring) | System monitoring and observability | | [Troubleshooting](#troubleshooting) | Common issues and solutions | | [Contributing](#contributing) | How to contribute to the project | @@ -184,6 +186,41 @@ This solution converts PDF documents to accessible HTML format while preserving - **ECR Repository**: Hosts the Docker image for Lambda - **Bedrock Data Automation**: Provides PDF parsing and extraction capabilities +## Local Batch Processor + +### Overview + +The local batch processor provides offline batch processing capabilities for PDF accessibility enhancement. It's designed to complement the AWS-based solutions by enabling: + +- **Offline processing** without AWS infrastructure +- **Pre-processing** before cloud upload +- **Development/testing** workflows +- **High-volume batch jobs** with folder structure preservation + +### Features + +- **OCR Enhancement**: Adds invisible searchable text layers using Tesseract +- **PDF/UA-1 Preparation**: Adds compliance metadata and markers +- **Batch Processing**: Process directory trees with structure preservation +- **Parallel Processing**: Multi-threaded for faster throughput +- **Progress Tracking**: Visual progress bar and JSON summary reports + +### Quick Start + +```bash +# Install dependencies +cd local_batch_processor +pip install -r requirements.txt + +# Process a single file +python -m local_batch_processor.cli process input.pdf output.pdf + +# Batch process a directory (4 parallel workers) +python -m local_batch_processor.cli batch input_folder/ output_folder/ --workers 4 +``` + +For detailed documentation, see [local_batch_processor/README.md](local_batch_processor/README.md). 
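+### Combining with the AWS Solutions
+
+The batch processor can also act as a local pre-processing step in front of either AWS-based pipeline. The sketch below is illustrative rather than prescriptive: `my-remediation-input-bucket` and the `uploads/` prefix are placeholders for the bucket and prefix created by your own deployment, and it assumes `boto3` is installed and configured with credentials.
+
+```python
+from pathlib import Path
+
+import boto3
+
+from local_batch_processor import BatchProcessor
+
+# Run OCR and PDF/UA-1 preparation locally first.
+processor = BatchProcessor(text_threshold=100, dpi=300)
+summary = processor.process_batch("./incoming", "./enhanced", workers=4)
+print(f"Enhanced {summary.get('processed', 0)}/{summary.get('total_files', 0)} files locally")
+
+# Hand the enhanced PDFs to the cloud remediation pipeline.
+# Replace the bucket name and prefix with the values from your deployment.
+s3 = boto3.client("s3")
+for pdf in Path("./enhanced").rglob("*.pdf"):
+    s3.upload_file(str(pdf), "my-remediation-input-bucket", f"uploads/{pdf.name}")
+```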
+ ## Monitoring ### PDF-to-PDF Solution diff --git a/local_batch_processor/README.md b/local_batch_processor/README.md new file mode 100644 index 0000000..6bae27d --- /dev/null +++ b/local_batch_processor/README.md @@ -0,0 +1,202 @@ +# Local Batch Processor for PDF Accessibility + +A local/offline batch processing tool for PDF accessibility enhancement. This module complements the AWS-based PDF accessibility solution by enabling: + +- **Offline processing** without AWS infrastructure +- **Pre-processing** before cloud upload +- **Development/testing** workflows +- **High-volume batch processing** with folder structure preservation + +## Features + +- **OCR Enhancement**: Adds invisible searchable text layers using Tesseract (via ocrmypdf) +- **PDF/UA-1 Preparation**: Adds compliance metadata and markers for accessibility +- **Batch Processing**: Process entire directory trees with folder structure preservation +- **Progress Tracking**: Visual progress bar with tqdm +- **Parallel Processing**: Multi-threaded processing for faster throughput +- **Summary Reports**: JSON reports with processing statistics + +## Installation + +### Prerequisites + +1. **Python 3.8+** +2. **Tesseract OCR** (system dependency) + + ```bash + # macOS + brew install tesseract + + # Ubuntu/Debian + sudo apt-get install tesseract-ocr + + # Windows + # Download from: https://github.com/UB-Mannheim/tesseract/wiki + ``` + +3. **Ghostscript** (required by ocrmypdf) + + ```bash + # macOS + brew install ghostscript + + # Ubuntu/Debian + sudo apt-get install ghostscript + ``` + +### Python Dependencies + +```bash +cd local_batch_processor +pip install -r requirements.txt +``` + +## Usage + +### Command Line Interface + +**Process a single PDF:** + +```bash +python -m local_batch_processor.cli process input.pdf output.pdf +``` + +**Batch process a directory:** + +```bash +python -m local_batch_processor.cli batch input_folder/ output_folder/ +``` + +**With options:** + +```bash +# Process with 4 parallel workers +python -m local_batch_processor.cli batch input/ output/ --workers 4 + +# Skip OCR (only apply PDF/UA metadata) +python -m local_batch_processor.cli batch input/ output/ --skip-ocr + +# Force OCR even if text exists +python -m local_batch_processor.cli batch input/ output/ --force-ocr + +# Set custom DPI for OCR +python -m local_batch_processor.cli batch input/ output/ --dpi 400 + +# Use different OCR language +python -m local_batch_processor.cli batch input/ output/ --ocr-lang deu +``` + +### Python API + +```python +from local_batch_processor import BatchProcessor, EnhancementService + +# Single file processing +service = EnhancementService(text_threshold=100, dpi=300) +success = service.enhance_document( + input_path="input.pdf", + output_path="output.pdf", + title="My Document", + author="Author Name", + language="en-US" +) + +# Batch processing +processor = BatchProcessor(text_threshold=100, dpi=300) +summary = processor.process_batch( + input_dir="./pdfs", + output_dir="./enhanced", + workers=4, + recursive=True +) + +print(f"Processed: {summary['processed']}/{summary['total_files']}") +print(f"Failed: {summary['failed']}") +``` + +## Processing Pipeline + +1. **OCR Enhancement** (if needed) + - Analyzes PDF text content + - Applies OCR using sandwich renderer (invisible text behind visible content) + - Normalizes non-standard page boxes for accurate text positioning + +2. 
**PDF/UA-1 Preparation** + - Strips orphan tags that interfere with accessibility tools + - Adds PDF/UA-1 compliance metadata + - Sets document properties (title, author, language) + - Marks document for manual tagging workflow + +## Output Structure + +``` +output_folder/ +├── subfolder1/ +│ ├── document1.pdf +│ └── document2.pdf +├── subfolder2/ +│ └── document3.pdf +└── batch_processing_summary.json +``` + +The folder structure from the input directory is preserved in the output. + +## Summary Report + +After batch processing, a `batch_processing_summary.json` file is created: + +```json +{ + "success": true, + "total_files": 100, + "processed": 98, + "failed": 2, + "total_duration": 1234.5, + "avg_duration_per_file": 12.3, + "successful_files": ["file1.pdf", "file2.pdf", ...], + "failed_files": [ + {"file": "bad.pdf", "error": "Encrypted PDF"}, + {"file": "corrupt.pdf", "error": "Invalid PDF structure"} + ], + "timestamp": "2024-01-15T10:30:00" +} +``` + +## Integration with AWS Solution + +This local batch processor can be used alongside the AWS-based solution: + +1. **Pre-processing**: Process PDFs locally before uploading to S3 +2. **Testing**: Verify accessibility enhancements locally before cloud deployment +3. **Offline workflow**: Process PDFs when AWS infrastructure is not available +4. **High-volume batch jobs**: Process large collections locally with parallel workers + +## Troubleshooting + +### "ocrmypdf is not installed" + +Install Tesseract OCR and the Python package: + +```bash +# Install Tesseract (system) +brew install tesseract # macOS + +# Install Python package +pip install ocrmypdf +``` + +### "Cannot process encrypted PDF" + +The processor cannot handle password-protected PDFs. Remove protection before processing. + +### "OCR text positioning is incorrect" + +Use the `--force-ocr` flag to regenerate the text layer with corrected positioning: + +```bash +python -m local_batch_processor.cli process input.pdf output.pdf --force-ocr +``` + +## License + +This module is part of the PDF Accessibility Solutions project. See the main repository LICENSE for details. diff --git a/local_batch_processor/__init__.py b/local_batch_processor/__init__.py new file mode 100644 index 0000000..b615c10 --- /dev/null +++ b/local_batch_processor/__init__.py @@ -0,0 +1,26 @@ +""" +Local Batch Processor for PDF Accessibility Enhancement. + +This module provides local/offline batch processing capabilities +complementing the AWS-based PDF accessibility solution. + +Features: +- Recursive directory processing with folder structure preservation +- OCR enhancement with Tesseract (via ocrmypdf) +- PDF/UA-1 compliance preparation +- Progress tracking and parallel processing +- JSON summary reports +""" + +from .batch_processor import BatchProcessor +from .enhancement_service import EnhancementService +from .ocr_enhancer import OCREnhancer +from .pdfua_enhancer import PDFUAEnhancer + +__version__ = "1.0.0" +__all__ = [ + "BatchProcessor", + "EnhancementService", + "OCREnhancer", + "PDFUAEnhancer", +] diff --git a/local_batch_processor/batch_processor.py b/local_batch_processor/batch_processor.py new file mode 100644 index 0000000..b8a9466 --- /dev/null +++ b/local_batch_processor/batch_processor.py @@ -0,0 +1,313 @@ +#!/usr/bin/env python3 +""" +Batch Processor for PDF Accessibility Enhancement. 
+ +Provides batch processing capabilities with: +- Recursive directory walking +- Folder structure preservation +- Progress tracking +- Parallel processing support +- Summary reporting +""" + +import logging +import json +from pathlib import Path +from typing import Union, Dict, List +from datetime import datetime +from concurrent.futures import ThreadPoolExecutor, as_completed + +try: + from tqdm import tqdm + TQDM_AVAILABLE = True +except ImportError: + TQDM_AVAILABLE = False + +from .enhancement_service import EnhancementService + +logger = logging.getLogger(__name__) + + +class BatchProcessor: + """ + Batch processor for PDF documents. + + Features: + - Recursive directory walking + - Preserves folder structure in output + - Progress bar (if tqdm available) + - Parallel processing (ThreadPoolExecutor) + - JSON summary report + """ + + def __init__( + self, + text_threshold: int = 100, + dpi: int = 300, + language: str = 'eng' + ): + """ + Initialize the batch processor. + + Args: + text_threshold: Minimum chars/page to skip OCR + dpi: DPI for OCR processing + language: OCR language code + """ + self.service = EnhancementService( + text_threshold=text_threshold, + dpi=dpi, + language=language + ) + + def process_batch( + self, + input_dir: Union[str, Path], + output_dir: Union[str, Path], + workers: int = 1, + recursive: bool = True, + skip_ocr: bool = False, + force_ocr: bool = False, + **kwargs + ) -> Dict: + """ + Process all PDFs in input directory. + + Args: + input_dir: Input directory containing PDFs + output_dir: Output directory for enhanced PDFs + workers: Number of parallel workers (1 = sequential) + recursive: Process subdirectories recursively + skip_ocr: Skip OCR step for all files + force_ocr: Force OCR even if text exists + + Returns: + Dict: Summary report with processing statistics + """ + try: + input_dir = Path(input_dir) + output_dir = Path(output_dir) + + if not input_dir.exists(): + logger.error(f"Input directory not found: {input_dir}") + return {"success": False, "error": "Input directory not found"} + + output_dir.mkdir(parents=True, exist_ok=True) + + # Find all PDFs + pdf_files = self._find_pdf_files(input_dir, recursive) + + if not pdf_files: + logger.warning(f"No PDF files found in: {input_dir}") + return { + "success": True, + "total_files": 0, + "processed": 0, + "failed": 0, + "message": "No PDF files found" + } + + logger.info(f"Found {len(pdf_files)} PDF files") + logger.info(f"Using {workers} worker(s)") + + start_time = datetime.now() + + if workers > 1: + results = self._process_parallel( + pdf_files, input_dir, output_dir, + workers, skip_ocr, force_ocr, **kwargs + ) + else: + results = self._process_sequential( + pdf_files, input_dir, output_dir, + skip_ocr, force_ocr, **kwargs + ) + + end_time = datetime.now() + duration = (end_time - start_time).total_seconds() + + summary = self._generate_summary(results, duration) + self._save_summary(output_dir, summary) + + logger.info( + f"Batch complete: {summary['processed']}/{summary['total_files']} " + f"successful" + ) + return summary + + except Exception as e: + logger.error(f"Error in batch processing: {e}") + import traceback + traceback.print_exc() + return {"success": False, "error": str(e)} + + def _find_pdf_files(self, directory: Path, recursive: bool) -> List[Path]: + """Find all PDF files in directory.""" + if recursive: + return sorted(directory.rglob("*.pdf")) + else: + return sorted(directory.glob("*.pdf")) + + def _get_output_path( + self, + input_file: Path, + input_dir: Path, + 
output_dir: Path + ) -> Path: + """Calculate output path preserving folder structure.""" + relative_path = input_file.relative_to(input_dir) + return output_dir / relative_path + + def _process_sequential( + self, + pdf_files: List[Path], + input_dir: Path, + output_dir: Path, + skip_ocr: bool, + force_ocr: bool, + **kwargs + ) -> List[Dict]: + """Process files sequentially with progress bar.""" + results = [] + + if TQDM_AVAILABLE: + iterator = tqdm(pdf_files, desc="Processing PDFs", unit="file") + else: + iterator = pdf_files + logger.info("Processing files (install tqdm for progress bar)") + + for pdf_file in iterator: + result = self._process_single_file( + pdf_file, input_dir, output_dir, + skip_ocr, force_ocr, **kwargs + ) + results.append(result) + + return results + + def _process_parallel( + self, + pdf_files: List[Path], + input_dir: Path, + output_dir: Path, + workers: int, + skip_ocr: bool, + force_ocr: bool, + **kwargs + ) -> List[Dict]: + """Process files in parallel.""" + results = [] + + with ThreadPoolExecutor(max_workers=workers) as executor: + future_to_file = { + executor.submit( + self._process_single_file, + pdf_file, input_dir, output_dir, + skip_ocr, force_ocr, **kwargs + ): pdf_file + for pdf_file in pdf_files + } + + if TQDM_AVAILABLE: + iterator = tqdm( + as_completed(future_to_file), + total=len(pdf_files), + desc="Processing PDFs", + unit="file" + ) + else: + iterator = as_completed(future_to_file) + logger.info(f"Processing {len(pdf_files)} files with {workers} workers") + + for future in iterator: + result = future.result() + results.append(result) + + return results + + def _process_single_file( + self, + input_file: Path, + input_dir: Path, + output_dir: Path, + skip_ocr: bool, + force_ocr: bool, + **kwargs + ) -> Dict: + """Process a single PDF file.""" + start_time = datetime.now() + + try: + output_path = self._get_output_path(input_file, input_dir, output_dir) + output_path.parent.mkdir(parents=True, exist_ok=True) + + success = self.service.enhance_document( + input_file, + output_path, + skip_ocr=skip_ocr, + force_ocr=force_ocr, + **kwargs + ) + + end_time = datetime.now() + duration = (end_time - start_time).total_seconds() + + return { + "input_file": str(input_file), + "output_file": str(output_path), + "success": success, + "duration": duration, + "error": None + } + + except Exception as e: + end_time = datetime.now() + duration = (end_time - start_time).total_seconds() + + logger.error(f"Error processing {input_file}: {e}") + + return { + "input_file": str(input_file), + "output_file": None, + "success": False, + "duration": duration, + "error": str(e) + } + + def _generate_summary(self, results: List[Dict], duration: float) -> Dict: + """Generate summary report from processing results.""" + total_files = len(results) + successful = sum(1 for r in results if r["success"]) + failed = total_files - successful + + avg_duration = ( + sum(r["duration"] for r in results) / total_files + if total_files > 0 else 0 + ) + + return { + "success": True, + "total_files": total_files, + "processed": successful, + "failed": failed, + "total_duration": duration, + "avg_duration_per_file": avg_duration, + "successful_files": [ + r["input_file"] for r in results if r["success"] + ], + "failed_files": [ + {"file": r["input_file"], "error": r["error"]} + for r in results if not r["success"] + ], + "timestamp": datetime.now().isoformat() + } + + def _save_summary(self, output_dir: Path, summary: Dict) -> None: + """Save summary report to JSON file.""" + try: + 
summary_file = output_dir / "batch_processing_summary.json" + with open(summary_file, 'w') as f: + json.dump(summary, f, indent=2) + logger.info(f"Saved summary: {summary_file}") + except Exception as e: + logger.warning(f"Could not save summary: {e}") diff --git a/local_batch_processor/cli.py b/local_batch_processor/cli.py new file mode 100644 index 0000000..0f4d6ce --- /dev/null +++ b/local_batch_processor/cli.py @@ -0,0 +1,246 @@ +#!/usr/bin/env python3 +""" +CLI for Local Batch PDF Accessibility Enhancement. + +Provides command-line interface for: +- Single file processing +- Batch processing with folder structure preservation +""" + +import logging +import sys +from pathlib import Path +from typing import Optional + +try: + import typer + from rich.console import Console + from rich.logging import RichHandler + RICH_AVAILABLE = True +except ImportError: + RICH_AVAILABLE = False + import argparse + +from .enhancement_service import EnhancementService +from .batch_processor import BatchProcessor + +if RICH_AVAILABLE: + app = typer.Typer( + name="pdf-batch", + help="Local Batch PDF Accessibility Enhancement Tool", + add_completion=False + ) + console = Console() + + +def setup_logging(verbose: bool = False): + """Setup logging configuration.""" + level = logging.DEBUG if verbose else logging.INFO + + if RICH_AVAILABLE: + logging.basicConfig( + level=level, + format="%(message)s", + datefmt="[%X]", + handlers=[RichHandler(rich_tracebacks=True, console=console)] + ) + else: + logging.basicConfig( + level=level, + format="%(asctime)s - %(levelname)s - %(message)s" + ) + + +if RICH_AVAILABLE: + @app.command() + def process( + input_path: Path = typer.Argument(..., help="Input PDF file"), + output_path: Path = typer.Argument(..., help="Output PDF file"), + title: Optional[str] = typer.Option(None, "--title", "-t", help="Document title"), + author: Optional[str] = typer.Option(None, "--author", "-a", help="Document author"), + language: str = typer.Option("en-US", "--language", "-l", help="Document language"), + skip_ocr: bool = typer.Option(False, "--skip-ocr", help="Skip OCR processing"), + force_ocr: bool = typer.Option(False, "--force-ocr", help="Force OCR even if text exists"), + text_threshold: int = typer.Option(100, "--text-threshold", help="Min chars/page to skip OCR"), + dpi: int = typer.Option(300, "--dpi", help="DPI for OCR"), + ocr_language: str = typer.Option("eng", "--ocr-lang", help="Tesseract language code"), + verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose logging"), + ): + """Process a single PDF file.""" + setup_logging(verbose) + + console.print("\n[bold]PDF Accessibility Enhancement[/bold]") + console.print(f"Input: {input_path}") + console.print(f"Output: {output_path}\n") + + if not input_path.exists(): + console.print(f"[red]Error: Input file not found: {input_path}[/red]") + raise typer.Exit(code=1) + + try: + service = EnhancementService( + text_threshold=text_threshold, + dpi=dpi, + language=ocr_language + ) + + success = service.enhance_document( + input_path=input_path, + output_path=output_path, + title=title, + author=author, + language=language, + skip_ocr=skip_ocr, + force_ocr=force_ocr + ) + + if success: + console.print(f"\n[green]✓ Success![/green] Saved to: {output_path}") + raise typer.Exit(code=0) + else: + console.print("\n[red]✗ Enhancement failed[/red]") + raise typer.Exit(code=1) + + except typer.Exit: + raise + except Exception as e: + console.print(f"\n[red]✗ Error:[/red] {e}") + raise typer.Exit(code=1) + + @app.command() + 
def batch( + input_dir: Path = typer.Argument(..., help="Input directory"), + output_dir: Path = typer.Argument(..., help="Output directory"), + workers: int = typer.Option(1, "--workers", "-w", help="Parallel workers (1=sequential)"), + recursive: bool = typer.Option(True, "--recursive/--no-recursive", help="Process subdirs"), + skip_ocr: bool = typer.Option(False, "--skip-ocr", help="Skip OCR for all files"), + force_ocr: bool = typer.Option(False, "--force-ocr", help="Force OCR"), + text_threshold: int = typer.Option(100, "--text-threshold", help="Min chars/page"), + dpi: int = typer.Option(300, "--dpi", help="DPI for OCR"), + ocr_language: str = typer.Option("eng", "--ocr-lang", help="Tesseract language"), + verbose: bool = typer.Option(False, "--verbose", "-v", help="Verbose logging"), + ): + """Process multiple PDF files preserving folder structure.""" + setup_logging(verbose) + + console.print("\n[bold]Batch PDF Processing[/bold]") + console.print(f"Input: {input_dir}") + console.print(f"Output: {output_dir}") + console.print(f"Workers: {workers}") + console.print(f"Recursive: {recursive}\n") + + if not input_dir.exists(): + console.print(f"[red]Error: Input directory not found: {input_dir}[/red]") + raise typer.Exit(code=1) + + try: + processor = BatchProcessor( + text_threshold=text_threshold, + dpi=dpi, + language=ocr_language + ) + + summary = processor.process_batch( + input_dir=input_dir, + output_dir=output_dir, + workers=workers, + recursive=recursive, + skip_ocr=skip_ocr, + force_ocr=force_ocr + ) + + if summary.get("success"): + console.print("\n[bold]Summary:[/bold]") + console.print(f" Total: {summary['total_files']}") + console.print(f" [green]Success:[/green] {summary['processed']}") + console.print(f" [red]Failed:[/red] {summary['failed']}") + console.print(f" Duration: {summary['total_duration']:.1f}s") + console.print(f" Avg/file: {summary['avg_duration_per_file']:.1f}s") + + if summary['failed'] > 0: + console.print("\n[yellow]Failed files:[/yellow]") + for failed in summary['failed_files']: + console.print(f" - {failed['file']}: {failed['error']}") + + console.print(f"\n[green]✓ Complete![/green]") + console.print(f"Summary: {output_dir}/batch_processing_summary.json") + + raise typer.Exit(code=0 if summary['failed'] == 0 else 1) + else: + console.print(f"\n[red]✗ Failed:[/red] {summary.get('error')}") + raise typer.Exit(code=1) + + except typer.Exit: + raise + except Exception as e: + console.print(f"\n[red]✗ Error:[/red] {e}") + raise typer.Exit(code=1) + + @app.command() + def version(): + """Display version information.""" + console.print("\n[bold]Local Batch PDF Accessibility Enhancement[/bold]") + console.print("Version: 1.0.0") + console.print("\nFeatures:") + console.print(" - OCR with Tesseract (ocrmypdf)") + console.print(" - PDF/UA-1 compliance preparation") + console.print(" - Batch processing with folder preservation") + console.print(" - Parallel processing support\n") + + def main(): + """Entry point for CLI.""" + app() + +else: + # Fallback for systems without typer/rich + def main(): + """Simple argparse-based CLI fallback.""" + parser = argparse.ArgumentParser( + description="Local Batch PDF Accessibility Enhancement" + ) + subparsers = parser.add_subparsers(dest="command", help="Commands") + + # Process command + process_parser = subparsers.add_parser("process", help="Process single PDF") + process_parser.add_argument("input_path", type=Path, help="Input PDF") + process_parser.add_argument("output_path", type=Path, help="Output PDF") + 
process_parser.add_argument("--skip-ocr", action="store_true") + process_parser.add_argument("--force-ocr", action="store_true") + process_parser.add_argument("--verbose", "-v", action="store_true") + + # Batch command + batch_parser = subparsers.add_parser("batch", help="Batch process PDFs") + batch_parser.add_argument("input_dir", type=Path, help="Input directory") + batch_parser.add_argument("output_dir", type=Path, help="Output directory") + batch_parser.add_argument("--workers", "-w", type=int, default=1) + batch_parser.add_argument("--skip-ocr", action="store_true") + batch_parser.add_argument("--force-ocr", action="store_true") + batch_parser.add_argument("--verbose", "-v", action="store_true") + + args = parser.parse_args() + setup_logging(getattr(args, 'verbose', False)) + + if args.command == "process": + service = EnhancementService() + success = service.enhance_document( + args.input_path, args.output_path, + skip_ocr=args.skip_ocr, force_ocr=args.force_ocr + ) + sys.exit(0 if success else 1) + + elif args.command == "batch": + processor = BatchProcessor() + summary = processor.process_batch( + args.input_dir, args.output_dir, + workers=args.workers, + skip_ocr=args.skip_ocr, force_ocr=args.force_ocr + ) + sys.exit(0 if summary.get("success") and summary.get("failed", 0) == 0 else 1) + + else: + parser.print_help() + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/local_batch_processor/enhancement_service.py b/local_batch_processor/enhancement_service.py new file mode 100644 index 0000000..7c46510 --- /dev/null +++ b/local_batch_processor/enhancement_service.py @@ -0,0 +1,147 @@ +#!/usr/bin/env python3 +""" +Enhancement Service for PDF Accessibility. + +Orchestrates the two-step enhancement pipeline: +1. OCR enhancement (if needed) +2. PDF/UA-1 compliance preparation +""" + +import logging +import tempfile +from pathlib import Path +from typing import Union, Optional + +from .ocr_enhancer import OCREnhancer +from .pdfua_enhancer import PDFUAEnhancer + +logger = logging.getLogger(__name__) + + +class EnhancementService: + """ + Orchestration service for PDF enhancement pipeline. + + Pipeline: + 1. OCR Enhancement (adds invisible searchable text) + 2. PDF/UA-1 Enhancement (metadata and compliance markers) + """ + + def __init__( + self, + text_threshold: int = 100, + dpi: int = 300, + language: str = 'eng' + ): + """ + Initialize the enhancement service. + + Args: + text_threshold: Minimum chars/page to skip OCR (default: 100) + dpi: DPI for OCR processing (default: 300) + language: OCR language code (default: 'eng') + """ + self.ocr_enhancer = OCREnhancer( + text_threshold=text_threshold, + dpi=dpi, + language=language + ) + self.pdfua_enhancer = PDFUAEnhancer() + + logger.info("EnhancementService initialized") + + def enhance_document( + self, + input_path: Union[str, Path], + output_path: Union[str, Path], + title: Optional[str] = None, + author: Optional[str] = None, + language: str = "en-US", + skip_ocr: bool = False, + force_ocr: bool = False, + **kwargs + ) -> bool: + """ + Enhance a PDF document through the complete pipeline. 
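+
+        Example (paths are illustrative)::
+
+            service = EnhancementService()
+            service.enhance_document("scan.pdf", "scan_accessible.pdf", title="Scanned Report")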
+ + Args: + input_path: Path to input PDF + output_path: Path for enhanced PDF + title: Document title (default: from filename) + author: Document author (default: None) + language: Document language (default: "en-US") + skip_ocr: Skip OCR step entirely (default: False) + force_ocr: Force OCR even if text exists (default: False) + + Returns: + bool: True if successful, False otherwise + """ + try: + input_path = Path(input_path) + output_path = Path(output_path) + + if not input_path.exists(): + logger.error(f"Input file not found: {input_path}") + return False + + output_path.parent.mkdir(parents=True, exist_ok=True) + + logger.info(f"Starting enhancement: {input_path}") + logger.info(f"Output: {output_path}") + + # Step 1: OCR Enhancement + if skip_ocr: + logger.info("Skipping OCR (skip_ocr=True)") + ocr_output = input_path + ocr_performed = False + else: + with tempfile.NamedTemporaryFile( + suffix='.pdf', + delete=False, + dir=output_path.parent + ) as tmp_file: + ocr_output = Path(tmp_file.name) + + logger.info("Step 1/2: OCR Enhancement") + ocr_success = self.ocr_enhancer.enhance_file( + input_path, + ocr_output, + force_ocr=force_ocr + ) + + if not ocr_success: + logger.error("OCR enhancement failed") + if ocr_output.exists(): + ocr_output.unlink() + return False + + ocr_performed = True + + # Step 2: PDF/UA Enhancement + logger.info("Step 2/2: PDF/UA-1 Enhancement") + pdfua_success = self.pdfua_enhancer.enhance_file( + ocr_output, + output_path, + title=title, + author=author, + language=language, + ocr_performed=ocr_performed + ) + + # Cleanup temporary OCR file + if not skip_ocr and ocr_output != input_path: + if ocr_output.exists(): + ocr_output.unlink() + + if not pdfua_success: + logger.error("PDF/UA enhancement failed") + return False + + logger.info(f"Enhancement complete: {output_path}") + return True + + except Exception as e: + logger.error(f"Error in enhancement pipeline: {e}") + import traceback + traceback.print_exc() + return False diff --git a/local_batch_processor/ocr_enhancer.py b/local_batch_processor/ocr_enhancer.py new file mode 100644 index 0000000..6cfa4f4 --- /dev/null +++ b/local_batch_processor/ocr_enhancer.py @@ -0,0 +1,282 @@ +#!/usr/bin/env python3 +""" +OCR Enhancer for PDF Accessibility. + +Creates invisible OCR text layers that screen readers can access +while preserving the visual appearance of the document. +""" + +import logging +import shutil +import tempfile +import os +from pathlib import Path +from typing import Union + +try: + import fitz # PyMuPDF +except ImportError: + raise ImportError("PyMuPDF is required. Install with: pip install PyMuPDF") + +try: + import ocrmypdf + OCRMYPDF_AVAILABLE = True +except ImportError: + OCRMYPDF_AVAILABLE = False + +try: + import pikepdf + PIKEPDF_AVAILABLE = True +except ImportError: + PIKEPDF_AVAILABLE = False + +logger = logging.getLogger(__name__) + + +class OCREnhancer: + """ + OCR enhancer that creates invisible searchable text layers. + + Features: + - Intelligent OCR detection (checks existing text content) + - Sandwich renderer for invisible text placement + - Page box normalization for non-standard PDFs + - Configurable DPI and language settings + """ + + def __init__( + self, + text_threshold: int = 100, + dpi: int = 300, + language: str = 'eng' + ): + """ + Initialize the OCR enhancer. 
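+
+        Typical standalone use (paths are illustrative)::
+
+            enhancer = OCREnhancer(dpi=300, language="eng")
+            enhancer.enhance_file("scan.pdf", "scan_with_text.pdf")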
+ + Args: + text_threshold: Minimum avg chars/page to skip OCR (default: 100) + dpi: DPI for OCR processing (default: 300) + language: Tesseract language code (default: 'eng') + """ + if not OCRMYPDF_AVAILABLE: + raise ImportError( + "ocrmypdf is required. Install with: pip install ocrmypdf" + ) + + self.text_threshold = text_threshold + self.dpi = dpi + self.language = language + + logger.info( + f"OCREnhancer initialized (DPI: {dpi}, Lang: {language}, " + f"Threshold: {text_threshold})" + ) + + def needs_ocr(self, pdf_path: Union[str, Path]) -> bool: + """ + Check if PDF needs OCR by analyzing text content. + + Args: + pdf_path: Path to the PDF file + + Returns: + bool: True if OCR is needed, False otherwise + """ + try: + pdf_path = Path(pdf_path) + doc = fitz.open(pdf_path) + + total_pages = len(doc) + if total_pages == 0: + doc.close() + return True + + # Sample first few pages + sample_pages = min(3, total_pages) + total_chars = 0 + + for page_num in range(sample_pages): + page = doc[page_num] + text = page.get_text().strip() + total_chars += len(text) + + doc.close() + + avg_chars = total_chars / sample_pages + needs_ocr = avg_chars < self.text_threshold + + logger.info( + f"Text analysis: {avg_chars:.1f} avg chars/page - " + f"OCR needed: {needs_ocr}" + ) + return needs_ocr + + except Exception as e: + logger.error(f"Error analyzing PDF text: {e}") + return True + + def enhance_file( + self, + input_path: Union[str, Path], + output_path: Union[str, Path], + force_ocr: bool = False, + **kwargs + ) -> bool: + """ + Apply OCR to PDF file if needed. + + Args: + input_path: Path to the input PDF + output_path: Path for the enhanced PDF + force_ocr: Force OCR even if text exists (default: False) + + Returns: + bool: True if successful, False otherwise + """ + try: + input_path = Path(input_path) + output_path = Path(output_path) + + if not input_path.exists(): + logger.error(f"Input file not found: {input_path}") + return False + + output_path.parent.mkdir(parents=True, exist_ok=True) + + if not force_ocr and not self.needs_ocr(input_path): + logger.info("PDF has sufficient text - copying to output") + shutil.copy2(input_path, output_path) + return True + + logger.info(f"Starting OCR processing: {input_path}") + return self._apply_ocr(input_path, output_path, force_ocr=force_ocr) + + except Exception as e: + logger.error(f"Error in OCR enhancement: {e}") + return False + + def _apply_ocr( + self, + input_path: Path, + output_path: Path, + force_ocr: bool = False + ) -> bool: + """Apply OCR using ocrmypdf.""" + try: + # Normalize page boxes for non-standard PDFs + temp_normalized = None + if force_ocr and PIKEPDF_AVAILABLE: + logger.info("Normalizing page boxes before OCR") + normalized_path = self._normalize_page_boxes(input_path) + if normalized_path != input_path: + temp_normalized = normalized_path + input_path = normalized_path + + options = { + "language": self.language, + "redo_ocr": force_ocr, + "force_ocr": False, + "skip_text": not force_ocr, + "optimize": 0, + "output_type": "pdf", + "pdf_renderer": "sandwich", + "progress_bar": False, + "image_dpi": self.dpi, + "rotate_pages": True, + "remove_vectors": False, + } + + ocrmypdf.ocr(input_path, output_path, **options) + + if temp_normalized: + try: + os.unlink(temp_normalized) + except Exception: + pass + + logger.info(f"OCR complete: {output_path}") + return True + + except ocrmypdf.exceptions.PriorOcrFoundError: + logger.info("PDF already has OCR text, copying") + shutil.copy2(input_path, output_path) + return True + + except 
ocrmypdf.exceptions.EncryptedPdfError: + logger.error(f"Cannot process encrypted PDF: {input_path}") + return False + + except Exception as e: + logger.error(f"Error applying OCR: {e}") + return False + + def _normalize_page_boxes(self, input_path: Path) -> Path: + """ + Normalize PDF page boxes to handle non-standard coordinates. + + Some PDFs have MediaBox with negative origins which causes + OCR text positioning issues. + """ + if not PIKEPDF_AVAILABLE: + return input_path + + try: + temp_fd, temp_path = tempfile.mkstemp(suffix='.pdf') + os.close(temp_fd) + + with pikepdf.open(input_path) as pdf: + pages_normalized = 0 + + for page_num, page in enumerate(pdf.pages): + if '/MediaBox' not in page: + continue + + media_box = page.MediaBox + x0 = float(media_box[0]) + y0 = float(media_box[1]) + x1 = float(media_box[2]) + y1 = float(media_box[3]) + + if x0 != 0 or y0 != 0: + width = x1 - x0 + height = y1 - y0 + + # Shift content to match normalized coordinates + transformation = f"1 0 0 1 {-x0} {-y0} cm\n".encode('latin-1') + + if '/Contents' in page: + contents = page.Contents + if isinstance(contents, pikepdf.Array): + first_stream = contents[0] + old_data = first_stream.read_bytes() + first_stream.write(transformation + old_data) + else: + old_data = contents.read_bytes() + contents.write(transformation + old_data) + + page.MediaBox = pikepdf.Array([0, 0, width, height]) + + for box_name in ['/CropBox', '/TrimBox', '/BleedBox', '/ArtBox']: + if box_name in page: + box = page[box_name] + page[box_name] = pikepdf.Array([ + float(box[0]) - x0, + float(box[1]) - y0, + float(box[2]) - x0, + float(box[3]) - y0 + ]) + + pages_normalized += 1 + + pdf.save(temp_path) + + if pages_normalized > 0: + logger.info(f"Normalized {pages_normalized} pages") + return Path(temp_path) + else: + os.unlink(temp_path) + return input_path + + except Exception as e: + logger.warning(f"Could not normalize page boxes: {e}") + return input_path diff --git a/local_batch_processor/pdfua_enhancer.py b/local_batch_processor/pdfua_enhancer.py new file mode 100644 index 0000000..b7c7453 --- /dev/null +++ b/local_batch_processor/pdfua_enhancer.py @@ -0,0 +1,286 @@ +#!/usr/bin/env python3 +""" +PDF/UA Enhancer for accessibility compliance. + +Prepares PDFs for PDF/UA-1 compliance by: +- Stripping orphan tags that interfere with proper tagging +- Adding required metadata and compliance markers +- Setting document properties (title, author, language) +""" + +import logging +import re +from pathlib import Path +from typing import Union, Optional +from datetime import datetime + +try: + import pikepdf + from pikepdf import Pdf, Name, String, Dictionary, Array +except ImportError: + raise ImportError("pikepdf is required. Install with: pip install pikepdf") + +logger = logging.getLogger(__name__) + + +class PDFUAEnhancer: + """ + PDF/UA enhancer for accessibility compliance preparation. + + Features: + - Orphan tag stripping for clean Acrobat workflow + - PDF/UA-1 metadata and compliance flags + - Document property management + - Structure tree cleanup after OCR + """ + + def __init__(self): + """Initialize the PDF/UA enhancer.""" + logger.info("PDFUAEnhancer initialized") + + def enhance_file( + self, + input_path: Union[str, Path], + output_path: Union[str, Path], + title: Optional[str] = None, + author: Optional[str] = None, + language: str = "en-US", + ocr_performed: bool = False, + **kwargs + ) -> bool: + """ + Apply PDF/UA-1 compliance enhancements. 
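+
+        Normally called by EnhancementService after OCR; standalone use
+        (paths are illustrative) looks like::
+
+            enhancer = PDFUAEnhancer()
+            enhancer.enhance_file("with_text.pdf", "ua_ready.pdf", title="Annual Report")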
+ + Args: + input_path: Path to the input PDF + output_path: Path for the enhanced PDF + title: Document title (default: from filename) + author: Document author (default: None) + language: Document language (default: "en-US") + ocr_performed: Whether OCR was just performed + + Returns: + bool: True if successful, False otherwise + """ + try: + input_path = Path(input_path) + output_path = Path(output_path) + + if not input_path.exists(): + logger.error(f"Input file not found: {input_path}") + return False + + output_path.parent.mkdir(parents=True, exist_ok=True) + + logger.info(f"Loading PDF: {input_path}") + document = pikepdf.open(input_path) + + self._enhance(document, input_path, title, author, language, ocr_performed) + + logger.info(f"Saving enhanced PDF: {output_path}") + document.save(output_path) + document.close() + + logger.info(f"PDF/UA enhancement complete: {output_path}") + return True + + except Exception as e: + logger.error(f"Error in PDF/UA enhancement: {e}") + return False + + def _enhance( + self, + document: pikepdf.Pdf, + input_path: Path, + title: Optional[str], + author: Optional[str], + language: str, + ocr_performed: bool = False + ) -> None: + """Apply PDF/UA-1 enhancements.""" + # Clean up tag structure + if ocr_performed: + self._remove_incomplete_struct_tree(document) + else: + pages_cleaned = self._strip_orphan_tags(document) + if pages_cleaned > 0: + logger.info(f"Stripped orphan tags from {pages_cleaned} pages") + + # Set title + if not title: + title = self._extract_title(document, input_path) + + # Update metadata and add compliance markers + self._update_metadata(document, title, author, language) + self._add_pdfua_compliance(document, language) + + logger.info("Applied PDF/UA-1 enhancements") + + def _remove_incomplete_struct_tree(self, document: pikepdf.Pdf) -> None: + """ + Remove incomplete StructTreeRoot but preserve marked content. + + After OCR, we remove orphan structure references while keeping + the marked content operators that Acrobat needs for manual tagging. + """ + logger.info("Removing incomplete StructTreeRoot...") + + if Name.StructTreeRoot in document.Root: + del document.Root[Name.StructTreeRoot] + logger.info("Removed StructTreeRoot") + + pages_cleaned = 0 + for page in document.pages: + if Name.StructParents in page: + del page[Name.StructParents] + pages_cleaned += 1 + + if pages_cleaned > 0: + logger.info(f"Removed StructParents from {pages_cleaned} pages") + + logger.info("Preserved marked content for Acrobat tagging") + + def _strip_orphan_tags(self, document: pikepdf.Pdf) -> int: + """ + Strip orphan tagged content that interferes with tag tree creation. 
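+
+        "Orphan" tags are marked-content operators (BMC/BDC/EMC) and
+        /StructParents entries left in page content without a matching
+        StructTreeRoot; stripping them gives downstream tagging tools a
+        clean slate.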
+ + Returns: + int: Number of pages cleaned + """ + logger.info("Starting orphan tag cleanup...") + pages_cleaned = 0 + + try: + # Check for valid structure + if Name.StructTreeRoot in document.Root: + logger.info("Document has StructTreeRoot - preserving") + return 0 + + logger.info("No StructTreeRoot - stripping orphan tags") + + for page in document.pages: + if Name.StructParents in page: + del page[Name.StructParents] + pages_cleaned += 1 + + try: + if Name.Contents in page: + contents = page.Contents + if isinstance(contents, pikepdf.Array): + for stream in contents: + self._strip_marked_content(stream) + else: + self._strip_marked_content(contents) + except Exception as e: + logger.debug(f"Could not clean stream: {e}") + + if pages_cleaned > 0: + logger.info(f"Removed orphan StructParents from {pages_cleaned} pages") + + except Exception as e: + logger.warning(f"Error stripping orphan tags: {e}") + + return pages_cleaned + + def _strip_marked_content(self, stream) -> bool: + """Strip marked content operators from a stream.""" + try: + content_bytes = stream.read_raw_bytes() + + try: + content_str = content_bytes.decode('utf-8', errors='ignore') + except Exception: + content_str = content_bytes.decode('latin-1', errors='ignore') + + if not any(op in content_str for op in [' BMC', ' BDC', ' EMC']): + return False + + # Remove marked content operators + content_str = re.sub(r'/\w+\s+(?:<<[^>]*>>\s+)?BDC\s*', '', content_str) + content_str = re.sub(r'/\w+\s+BMC\s*', '', content_str) + content_str = re.sub(r'EMC\s*', '', content_str) + + stream.write( + content_str.encode('latin-1', errors='ignore'), + filter=pikepdf.Name.FlateDecode + ) + return True + + except Exception as e: + logger.debug(f"Could not strip marked content: {e}") + return False + + def _extract_title(self, document: pikepdf.Pdf, input_path: Path) -> str: + """Extract or generate document title.""" + try: + with document.open_metadata() as meta: + title = meta.get("dc:title") + if title and title.strip(): + return str(title).strip() + except Exception: + pass + + # Fallback to filename + title = input_path.stem.replace("_", " ").replace("-", " ") + return title.title().strip() + + def _update_metadata( + self, + document: pikepdf.Pdf, + title: str, + author: Optional[str], + language: str + ) -> None: + """Update document metadata.""" + try: + with document.open_metadata(set_pikepdf_as_editor=False) as meta: + if title: + meta["dc:title"] = title + if author: + meta["dc:creator"] = author + meta["dc:language"] = language + meta["xmp:CreateDate"] = datetime.now().isoformat() + meta["pdf:Producer"] = "Local Batch Processor for PDF Accessibility" + except Exception as e: + logger.warning(f"Error updating metadata: {e}") + + def _add_pdfua_compliance(self, document: pikepdf.Pdf, language: str) -> None: + """Add PDF/UA-1 compliance markers.""" + # Mark document for tagging + if Name.MarkInfo not in document.Root: + document.Root.MarkInfo = Dictionary() + + document.Root.MarkInfo[Name.Marked] = True + document.Root.MarkInfo[Name.Suspects] = True + + # Set language + document.Root[Name.Lang] = String(language) + + # Add PDF/UA-1 OutputIntent + output_intent = Dictionary( + Type=Name.OutputIntent, + S=Name.GTS_PDFUA1, + OutputConditionIdentifier=String("PDF/UA-1"), + Info=String("PDF for Universal Accessibility"), + RegistryName=String("http://www.color.org") + ) + + if Name.OutputIntents not in document.Root: + document.Root.OutputIntents = Array() + + # Check for duplicates + pdfua_exists = any( + intent.get(Name.S) == 
Name.GTS_PDFUA1 + for intent in document.Root.OutputIntents + if isinstance(intent, Dictionary) + ) + + if not pdfua_exists: + document.Root.OutputIntents.append(output_intent) + + # Add XMP metadata + try: + with document.open_metadata() as meta: + meta["pdfuaid:part"] = "1" + except Exception as e: + logger.warning(f"Error adding XMP metadata: {e}") diff --git a/local_batch_processor/requirements.txt b/local_batch_processor/requirements.txt new file mode 100644 index 0000000..20f3b07 --- /dev/null +++ b/local_batch_processor/requirements.txt @@ -0,0 +1,11 @@ +# Core dependencies +PyMuPDF>=1.23.0 +pikepdf>=8.0.0 +ocrmypdf>=15.0.0 + +# CLI (optional but recommended) +typer>=0.9.0 +rich>=13.0.0 + +# Progress tracking (optional) +tqdm>=4.65.0