Rust CLI and library

Evidence-preserving semantic diffs for generated PDFs.

Compare PDF revisions by extracted text blocks, semantic document surfaces, and deterministic evidence instead of unstable screenshots or raw byte changes.

Status: compatibility-gate
Reports: JSON, AI JSON, Markdown, HTML
Core: Rust 2024, MSRV 1.85
diff.json (stable output)

Summary: Inserted 1, Modified 2, Moved 1, Diagnostics 0

Sample blocks: payment terms, delivery date, signature block
{
  "id": "change-0001",
  "kind": "Modified",
  "severity": "Major",
  "confidence": 0.90,
  "reason": "paragraph text differs"
}

Features

What the project does today

spdfdiff extracts parser-backed evidence from digitally generated PDFs and emits deterministic reports for review, automation, and agent workflows.

Semantic text diffing

Builds text blocks from extracted positioned text, compares inserted, deleted, modified, moved, and layout-shifted blocks, and includes page, bounding-box, and semantic-role evidence when available.
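
As a rough illustration of block-level classification, the sketch below matches blocks by exact text and reports inserted, deleted, and moved blocks. It is not the project's matcher: the `Block` and `Change` types and the `classify` function are hypothetical names, moves are detected only across pages, and modified or layout-shifted detection is left out.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified text block: page number plus extracted text.
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub page: u32,
    pub text: String,
}

/// Hypothetical change kinds covering a subset of what the report lists.
#[derive(Debug, PartialEq)]
pub enum Change {
    Inserted(Block),
    Deleted(Block),
    Moved { from_page: u32, to_page: u32, text: String },
}

/// Match blocks by exact text. A matched block on a different page is a move,
/// unmatched new blocks are insertions, unmatched old blocks are deletions.
pub fn classify(old: &[Block], new: &[Block]) -> Vec<Change> {
    // Index old blocks by text. The map is used for lookup only, so its
    // iteration order never influences the output order.
    let mut old_by_text: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, block) in old.iter().enumerate() {
        old_by_text.entry(block.text.as_str()).or_default().push(i);
    }

    let mut matched = vec![false; old.len()];
    let mut changes = Vec::new();

    for nb in new {
        // Take the first old block with the same text that is still unmatched.
        let candidate = old_by_text
            .get(nb.text.as_str())
            .and_then(|idxs| idxs.iter().copied().find(|&i| !matched[i]));
        match candidate {
            Some(i) => {
                matched[i] = true;
                if old[i].page != nb.page {
                    changes.push(Change::Moved {
                        from_page: old[i].page,
                        to_page: nb.page,
                        text: nb.text.clone(),
                    });
                }
            }
            None => changes.push(Change::Inserted(nb.clone())),
        }
    }

    // Whatever was never matched on the old side has been deleted.
    for (i, block) in old.iter().enumerate() {
        if !matched[i] {
            changes.push(Change::Deleted(block.clone()));
        }
    }
    changes
}
```

Output follows the order of the new document, then the old one, so the same inputs always produce the same list.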

Stable report output

Produces deterministic JSON, compact AI review JSON, Markdown, and self-contained HTML reports without timestamps, absolute paths, or random IDs. AI review tags identify repeated page-region changes separately from body text.
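
One way to get identifiers such as change-0001 without random IDs is to sort changes by a content-derived key and number them in that order. The sketch below assumes a hypothetical `ChangeKey` type and `assign_ids` function; the actual ordering rule used by spdfdiff is not documented here.

```rust
/// Hypothetical sort key for a change. The derived `Ord` compares fields in
/// declaration order: page first, then position on the page, then reason.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ChangeKey {
    pub page: u32,
    pub order_on_page: u32,
    pub reason: String,
}

/// Sort the keys, then assign sequential zero-padded identifiers.
/// The result depends only on the set of keys, not on their input order.
pub fn assign_ids(mut keys: Vec<ChangeKey>) -> Vec<(String, ChangeKey)> {
    keys.sort();
    keys.into_iter()
        .enumerate()
        .map(|(i, key)| (format!("change-{:04}", i + 1), key))
        .collect()
}
```

Because the identifiers depend only on sorted content, two runs over the same pair of PDFs produce byte-identical reports that can be compared directly.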

Text extraction

Uses /ToUnicode when available and a conservative Base14 Latin fallback for safe Helvetica, Times, and Courier-family simple-font text.
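
The fallback order described above can be sketched as a small decision function. `TextPath` and `choose_text_path` are hypothetical names, and matching on the exact Standard 14 base font name is an assumption about what "safe" means here, not the project's actual rule.

```rust
/// Hypothetical decoding paths for a simple font.
#[derive(Debug, PartialEq)]
pub enum TextPath {
    ToUnicode,
    Base14Latin,
    Unsupported,
}

/// The twelve Latin text fonts among the PDF Standard 14 fonts.
/// Symbol and ZapfDingbats are Standard 14 fonts too, but they are not Latin
/// text fonts, so they are left out on purpose.
const BASE14_LATIN: [&str; 12] = [
    "Helvetica",
    "Helvetica-Bold",
    "Helvetica-Oblique",
    "Helvetica-BoldOblique",
    "Times-Roman",
    "Times-Bold",
    "Times-Italic",
    "Times-BoldItalic",
    "Courier",
    "Courier-Bold",
    "Courier-Oblique",
    "Courier-BoldOblique",
];

/// Prefer an explicit /ToUnicode map. Without one, accept only an exact
/// Latin Standard 14 base font name. Anything else is reported as unsupported
/// rather than guessed.
pub fn choose_text_path(has_to_unicode: bool, base_font: &str) -> TextPath {
    if has_to_unicode {
        TextPath::ToUnicode
    } else if BASE14_LATIN.contains(&base_font) {
        TextPath::Base14Latin
    } else {
        TextPath::Unsupported
    }
}
```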

Document surface checks

Compares selected images, vector operations, annotations, links, form fields, outlines, name trees, metadata, XMP, and embedded file surfaces through typed deterministic signatures.
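
A typed deterministic signature can be pictured as a hash over a canonical encoding of a surface's fields. The sketch below uses a hand-written FNV-1a hash so the result does not depend on the standard library's hasher. The `LinkSurface` type, its fields, and the canonical string layout are assumptions for illustration, not the project's signature format.

```rust
/// 64-bit FNV-1a. Written out by hand so the value is identical across
/// platforms and compiler versions.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical link surface: page, target URI, and an integer rectangle.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LinkSurface {
    pub page: u32,
    pub uri: String,
    pub rect: [i32; 4],
}

impl LinkSurface {
    /// Encode the fields in a fixed order with a type tag, then hash.
    /// The tag keeps signatures of different surface types from colliding
    /// just because their fields happen to look alike.
    pub fn signature(&self) -> String {
        let canonical = format!(
            "link|{}|{}|{},{},{},{}",
            self.page, self.uri, self.rect[0], self.rect[1], self.rect[2], self.rect[3]
        );
        format!("{:016x}", fnv1a64(canonical.as_bytes()))
    }
}
```

Comparing two revisions then reduces to comparing sets of signatures per surface type.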

Parser-backed diagnostics

Surfaces unsupported or degraded PDF features through stable diagnostic codes instead of hiding gaps behind silent fallbacks.
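
The idea of reporting a gap instead of silently falling back can be sketched with a diagnostic type that carries a stable code. The enum variants, the code strings, and the `decode_stream` function below are hypothetical; the real diagnostic codes live in the project's schemas.

```rust
use std::fmt;

/// Hypothetical diagnostics. The codes are illustrative, not spdfdiff's own.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Diagnostic {
    UnsupportedFilter { name: String },
    MissingToUnicode { font: String },
}

impl Diagnostic {
    /// Stable code that downstream tools can match on without parsing prose.
    pub fn code(&self) -> &'static str {
        match self {
            Diagnostic::UnsupportedFilter { .. } => "PDF_UNSUPPORTED_FILTER",
            Diagnostic::MissingToUnicode { .. } => "TEXT_MISSING_TOUNICODE",
        }
    }
}

impl fmt::Display for Diagnostic {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Diagnostic::UnsupportedFilter { name } => {
                write!(f, "{}: stream filter {} is not supported", self.code(), name)
            }
            Diagnostic::MissingToUnicode { font } => {
                write!(f, "{}: font {} has no /ToUnicode map", self.code(), font)
            }
        }
    }
}

/// Toy decoder: unfiltered streams pass through, every named filter is
/// reported as a diagnostic. A real decoder would handle supported filters
/// before reaching the error arm.
pub fn decode_stream(filter: Option<&str>, data: &[u8]) -> Result<Vec<u8>, Diagnostic> {
    match filter {
        None => Ok(data.to_vec()),
        Some(name) => Err(Diagnostic::UnsupportedFilter { name: name.to_string() }),
    }
}
```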

CI check command

Runs configured PDF pairs with threshold checks, baseline suppression, deterministic artifacts, and a composite GitHub Action wrapper.

External OCR adapter

Can call SPDFDIFF_OCR_COMMAND or tesseract for supported image-only samples while preserving provenance and diagnostics.
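
A minimal sketch of how the OCR command might be resolved, assuming the environment variable takes precedence and tesseract is the fallback when it is unset or blank. The precedence rule and the function names are assumptions; only the SPDFDIFF_OCR_COMMAND variable and the tesseract fallback come from the description above.

```rust
use std::env;

/// Pick the OCR command from an optional override. Blank values count as unset.
pub fn resolve_ocr_command(env_value: Option<&str>) -> String {
    match env_value.map(str::trim) {
        Some(value) if !value.is_empty() => value.to_string(),
        _ => "tesseract".to_string(),
    }
}

/// Read the override from the process environment.
pub fn ocr_command_from_env() -> String {
    let value = env::var("SPDFDIFF_OCR_COMMAND").ok();
    resolve_ocr_command(value.as_deref())
}
```

Keeping the pure function separate from the environment lookup makes the rule testable without mutating process state.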

Quickstart

Build and run locally

Build the workspace from a checkout, then compare two PDFs with the spdfdiff binary.

git clone https://github.com/eraydin/semantic-pdf-diff.git
cd semantic-pdf-diff
cargo build --workspace
.\target\debug\spdfdiff.exe diff .\old.pdf .\new.pdf --format json --output .\diff.json

CLI

Command reference

diff

Compare two PDF files and write JSON, AI JSON, Markdown, or HTML.

spdfdiff diff old.pdf new.pdf --format md --output diff.md

inspect

Inspect parser-level PDF structure and diagnostics.

spdfdiff inspect document.pdf --format json

extract

Extract text blocks and semantic evidence from one PDF.

spdfdiff extract document.pdf --format json

check

Run configured PDF comparisons for CI and policy gates.

spdfdiff check --config .spdfdiff.toml

corpus and benchmark

Evaluate sample manifests and run deterministic benchmark smoke gates.

spdfdiff corpus samples --manifest samples/compatibility_corpus_manifest.json --fail-on-gate
spdfdiff benchmark --pages 50 --output benchmark.json

review

Send AI review JSON to an optional local OpenAI-compatible HTTP endpoint.

spdfdiff review ai-review.json --endpoint http://127.0.0.1:8080/v1 --model local-model

Reports

Choose the output for your workflow

JSON is the stable machine contract. AI review JSON is compact and prompt-ready. Markdown is readable in code review. HTML is a self-contained evidence report with inline overlays where bounding boxes are available.

Format    Use case
json      Automated diff gates and artifact storage.
ai-json   Neutral review prompts and local LLM workflows.
md        Pull request notes and human-readable summaries.
html      Self-contained visual evidence review.
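
For tooling that wraps the CLI, the four format names can be modeled as a small enum. This is a sketch on the consumer side: `ReportFormat` and its methods are hypothetical and are not part of the spdfdiff library API.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the report format names listed in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReportFormat {
    Json,
    AiJson,
    Markdown,
    Html,
}

impl FromStr for ReportFormat {
    type Err = String;

    /// Parse the exact lowercase names. Anything else is rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "json" => Ok(ReportFormat::Json),
            "ai-json" => Ok(ReportFormat::AiJson),
            "md" => Ok(ReportFormat::Markdown),
            "html" => Ok(ReportFormat::Html),
            other => Err(format!("unknown report format: {}", other)),
        }
    }
}

impl ReportFormat {
    /// JSON variants are meant for machines, the others for human review.
    pub fn is_machine_readable(self) -> bool {
        matches!(self, ReportFormat::Json | ReportFormat::AiJson)
    }
}
```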

CI

Use semantic PDF checks in GitHub Actions

The repository includes a composite Action that runs spdfdiff check, uploads deterministic artifacts, and fails the workflow when configured thresholds are exceeded.

schema_version = "1"
output_dir = "target/spdfdiff-check"
formats = ["json", "html"]
fail_on_changes = true

[[pairs]]
name = "contract"
old = "old.pdf"
new = "new.pdf"
baseline = "approved-contract-diff.json"
max_diagnostics = 0

name: PDF semantic diff

on: [pull_request]

jobs:
  pdf-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: eraydin/semantic-pdf-diff@main
        with:
          config: .spdfdiff.toml
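
The gating logic implied by fail_on_changes, baseline, and max_diagnostics can be sketched as follows. This is an assumption about the semantics, not the check command's implementation: changes are represented by hypothetical string fingerprints, and a baseline is modeled as the set of fingerprints already approved.

```rust
use std::collections::BTreeSet;

/// Subset of the per-pair settings shown in the config above.
pub struct PairPolicy {
    pub fail_on_changes: bool,
    pub max_diagnostics: usize,
}

/// Return Ok when the pair passes, or the list of reasons it failed.
pub fn evaluate_pair(
    policy: &PairPolicy,
    change_fingerprints: &[&str],
    baseline: &BTreeSet<&str>,
    diagnostics: usize,
) -> Result<(), Vec<String>> {
    let mut failures = Vec::new();

    // Baseline suppression: changes already approved do not count.
    let unapproved: Vec<&str> = change_fingerprints
        .iter()
        .copied()
        .filter(|fingerprint| !baseline.contains(fingerprint))
        .collect();

    if policy.fail_on_changes && !unapproved.is_empty() {
        failures.push(format!(
            "{} unapproved change(s): {}",
            unapproved.len(),
            unapproved.join(", ")
        ));
    }
    if diagnostics > policy.max_diagnostics {
        failures.push(format!(
            "{} diagnostic(s) exceed max_diagnostics = {}",
            diagnostics, policy.max_diagnostics
        ));
    }

    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}
```

Collecting every failure before returning lets a CI log show all exceeded thresholds in one run instead of stopping at the first.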

Schemas

Machine-readable contracts

Stable report schemas live in the repository so downstream tools can validate diff reports, AI review artifacts, and CI check summaries.

Compatibility

Current boundary

semantic-pdf-diff is a compatibility-gate project. It is useful for controlled, digitally generated PDFs and the committed sample scenarios, but it does not yet claim broad public-alpha compatibility with arbitrary real-world PDFs.

Supported parser foundation: classic xref tables, controlled xref streams, controlled object streams, selected stream filters, page tree traversal, inherited page resources, /ToUnicode and conservative Base14 Latin text extraction paths, structure-tree summaries, incremental-update metadata, and resource-limit enforcement.
Still incremental work: renderer-grade visual diffing, arbitrary table reconstruction, broad tagged-PDF coverage, and corpus-backed public-alpha compatibility labels.

Architecture

Crate map

spdfdiff_types, pdf_core, pdf_content, pdf_text, pdf_semantic, diff_core, diff_report, spdfdiff_cli

Core PDF parsing stays in pdf_core. Content interpretation, text extraction, semantic layout, diff matching, reporting, and CLI orchestration remain separated so downstream crates do not couple semantic diff logic directly to raw PDF object internals.