Rust CLI and library
Compare PDF revisions by extracted text blocks, semantic document surfaces, and deterministic evidence instead of unstable screenshots or raw byte changes.
Summary
{
  "id": "change-0001",
  "kind": "Modified",
  "severity": "Major",
  "confidence": 0.90,
  "reason": "paragraph text differs"
}
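As a rough illustration of the record above, the sketch below models one change and renders it with the same field order. The `Change` and `Kind` types, the `change_id` helper, and the hand-rolled `to_json` are hypothetical names introduced here, not the crate's published API or schema. It assumes IDs are assigned sequentially by position, which is one simple way to avoid random IDs.

```rust
// Hypothetical types modeled on the sample record; not the published schema.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Inserted,
    Deleted,
    Modified,
}

struct Change {
    id: String,
    kind: Kind,
    severity: &'static str,
    confidence: f64,
    reason: String,
}

/// Sequential, zero-padded IDs depend only on position, so repeated runs
/// over the same ordered input yield the same IDs. (Assumed scheme.)
fn change_id(index: usize) -> String {
    format!("change-{:04}", index + 1)
}

impl Change {
    /// Hand-rolled JSON for illustration only; assumes `reason` contains
    /// no characters that need escaping.
    fn to_json(&self) -> String {
        format!(
            "{{\"id\":\"{}\",\"kind\":\"{:?}\",\"severity\":\"{}\",\"confidence\":{:.2},\"reason\":\"{}\"}}",
            self.id, self.kind, self.severity, self.confidence, self.reason
        )
    }
}
```

A real implementation would more likely use a serialization library; the point here is only that nothing in the record depends on time, location, or randomness.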
Features
spdfdiff extracts parser-backed evidence from digitally
generated PDFs and emits deterministic reports for review,
automation, and agent workflows.
Builds text blocks from extracted positioned text, compares inserted, deleted, modified, moved, and layout-shifted blocks, and includes page, bounding-box, and semantic-role evidence when available.
Produces deterministic JSON, compact AI review JSON, Markdown, and self-contained HTML reports without timestamps, absolute paths, or random IDs. AI review tags identify repeated page-region changes separately from body text.
Uses /ToUnicode when available and a conservative
Base14 Latin fallback for safe Helvetica, Times, and
Courier-family simple-font text.
Compares selected images, vector operations, annotations, links, form fields, outlines, name trees, metadata, XMP, and embedded file surfaces through typed deterministic signatures.
Surfaces unsupported or degraded PDF features through stable diagnostic codes instead of hiding gaps behind silent fallbacks.
Runs configured PDF pairs with threshold checks, baseline suppression, deterministic artifacts, and a composite GitHub Action wrapper.
Can call SPDFDIFF_OCR_COMMAND or tesseract for supported
image-only samples while preserving provenance and diagnostics.
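The OCR fallback described above could be resolved along these lines. This is a minimal sketch: the function names are hypothetical, and the precedence (a non-empty SPDFDIFF_OCR_COMMAND wins, otherwise tesseract) is an assumption rather than documented behavior.

```rust
use std::env;

/// Picks the OCR executable: the override if it is set and non-empty,
/// otherwise `tesseract`. The precedence is assumed for illustration.
fn resolve_ocr_command(override_value: Option<&str>) -> String {
    match override_value.map(str::trim) {
        Some(cmd) if !cmd.is_empty() => cmd.to_string(),
        _ => "tesseract".to_string(),
    }
}

/// Reads the override from the process environment.
fn ocr_command_from_env() -> String {
    resolve_ocr_command(env::var("SPDFDIFF_OCR_COMMAND").ok().as_deref())
}
```

Keeping the decision in a pure function separate from the environment lookup makes the choice easy to test and easy to record as provenance in a report.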
Quickstart
Build the workspace from a checkout, then compare two PDFs with
the spdfdiff binary.
git clone https://github.com/eraydin/semantic-pdf-diff.git
cd semantic-pdf-diff
cargo build --workspace
.\target\debug\spdfdiff.exe diff .\old.pdf .\new.pdf --format json --output .\diff.json
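The last command above uses a Windows-style path. On other platforms the binary has no .exe suffix. A small sketch of locating the debug build portably, assuming Cargo's default target directory; `debug_binary` is a hypothetical helper, not part of the project:

```rust
use std::env::consts::EXE_SUFFIX;
use std::path::{Path, PathBuf};

/// Path to the debug build of `spdfdiff` under a checkout root.
/// Assumes the default `target` directory (no CARGO_TARGET_DIR override).
fn debug_binary(checkout_root: &Path) -> PathBuf {
    checkout_root
        .join("target")
        .join("debug")
        // EXE_SUFFIX is ".exe" on Windows and empty on most other platforms.
        .join(format!("spdfdiff{}", EXE_SUFFIX))
}
```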
CLI
diff: Compare two PDF files and write JSON, AI JSON, Markdown, or HTML.
spdfdiff diff old.pdf new.pdf --format md --output diff.md
inspect: Inspect parser-level PDF structure and diagnostics.
spdfdiff inspect document.pdf --format json
extract: Extract text blocks and semantic evidence from one PDF.
spdfdiff extract document.pdf --format json
check: Run configured PDF comparisons for CI and policy gates.
spdfdiff check --config .spdfdiff.toml
corpus and benchmark: Evaluate sample manifests and run deterministic benchmark smoke gates.
spdfdiff corpus samples --manifest samples/compatibility_corpus_manifest.json --fail-on-gate
spdfdiff benchmark --pages 50 --output benchmark.json
review: Send AI review JSON to an optional local OpenAI-compatible HTTP endpoint.
spdfdiff review ai-review.json --endpoint http://127.0.0.1:8080/v1 --model local-model
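For automation, the diff subcommand can be driven from Rust with the standard library. The sketch below uses only the flags shown above. `diff_args` and `run_diff` are hypothetical helpers, and it assumes spdfdiff is on PATH. Exit-code semantics are not documented here, so the result is reported only as success or failure.

```rust
use std::process::Command;

/// Argument list for `spdfdiff diff`, using only the documented flags.
fn diff_args(old: &str, new: &str, format: &str, output: &str) -> Vec<String> {
    ["diff", old, new, "--format", format, "--output", output]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Runs the comparison and reports whether the process exited successfully.
/// Assumes `spdfdiff` is discoverable on PATH.
fn run_diff(old: &str, new: &str, format: &str, output: &str) -> std::io::Result<bool> {
    let status = Command::new("spdfdiff")
        .args(diff_args(old, new, format, output))
        .status()?;
    Ok(status.success())
}
```

Passing arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.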
Reports
JSON is the stable machine contract. AI review JSON is compact and prompt-ready. Markdown is readable in code review. HTML is a self-contained evidence report with inline overlays where bounding boxes are available.
| Format | Use case |
|---|---|
| json | Automated diff gates and artifact storage. |
| ai-json | Neutral review prompts and local LLM workflows. |
| md | Pull request notes and human-readable summaries. |
| html | Self-contained visual evidence review. |
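A caller that accepts a format name from configuration might validate it before invoking the CLI. The sketch below is illustrative: `ReportFormat` is a hypothetical type, and it assumes the four values in the table are exactly the strings accepted by --format.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the four report formats in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReportFormat {
    Json,
    AiJson,
    Markdown,
    Html,
}

impl ReportFormat {
    /// The string assumed to be passed to `--format`.
    fn cli_value(self) -> &'static str {
        match self {
            ReportFormat::Json => "json",
            ReportFormat::AiJson => "ai-json",
            ReportFormat::Markdown => "md",
            ReportFormat::Html => "html",
        }
    }
}

impl FromStr for ReportFormat {
    type Err = String;

    /// Rejects anything outside the four known names.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "json" => Ok(ReportFormat::Json),
            "ai-json" => Ok(ReportFormat::AiJson),
            "md" => Ok(ReportFormat::Markdown),
            "html" => Ok(ReportFormat::Html),
            other => Err(format!("unsupported report format: {}", other)),
        }
    }
}
```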
CI
The repository includes a composite Action that runs
spdfdiff check, uploads deterministic artifacts, and fails the
workflow when configured thresholds are exceeded.
schema_version = "1"
output_dir = "target/spdfdiff-check"
formats = ["json", "html"]
fail_on_changes = true
[[pairs]]
name = "contract"
old = "old.pdf"
new = "new.pdf"
baseline = "approved-contract-diff.json"
max_diagnostics = 0
name: PDF semantic diff
on: [pull_request]
jobs:
  pdf-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: eraydin/semantic-pdf-diff@main
        with:
          config: .spdfdiff.toml
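To make the gating idea concrete, here is a minimal sketch of how a per-pair check might decide pass or fail. It is not the project's implementation. `PairPolicy` and `gate_failures` are hypothetical, and it assumes a baseline suppresses changes by matching change IDs, which is a simplification chosen for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-pair policy mirroring the config keys shown above.
struct PairPolicy {
    fail_on_changes: bool,
    max_diagnostics: usize,
}

/// Returns the reasons a pair fails the gate; an empty list means it passes.
/// Assumption: the baseline suppresses changes whose IDs it already lists.
fn gate_failures(
    policy: &PairPolicy,
    change_ids: &[&str],
    baseline_ids: &BTreeSet<&str>,
    diagnostics: usize,
) -> Vec<String> {
    // Keep only changes that the baseline has not approved.
    let unapproved: Vec<&str> = change_ids
        .iter()
        .copied()
        .filter(|id| !baseline_ids.contains(id))
        .collect();

    let mut failures = Vec::new();
    if policy.fail_on_changes && !unapproved.is_empty() {
        failures.push(format!("{} unapproved change(s)", unapproved.len()));
    }
    if diagnostics > policy.max_diagnostics {
        failures.push(format!(
            "{} diagnostic(s) exceed limit {}",
            diagnostics, policy.max_diagnostics
        ));
    }
    failures
}
```

Returning a list of reasons rather than a bare boolean lets a CI wrapper print every violated threshold in one run.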
Schemas
Stable report schemas live in the repository so downstream tools can validate diff reports, AI review artifacts, and CI check summaries.
Compatibility
semantic-pdf-diff is a compatibility-gate project. It is useful
for controlled, digitally generated PDFs and the committed sample
scenarios, but it does not yet claim broad public-alpha
compatibility with arbitrary real-world PDFs.
Covered areas include the /ToUnicode and conservative Base14
Latin text extraction paths, structure-tree summaries,
incremental-update metadata, and resource-limit enforcement.
Architecture
Core PDF parsing stays in pdf_core. Content interpretation,
text extraction, semantic layout, diff matching, reporting, and
CLI orchestration remain separated so downstream crates do not
couple semantic diff logic directly to raw PDF object internals.
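The layering can be pictured with a toy module layout. Only the name pdf_core comes from the text above. The `extract` and `diff` modules, their types, and their functions are hypothetical stand-ins meant to show the dependency direction, not the real crate boundaries.

```rust
/// Stand-in for the parsing layer; its contents here are hypothetical.
mod pdf_core {
    pub struct RawPage {
        /// Simplified placeholder for decoded text runs on a page.
        pub text_runs: Vec<String>,
    }
}

/// Hypothetical extraction layer: the only module that touches `pdf_core`.
mod extract {
    use super::pdf_core::RawPage;

    #[derive(Debug, Clone, PartialEq)]
    pub struct TextBlock {
        pub page: usize,
        pub text: String,
    }

    /// Turns each text run into one block tagged with its page index.
    pub fn text_blocks(page_index: usize, page: &RawPage) -> Vec<TextBlock> {
        page.text_runs
            .iter()
            .map(|run| TextBlock { page: page_index, text: run.clone() })
            .collect()
    }
}

/// Hypothetical diff layer: sees only `TextBlock`, never raw PDF objects.
mod diff {
    use super::extract::TextBlock;

    /// Counts positionally paired blocks whose text differs.
    pub fn modified_count(old: &[TextBlock], new: &[TextBlock]) -> usize {
        old.iter()
            .zip(new)
            .filter(|(a, b)| a.text != b.text)
            .count()
    }
}
```

Because `diff` imports nothing from `pdf_core`, a change to the parser's internal representation cannot reach the matching logic without first passing through the extraction types.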