Free web tool

PDF to Structured Data

Turn a PDF into reading-order Markdown and RAG-ready JSON — tables, math as LaTeX, and figure captions preserved. Useful for feeding papers into AI and retrieval pipelines. Your file is processed in memory and never stored.

Your PDF is processed in memory and discarded immediately — nothing is stored. Extraction runs on our own infrastructure using OpenDataLoader (Apache-2.0); your document never leaves it.

How to extract structured data from a PDF

1. Drop a .pdf in the box, or click to choose one. Up to 20 MB.
2. Click Extract structure. Parsing runs server-side and takes a few seconds.
3. Switch between the Markdown and JSON tabs to see reading-order text or the structured tree.
4. Download or copy whichever output you need for your pipeline.

About this tool

This tool extracts a PDF's content as structured data: reading-order Markdown for humans, and a JSON tree (blocks, tables, math, figure captions, bounding boxes) for AI and retrieval pipelines. Extraction is deterministic and runs server-side, entirely on our own infrastructure.
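
To show how a JSON tree like this might feed a retrieval pipeline, here is a minimal Python sketch that flattens a tree into chunks, each tagged with its nearest heading, page, and bounding box. The key names used here (type, text, children, page, bbox) and the SAMPLE document are assumptions for illustration only; check the JSON tab for the field names the tool actually emits and adjust accordingly.

```python
import json
from typing import Any, Iterator

# Hypothetical sample tree. The real output may use different key names.
SAMPLE: dict[str, Any] = {
    "type": "document",
    "children": [
        {"type": "heading", "text": "1 Introduction", "page": 1, "bbox": [72, 700, 540, 720]},
        {"type": "paragraph", "text": "PDF extraction is hard.", "page": 1, "bbox": [72, 640, 540, 690]},
        {"type": "table", "page": 2, "children": [
            {"type": "caption", "text": "Table 1: Results.", "page": 2, "bbox": [72, 500, 540, 515]},
        ]},
    ],
}


def iter_blocks(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every node that carries non-empty text, depth-first in document order."""
    text = node.get("text")
    if isinstance(text, str) and text.strip():
        yield node
    for child in node.get("children", []):
        yield from iter_blocks(child)


def to_chunks(tree: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the tree into chunks, attaching the most recent heading to each one."""
    chunks: list[dict[str, Any]] = []
    heading = None
    for block in iter_blocks(tree):
        if block.get("type") == "heading":
            heading = block["text"]  # remember it; headings are not emitted as chunks
            continue
        chunks.append({
            "text": block["text"],
            "heading": heading,
            "page": block.get("page"),
            "bbox": block.get("bbox"),
        })
    return chunks


chunks = to_chunks(SAMPLE)
print(json.dumps(chunks, indent=2))
```

Keeping the page and bounding box on each chunk lets a retrieval pipeline point back to the exact region of the PDF that an answer came from.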

What file formats does it accept?

PDF files up to 20 MB. For Word, PowerPoint, Excel, HTML, or EPUB, use the File to Markdown tool instead.
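
If you are scripting around the tool, a quick local pre-check can catch files that would be rejected. The sketch below is an assumption-laden example, not part of the tool: it treats "20 MB" as 20 × 1024 × 1024 bytes (the page does not say which definition applies), and it looks for the %PDF- marker within the first 1024 bytes, which is where PDF readers conventionally expect it. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: "20 MB" is read as 20 MiB. The tool may count bytes differently.
MAX_BYTES = 20 * 1024 * 1024


def check_upload(size: int, head: bytes) -> list[str]:
    """Return a list of problems for a file of `size` bytes starting with `head`.

    An empty list means the file passes both checks.
    """
    problems: list[str] = []
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")
    if b"%PDF-" not in head[:1024]:
        problems.append("no %PDF- header found; this may not be a PDF")
    return problems


def check_file(path: str) -> list[str]:
    """Convenience wrapper: read the size and first 1024 bytes from disk."""
    p = Path(path)
    with p.open("rb") as f:
        head = f.read(1024)
    return check_upload(p.stat().st_size, head)
```

Passing these checks only means the file looks like a PDF of acceptable size; it says nothing about whether the PDF contains a text layer.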

Is my PDF stored?

No. The PDF is processed in memory in an ephemeral container and discarded immediately. Extraction runs on Purplelink's own infrastructure — your document is never sent to any third-party service, and nothing is written to durable storage or logs.

How is this different from the File to Markdown tool?

File to Markdown is multi-format but simpler. This tool is PDF-only and extraction-grade: it preserves reading order, detects tables, renders math as LaTeX, and returns structured JSON suited to RAG and AI pipelines.

Does it handle scanned PDFs?

Not yet. This version reads PDFs that already contain a text layer. Scanned, image-only PDFs need OCR, which isn't in this free tool.
