Turn PDF documents into data sources

pdfQL is a query language built for PDFs that turns them into a source of reliable structured data.

PDFDATA makes using pdfQL easy, with a set of simple in-browser tools anyone can use, and a set of APIs for developers to automate workflows.

Early bird access is open!

All paid plans include premium pdfQL authoring during this "early bird" phase. Let us write and maintain your pdfQL queries for you!

What does pdfQL look like?

Here's an example pdfQL query that matches employer identification numbers (EINs) as they appear on many U.S. tax forms, along with the extracted data it outputs, rendered as JSON by default:

pdfQL: the best parts of SQL, for PDFs

  • Uses a declarative model: describe the shape of the data you want, not how to find it
  • Indexes data primitives to make queries fast
  • Offers arbitrary predicates and familiar relations (equivalent to AND, OR, and NOT) so you can filter results as precisely as you need
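To make the third point concrete, here is a minimal Python sketch of how predicates can be combined with AND, OR, and NOT to filter pieces of page content. This is not pdfQL syntax; the `Span` record, the combinator names (`all_of`, `any_of`, `not_`), and the sample data are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Span:
    """A hypothetical piece of page text with a style attribute."""
    text: str
    font_size: float
    bold: bool


Predicate = Callable[[Span], bool]


def all_of(*preds: Predicate) -> Predicate:
    """AND: every predicate must hold."""
    return lambda s: all(p(s) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """OR: at least one predicate must hold."""
    return lambda s: any(p(s) for p in preds)


def not_(pred: Predicate) -> Predicate:
    """NOT: the predicate must not hold."""
    return lambda s: not pred(s)


def matches(pattern: str) -> Predicate:
    """True when the whole span text matches the regular expression."""
    rx = re.compile(pattern)
    return lambda s: rx.fullmatch(s.text) is not None


def is_bold(s: Span) -> bool:
    return s.bold


# Made-up sample content, not taken from any real form.
spans = [
    Span("Employer identification number", 7, False),
    Span("12-3456789", 10, False),
    Span("98-7654321", 10, True),
    Span("Form W-2", 14, True),
]

# "Looks like an EIN AND is NOT bold."
ein_like = all_of(matches(r"\d{2}-\d{7}"), not_(is_bold))
results = [s.text for s in spans if ein_like(s)]

# "Is bold OR looks like an EIN."
bold_or_ein = any_of(is_bold, matches(r"\d{2}-\d{7}"))
loose_results = [s.text for s in spans if bold_or_ein(s)]
```

The point of the sketch is the shape of the filter: each condition is stated once and composed, rather than being buried in procedural parsing code.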

Queries, not fragile, costly programs

Stop using general-purpose programming languages to build fragile "parsers" on top of badly extracted PDF text. Flat text has already lost the layout, so those parsers can't express even simple spatial relationships or style expectations.

pdfQL queries are declarative, and they can draw on everything about your source documents: on-page positions; font styles, sizes, and colors; and the spatial relationships between content, lines, and boxes.
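As a rough illustration of why position and style matter, the Python sketch below finds the value printed directly under a label. It is not pdfQL and does not reflect how pdfQL works internally; the `Span` fields, the coordinate convention (y grows downward, units in points), the tolerances, and the sample data are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A hypothetical piece of page text with position and font size."""
    text: str
    x: float          # left edge, in points (assumed)
    y: float          # top edge, in points, increasing downward (assumed)
    font_size: float


def is_value_below(label: Span, candidate: Span,
                   max_gap: float = 20.0, x_tol: float = 5.0) -> bool:
    """Spatial and style test: left-aligned with the label, a short
    distance below it, and set in a larger font than the label."""
    aligned = abs(candidate.x - label.x) <= x_tol
    just_below = 0 < candidate.y - label.y <= max_gap
    bigger = candidate.font_size > label.font_size
    return aligned and just_below and bigger


def value_under(spans: list[Span], label_text: str) -> Optional[str]:
    """Return the text of the nearest qualifying span under the label."""
    best: Optional[tuple[float, Span]] = None
    for label in spans:
        if label.text != label_text:
            continue
        for cand in spans:
            gap = cand.y - label.y
            if is_value_below(label, cand) and (best is None or gap < best[0]):
                best = (gap, cand)
    return best[1].text if best else None


# Made-up page content: three EIN-shaped strings, only one under the label.
page = [
    Span("Employer identification number", 40, 100, 7),
    Span("12-3456789", 40, 112, 10),
    Span("98-7654321", 300, 112, 10),   # same row, different column
    Span("55-0000000", 40, 400, 10),    # same column, far down the page
]

ein = value_under(page, "Employer identification number")
missing = value_under(page, "No such label")
```

A regular expression over flat text would match all three numbers; only the position and font information singles out the one that belongs to the label.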

Data how you want it

You get to name every data element in your pdfQL queries, and thus every data element produced by running them. Match those names to the ones already used in your databases and systems for easy-peasy ingestion.

Flip one query flag to receive extracted data as JSON or CSV, whichever is easier to work with given your existing tools and skills.
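To show what that choice means downstream, here is a small Python sketch that renders the same named records as either JSON or CSV using only the standard library. The field names (`employer_ein`, `tax_year`), the sample values, and the `render` helper are hypothetical; this is not the PDFDATA API or its actual output.

```python
import csv
import io
import json

# Hypothetical extracted records; the keys stand in for names you
# would choose yourself in a query.
records = [
    {"employer_ein": "12-3456789", "tax_year": 2023},
    {"employer_ein": "98-7654321", "tax_year": 2023},
]


def render(rows: list[dict], fmt: str = "json") -> str:
    """Serialize named records as JSON (the default) or CSV."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        fieldnames = list(rows[0]) if rows else []
        writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")


as_json = render(records)
as_csv = render(records, "csv")
```

Because the names are chosen up front, the JSON keys and the CSV header line up with the column names in the system receiving the data.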

Ready to treat PDFs as just another data source?

Add yourself to the pdfQL early access pool. You'll be the first to know when it's ready, and you'll get a progress update every week or two in the meantime.