Unjs/unpdf: PDF extraction and rendering across all JavaScript runtimes (github.com)

0 points | 1 day ago

🤖 AI Summary

unpdf is a cross-runtime utility library that bundles a serverless-optimized build of Mozilla's PDF.js to provide PDF extraction and rendering across Node.js, Deno, Bun, browsers, and edge/serverless environments. It is pitched as a modern, zero-dependency alternative to pdf-parse and is aimed at serverless AI applications: document ingestion, summarization, link extraction, and image extraction for downstream LLM pipelines. The package ships with a custom serverless PDF.js build (based on v5.4.149) that inlines the worker and applies platform tweaks (string replacements, global mocks) so that it runs on edge runtimes out of the box. You can also opt into the official or legacy PDF.js build if needed.

Key features and technical notes: the high-level APIs are extractText (with an optional mergePages setting that returns the text as a single string instead of per page), extractLinks, extractImages (returns raw pixel buffers with width, height, and channels), and renderPageAsImage (returns an ArrayBuffer or a data URL), plus getResolvedPDFJS and definePDFJSModule for low-level access. Image rendering in Node requires the @napi-rs/canvas package and the official PDF.js build. unpdf's serverless bundle includes a polyfill for Promise.withResolvers, which PDF.js v5.x relies on and which is missing in Node versions before 22. The library simplifies PDF preprocessing for AI workflows by removing build friction on edge platforms while preserving full access to PDF.js when you need more control.
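To make the ingestion use case concrete, here is a minimal TypeScript sketch of what might happen to extracted text before it reaches an LLM pipeline. The `ExtractedText` shape mirrors the extractText behaviour described above (one string with mergePages, otherwise one string per page) but is an assumption for illustration, and `toChunks` is a hypothetical helper, not part of unpdf. The commented-out calls follow the usage shown in the unpdf README as far as I know; check the README for the exact signatures.

```typescript
// Assumed result shape for illustration: `text` is a single string when
// pages are merged, otherwise one string per page.
interface ExtractedText {
  totalPages: number;
  text: string | string[];
}

// Hypothetical helper (not part of unpdf): collapse whitespace and split
// each page into chunks of at most `maxChars` characters. Chunks never
// span a page boundary, and empty pages produce no chunks.
function toChunks(result: ExtractedText, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const pages = Array.isArray(result.text) ? result.text : [result.text];
  const chunks: string[] = [];
  for (const page of pages) {
    const clean = page.replace(/\s+/g, " ").trim();
    for (let i = 0; i < clean.length; i += maxChars) {
      chunks.push(clean.slice(i, i + maxChars));
    }
  }
  return chunks;
}

// Stand-in for a real extraction result. With unpdf installed, the result
// would presumably come from something like:
//   const pdf = await getDocumentProxy(new Uint8Array(buffer));
//   const result = await extractText(pdf, { mergePages: false });
const sample: ExtractedText = {
  totalPages: 2,
  text: ["Hello   world", "second\npage"],
};

const chunks = toChunks(sample, 8);
console.log(chunks);
```

Keeping the chunking step separate from the extraction call means the same helper works on any runtime the library targets, and it can be tested without a PDF on hand. Fixed-size character chunks are only the simplest option; a real pipeline would more likely split on sentence or token boundaries.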
