mirror of https://github.com/kmein/niveum synced 2026-03-16 10:11:08 +01:00
niveum/packages/pdf-ocr.nix
Kierán Meinhardt 4fc29ff0fe package .bin/ scripts as proper nix packages, delete .bin/
Packaged 14 scripts from .bin/ into packages/ with proper dependency
declarations (writers.writeDashBin/writeBashBin/writePython3Bin):
- 256color → two56color (terminal color chart)
- avesta.sed → avesta (Avestan transliteration)
- bvg.sh → bvg (Berlin transit disruptions)
- unicode → charinfo (Unicode character info)
- chunk-pdf → chunk-pdf (split PDFs by page count)
- csv2json → csv2json (CSV to JSON converter)
- fix-sd.sh → fix-sd (exFAT SD card recovery, improved output handling)
- json2csv → json2csv (JSON to CSV converter)
- mp3player-write → mp3player-write (audio conversion for MP3 players)
- mushakkil.sh → mushakkil (Arabic diacritization)
- nix-haddock-index → nix-haddock-index (GHC Haddock index generator)
- pdf-ocr.sh → pdf-ocr (OCR PDFs via tesseract)
- prospekte.sh → prospekte (German supermarket flyer browser)
- readme → readme (GitHub README as man page)

All added to overlay and packages output. .bin/ directory removed.
2026-02-17 21:32:10 +01:00

30 lines
541 B
Nix

# OCR a PDF file to text using tesseract
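{
  writers,
  poppler_utils,
  tesseract,
  coreutils,
}:
writers.writeDashBin "pdf-ocr" ''
  # No -f here: the globs below need pathname expansion to stay enabled.
  set -eu
  # Check the argument before touching "$1", so set -u cannot abort first.
  [ "$#" -eq 1 ] && [ -f "$1" ] || {
    echo "Usage: pdf-ocr FILE.pdf" >&2
    exit 1
  }
  pdf_path="$(${coreutils}/bin/realpath "$1")"
  tmpdir="$(${coreutils}/bin/mktemp -d)"
  trap '${coreutils}/bin/rm -rf "$tmpdir"' EXIT
  cd "$tmpdir"
  # pdftoppm writes one PNG per page: pdf-ocr-<page>.png
  ${poppler_utils}/bin/pdftoppm -png "$pdf_path" pdf-ocr
  for png in pdf-ocr-*.png; do
    # tesseract appends .txt to the output base, giving pdf-ocr-<page>.png.txt
    ${tesseract}/bin/tesseract "$png" "$png" 2>/dev/null
  done
  ${coreutils}/bin/cat pdf-ocr-*.txt
''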
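
The per-page loop over pdf-ocr*.png only works if the shell expands the pattern, and POSIX `set -f` switches that expansion off. The following is a minimal standalone sketch of the difference, not part of the package. The scratch directory, the two empty placeholder files, and the variable names `demo_dir`, `with_glob` and `without_glob` are all hypothetical.

```shell
#!/bin/sh
# Hypothetical demo: how `set -f` changes what a glob pattern yields.
set -eu

demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT
cd "$demo_dir"

# Stand-ins for the PNG pages that pdftoppm would produce.
touch pdf-ocr-1.png pdf-ocr-2.png

# Pathname expansion enabled: the pattern becomes two arguments.
set -- pdf-ocr*.png
with_glob="$#: $*"

# Pathname expansion disabled: the pattern stays one literal word.
set -f
set -- pdf-ocr*.png
without_glob="$#: $*"
set +f

printf 'globbing on:  %s\n' "$with_glob"     # 2: pdf-ocr-1.png pdf-ocr-2.png
printf 'globbing off: %s\n' "$without_glob"  # 1: pdf-ocr*.png
```

With expansion disabled, a `for` loop over the pattern runs once with the literal string `pdf-ocr*.png`, so no page would be passed to tesseract.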