Creating a Searchable PDF from hOCR input

hocr2pdf is a command-line front end for the ExactImage image processing library. It creates precisely laid out, searchable PDF files from hOCR (annotated HTML) input obtained from an OCR system.

The hOCR (annotated HTML) input must be provided on STDIN, and the image data is read from the file named by the -i or --input argument. For example:

hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
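For multi-page jobs, the same invocation can be wrapped in a small shell function. The following is only a sketch: the function name hocr_page_to_pdf, the HOCR2PDF override variable, and the page-*.tiff / page-*.hocr naming scheme are assumptions made for illustration and are not part of hocr2pdf. It uses only the -i and -o options and the STDIN convention described above.

```shell
#!/bin/sh
# Command to run; override HOCR2PDF to point at a specific binary.
# (The variable name is an assumption of this sketch.)
HOCR2PDF=${HOCR2PDF:-hocr2pdf}

# Hypothetical helper: convert one page.
#   $1 = image file, $2 = hOCR file, $3 = output PDF
hocr_page_to_pdf() {
    image=$1
    hocr=$2
    pdf=$3
    # Fail early with a clear message instead of a confusing tool error.
    [ -r "$image" ] || { echo "missing image: $image" >&2; return 1; }
    [ -r "$hocr" ]  || { echo "missing hOCR: $hocr" >&2; return 1; }
    # hOCR goes to STDIN, the image is named with -i, the result with -o.
    "$HOCR2PDF" -i "$image" -o "$pdf" < "$hocr"
}

# Example loop, assuming each page-N.tiff has a matching page-N.hocr:
#   for img in page-*.tiff; do
#       hocr_page_to_pdf "$img" "${img%.tiff}.hocr" "${img%.tiff}.pdf"
#   done
```

Each page becomes its own PDF. Merging the pages into a single document is left to a separate tool.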

 

By default, the text layer is hidden behind the actual image data. Embedding the image can be disabled with the -n or --no-image option, so that only the text recognized by the OCR is visible, e.g. for debugging or to save storage space:

hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr


Tips & Tricks

Too many gaps between letters in individual words

This can be caused by imprecise OCR data or by justified text with large gaps. ExactImage includes a special mode, activated with the command line argument -s or --sloppy-text, that groups the glyphs between whitespace into words. This can help PDF viewers produce better results when copying and pasting text:

hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr
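To judge whether -s actually helps for a given scan, it can be useful to build both variants and compare copy and paste in a PDF viewer. A minimal sketch follows. The function name hocr_compare_sloppy, the HOCR2PDF override variable, and the "-sloppy" file suffix are assumptions for illustration only.

```shell
#!/bin/sh
# Command to run; override HOCR2PDF to point at a specific binary.
# (The variable name is an assumption of this sketch.)
HOCR2PDF=${HOCR2PDF:-hocr2pdf}

# Hypothetical helper: write BASE.pdf (default mode) and
# BASE-sloppy.pdf (with -s) from the same image and hOCR input.
#   $1 = image file, $2 = hOCR file, $3 = output base name
hocr_compare_sloppy() {
    image=$1
    hocr=$2
    base=$3
    # Default text placement.
    "$HOCR2PDF" -i "$image" -o "${base}.pdf" < "$hocr" || return 1
    # Same input, glyphs grouped into words via -s.
    "$HOCR2PDF" -i "$image" -s -o "${base}-sloppy.pdf" < "$hocr"
}

# Example:
#   hocr_compare_sloppy scan.tiff cuneiform-out.hocr test
```

Open both files, select the same paragraph in each, and paste it into a text editor to see which variant yields cleaner words.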

