Creating a Searchable PDF from hOCR inputhocr2pdf: Is a command line front-end for the image processing library to create perfectly layouted, searchable PDF files from hOCR, annotated HTML, input obtained from an OCR system.
hOCR, annotated HTML, input must be provided to STDIN, and the image data is read using the filename from the -i or --input argument. For example:
hocr2pdf -i scan.tiff -o test.pdf < cuneiform-out.hocr
By default the text layer is hidden by the real image data. Including image data can be disabled via the -n, --no-image, so that just the recognized text from the OCR is visible - e.g. for debugging or to save storage space:
hocr2pdf -i scan.tiff -n -o test.pdf < cuneiform-out.hocr 
Tips & TricksToo many gabs between letters in individual words
This might be a problem with imprecise OCR data or justified text with huge gabs. ExactImage includes a special mode activated with the command line argument -s, --sloppy-text, to group glyphs between whitespace to words which can help PDF viewers to produce better results while cut and pasting text:
hocr2pdf -i scan.tiff -s -o test.pdf < cuneiform-out.hocr 
|