Hi, Gunnar,

Do you think this SO question <https://stackoverflow.com/questions/49363954/using-arialmt-for-arabic-text-without-embedding-font-with-pdfbox> is related? I'm the OP, and support for glyphless (i.e. non-rendered) characters in a PDF, admittedly a somewhat niche case, is a capability I have been missing.

To give some context, at work I'm responsible for a library that, among other things, overlays OCRed text (from diverse sources) on images placed in PDF pages. There have been issues I've overcome (especially concerning Unicode), but "glyphless font" embedding is something that would make a noticeable difference to PDF size. Most OCR software that produces PDFs from images does this in some way, Tesseract included.

I think PDFBox is a great library for reading and generating PDFs, and I'm seriously considering contributing as soon as possible. A big thanks to everyone working to make this project successful.

C.D.

-- 
There is a computer disease that anybody who works with computers knows about. It's a very serious disease and it interferes completely with the work. The trouble with computers is that you 'play' with them!
- Richard P. Feynman

On Thu, Mar 25, 2021 at 2:30 PM Gunnar Brand <gunnar.br...@interface-projects.de> wrote:

> Hi.
>
> The process is as follows:
> 1) For images: use the image.
>    For PDFs: render each page at 300 dpi (since optimized PDFs don't
>    necessarily have a single big image), maybe even with text if text
>    extraction returned gibberish (missing Unicode mapping).
> 2) Use tesseract to OCR the image/page with PDF and HOCR output (for
>    pages: create an imageless PDF). The HOCR is used for additional page
>    layout information and word confidence values.
> 3) For images, use the HOCR to filter the PDF text stream and add layout
>    information.
>    For PDFs, insert the tesseract PDF text stream into the original PDF's
>    page (+ add that glyphless font), then use the HOCR to filter and add
>    layout information.
>
> For step 3, I would like to use a normal PDPageContentStream to add the
> content instead of working with a raw stream. But that step fails since I
> cannot use the showText() method with a font that has an empty cmap.
>
> I attached an empty tesseract PDF with the glyphless font. Appending text
> using the font to the single page in there will fail immediately with the
> exception due to the empty cmap. Adding the font to any other PDF and
> trying to show text using it will fail as well.
>
> I can probably get away with just creating/transferring the Tj commands
> raw, but I was wondering whether the empty cmap behaviour is OK, or whether
> it would be better to ignore empty cmaps (i.e. look for a non-empty one
> first and return null if none can be found in
> TrueTypeFont.getUnicodeCmapImpl).
>
> Gunnar
>
>
> -----Original Message-----
> From: Tilman Hausherr <thaush...@t-online.de>
> Sent: Thursday, 25 March 2021 04:37
> To: users@pdfbox.apache.org
> Subject: Re: Empty cmap in TTF Files.
>
> On 24.03.2021 at 14:40, Gunnar Brand wrote:
> > Hi.
> >
> > I am working on merging original PDFs and the PDF/HOCR output of
> > Tesseract, so as to create a searchable PDF. Transplanting the glyphless
> > font used by tesseract was no problem; it doesn’t matter if I simply use
> > the font in the original PDF or use cloneutil, when saving the file the
> > font is embedded properly.
> >
> > The problem is that when I show text using a content stream, I get a “No
> > Glyph for …” exception. I traced this down to the glyphless font
> > containing empty cmap tables. There is a CIDToGIDMap. Coincidentally,
> > PDFBOX-5103 just addressed this issue with a reverse mapping if the cmap
> > is null. But the cmap is just empty and will return 0 for any character
> > code, so this new feature will never work in this case.
> >
> > For testing I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that
> > it ignores empty cmap subtables (even the fallback at the end of the
> > method now being a loop). With this, PDFBox will happily use the
> > tesseract glyphless font. I don't know whether empty cmaps make any
> > sense at all; if they do, I will simply write raw show text commands,
> > but maybe it is something to consider?
> >
> > Gunnar
>
> I tried tesseract some time ago and it generates searchable PDFs out of
> the box, why not use that?
>
> Can you upload one of your files to a sharehoster so that I understand
> what this is about?
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
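
For reference, the idea Gunnar describes in the quoted thread (ignore empty cmap subtables and return null when none has mappings) can be sketched in Java roughly as follows. This is a minimal model under stated assumptions, not PDFBox's actual code: `Subtable`, `selectUnicodeCmap`, and the preference order of platform/encoding pairs are hypothetical stand-ins invented for illustration, and the real `TrueTypeFont.getUnicodeCmapImpl` may be structured differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative model only: Subtable is a hypothetical stand-in, not a PDFBox class.
class EmptyCmapSketch {

    /** Hypothetical cmap subtable: platform/encoding IDs plus a code -> glyph ID map. */
    static final class Subtable {
        final int platformId;
        final int encodingId;
        final Map<Integer, Integer> codeToGid;

        Subtable(int platformId, int encodingId, Map<Integer, Integer> codeToGid) {
            this.platformId = platformId;
            this.encodingId = encodingId;
            this.codeToGid = codeToGid;
        }

        boolean isEmpty() {
            return codeToGid.isEmpty();
        }

        /** Returns 0 for unmapped codes, which is what an empty table does for every code. */
        int getGlyphId(int code) {
            return codeToGid.getOrDefault(code, 0);
        }
    }

    // Assumed preference order of (platformId, encodingId) pairs, for illustration only.
    private static final int[][] PREFERRED = { {0, 4}, {3, 10}, {0, 3}, {3, 1}, {3, 0} };

    /** Picks the first non-empty preferred subtable, then any non-empty one, else null. */
    static Subtable selectUnicodeCmap(List<Subtable> subtables) {
        for (int[] pref : PREFERRED) {
            for (Subtable s : subtables) {
                if (s.platformId == pref[0] && s.encodingId == pref[1] && !s.isEmpty()) {
                    return s;
                }
            }
        }
        // Fallback written as a loop: skip empty subtables instead of taking the first one.
        for (Subtable s : subtables) {
            if (!s.isEmpty()) {
                return s;
            }
        }
        // Null lets the caller try another route (the thread mentions the reverse
        // mapping that PDFBOX-5103 applies when the cmap is null).
        return null;
    }

    public static void main(String[] args) {
        Subtable empty = new Subtable(3, 1, Collections.<Integer, Integer>emptyMap());
        Subtable other = new Subtable(1, 0, Collections.singletonMap(65, 36));
        System.out.println(selectUnicodeCmap(Arrays.asList(empty)) == null);
        System.out.println(selectUnicodeCmap(Arrays.asList(empty, other)).getGlyphId(65));
    }
}
```

The point of returning null rather than an empty subtable is that an empty table maps every code to glyph 0, so a caller cannot tell "no glyph" from "no usable cmap"; a null result makes that case explicit.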