Hi Mark,

Glad you found Scribe OCR <https://scribeocr.com/> useful. Regarding character support, all characters in the Windows-1252 <https://en.wikipedia.org/wiki/Windows-1252> character set should currently be supported. This includes æ and œ, so if you ran into reproducible issues with those characters, please let me know and I can investigate. Unfortunately, the long s (ſ) is not part of Windows-1252, so it is not supported yet.
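As a quick way to see which characters fall inside Windows-1252, the sketch below uses Python's built-in `cp1252` codec as a stand-in for the character set. It only illustrates the encoding limit; the helper name `in_windows_1252` is made up for this example, and nothing here reflects how Scribe OCR itself handles characters.

```python
# Sketch: test which characters a single-byte Windows-1252 encoding can hold.
# Python's "cp1252" codec stands in for the character set (an assumption made
# for illustration, not Scribe OCR's own check).

def in_windows_1252(ch: str) -> bool:
    """Return True if the character can be encoded as Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


for ch in "æœſ":
    print(f"U+{ord(ch):04X} {ch}: {in_windows_1252(ch)}")
```

Here æ and œ each encode to a single byte, while ſ (U+017F) raises `UnicodeEncodeError` and is reported as unsupported.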
Including characters outside of this set is actually fairly involved, as it requires switching to a different encoding and embedded font format when writing the PDF (from simple [Type 1] to composite [Type 0]). However, I am already working on implementing this, as it is required to support non-Latin languages, so it will probably be possible to add characters outside of the Windows-1252 set at some point in the next month.

-Jeremiah

On Wednesday, April 10, 2024 at 12:03:19 PM UTC-7 mar...@gmail.com wrote:

> Hi Jeremiah,
>
> Thanks so much, this is a fantastic tool. I just tried using the Scribe
> OCR website to edit an hocr file that was generated with Tesseract against
> its source image, and it worked perfectly. I was also able to make some
> edits then successfully generate and download a PDF containing the image
> and edited text. This is great, just what I needed.
>
> The only issue that I ran into was that it doesn't seem to support the
> Latin characters and ligatures that I need, like æ, œ, ſ, etc. That's
> probably not a complicated fix on my end, I'll just have to dig around in
> the source code. If you could point me in the right direction it would be
> greatly appreciated.
>
> Thanks again for your hard work on this, I'll certainly be in touch with
> more questions about Scribe.
>
> Mark
> On Sunday 31 March 2024 at 04:18:09 UTC-4 Jeremiah wrote:
>
>> There currently is no desktop application, so running requires either (1)
>> using the public site on scribeocr.com or (2) serving the files on your
>> local system using an HTTP server. I added instructions to the README
>> <https://github.com/scribeocr/scribeocr?tab=readme-ov-file#running> for
>> running locally, which I will also paste below.
>>
>> git clone --recursive https://github.com/scribeocr/scribeocr.git
>> cd scribeocr
>> npm i
>> npx http-server
>>
>> The site can then be visited from a browser at the location printed by
>> `npx http-server`.
>>
>> On Saturday, March 30, 2024 at 12:25:34 PM UTC-7 zdenop wrote:
>>
>>> Hello Jeremiah,
>>>
>>> this looks like a very interesting and nice app. Any instructions for
>>> installation?
>>>
>>> I just downloaded the code from GH but recognizing text doesn't work for me:
>>>
>>> [image: image.png]
>>>
>>> BR,
>>>
>>> Zdenko
>>>
>>> On Sat, 30 Mar 2024 at 8:41, Jeremiah <jeremia...@gmail.com> wrote:
>>>
>>>> You can proofread and correct .hocr files made by Tesseract using
>>>> scribeocr.com, which is an open source program I wrote to address
>>>> difficulties proofreading OCR data. A video demo can be seen here
>>>> <https://www.youtube.com/watch?v=aWDiq3t1EeA>, and the GitHub repo is
>>>> here <https://github.com/scribeocr/scribeocr>. The program positions
>>>> the glyphs precisely over the source image, which (in my experience)
>>>> reduces the time spent proofreading by 90% versus other methods. A
>>>> screenshot is below.
>>>>
>>>> [image: scribe_screenshot.PNG]
>>>>
>>>> Proofreading .pdfs created by Tesseract is unfortunately not possible,
>>>> given that (as you experienced personally) the precise glyph
>>>> metrics/positioning data is lost when exporting to .pdf. However, if you
>>>> upload the source image alongside a .hocr file from Tesseract (with
>>>> `hocr_char_boxes: '1'` to include glyph-level data), it should have much
>>>> more information to position glyphs with. After proofreading is done, a
>>>> .pdf can be exported using the site. Alternatively, you can run
>>>> recognition directly in the browser using a built-in build of Tesseract,
>>>> which will produce the most accurate overlay due to several changes to
>>>> Tesseract to improve positioning. The site is still under active
>>>> development, so if you try it and experience any issues please let me know
>>>> via a GitHub issue or email to ad...@scribeocr.com.
>>>> On Friday, March 15, 2024 at 12:12:39 PM UTC-7 Mark Pellegrino wrote:
>>>>
>>>>> Hi Art,
>>>>>
>>>>> Thanks so much for this.
>>>>> These are very intriguing tools. I'll definitely give Aletheia a try.
>>>>> It seems more suited to my needs than Abbyy. I'll report back once
>>>>> I've done some experimentation.
>>>>>
>>>>> Best,
>>>>> Mark
>>>>>
>>>>> On Wed, Mar 13, 2024 at 3:00 PM Art Rhyno <artr...@uwindsor.ca> wrote:
>>>>>
>>>>>> In addition to hocr, Tesseract can produce the alto format, and this
>>>>>> allows the use of the Aletheia editor [1] from the Prima folks. I
>>>>>> haven’t done much correction of hand-written materials but Aletheia
>>>>>> seems flexible for a Windows environment and exports the page format.
>>>>>> You also can start with hocr and/or roundtrip between alto, hocr,
>>>>>> page, and other xml formats with the ocr-fileformat project [2],
>>>>>> which includes some Prima plumbing.
>>>>>>
>>>>>> Merlijn and the IA folks have great tools for combining hocr and
>>>>>> images to make a lightweight PDF if that’s your end-goal [3].
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> art
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> 1. https://www.primaresearch.org/tools/Aletheia
>>>>>>
>>>>>> 2. https://github.com/UB-Mannheim/ocr-fileformat
>>>>>>
>>>>>> 3. https://git.archive.org/merlijn/archive-pdf-tools
>>>>>>
>>>>>> *From:* tesser...@googlegroups.com <tesser...@googlegroups.com> *On
>>>>>> Behalf Of *Mark Pellegrino
>>>>>> *Sent:* Wednesday, March 13, 2024 11:25 AM
>>>>>> *To:* tesser...@googlegroups.com
>>>>>> *Subject:* Re: [tesseract-ocr] Re: Post OCR Verification and Editing
>>>>>>
>>>>>> Hi Zdenko,
>>>>>>
>>>>>> Thank you so much for your continued interest. I'll provide a little
>>>>>> more context: I work for a rare book library in Canada and I have
>>>>>> around 10,000 pages of digitized, hand-written Latin manuscripts that
>>>>>> I'm trying to OCR.
>>>>>>
>>>>>> I normally use Abbyy OCR Editor, which has good recognition but
>>>>>> struggles with Latin, particularly with ligatures or antiquated
>>>>>> characters like a long-s. Tesseract used with the training data
>>>>>> available from latirocr.org <http://latirocr.org/> has much better
>>>>>> recognition, near perfect. However, my issue with Tesseract is that I
>>>>>> am unable to define a recognition area in the image, and therefore
>>>>>> many unwanted elements on the page like smudges, pen marks, tears,
>>>>>> decorative elements, etc., are also recognized with jumbled
>>>>>> characters. I understand that I can preprocess the image in Photoshop
>>>>>> to remove these unwanted elements, then generate hocr with Tesseract,
>>>>>> then merge the hocr with the original unprocessed image, but at my
>>>>>> scale that's particularly laborious. I was hoping to OCR all of the
>>>>>> images then use an OCR editor like Acrobat or Abbyy to edit out any
>>>>>> unwanted characters or inspect the OCR for accuracy, but it appears
>>>>>> that Tesseract's use of a glyphless font makes that impossible.
>>>>>>
>>>>>> Here's what happens if I try to open a Tesseract-made PDF in Acrobat.
>>>>>> Like you mentioned, it opens just fine, but when the 'Make OCR
>>>>>> Visible' option is enabled all of the text turns into black boxes
>>>>>> (it's not an issue of redaction). My understanding is that because of
>>>>>> the lack of any embedded font information in the file, Acrobat can't
>>>>>> make sense of the text layer because there are no associated glyphs
>>>>>> to present on screen. Tesseract PDFs won't open in Abbyy OCR Editor
>>>>>> or FineReader at all, I'm guessing for the same reason.
>>>>>>
>>>>>> Thanks for reading. I'll look further into hocr editing tools. I'm
>>>>>> hoping other institutions can share their procedures for similar
>>>>>> projects.
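One possible way to approximate a recognition area after the fact, rather than preprocessing every image, is to filter the hOCR by bounding box. The sketch below is only an illustration under stated assumptions: the hOCR fragment is hand-written and simplified, the region coordinates are hypothetical, and the helper names `word_bbox` and `words_in_region` are invented for this example. It relies only on the hOCR convention that each word carries a `bbox x0 y0 x1 y1` entry in its `title` attribute.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written, simplified hOCR fragment for illustration only. Real Tesseract
# output is a full XHTML document and may need namespace-aware parsing.
HOCR = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocrx_word' title='bbox 120 200 260 240; x_wconf 95'>Gallia</span>
  <span class='ocrx_word' title='bbox 280 200 330 240; x_wconf 93'>est</span>
  <span class='ocrx_word' title='bbox 10 1350 60 1390; x_wconf 21'>~;|</span>
</div>"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def word_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(v) for v in match.groups()) if match else None


def words_in_region(hocr, region):
    """Return the text of ocrx_word spans lying fully inside region."""
    rx0, ry0, rx1, ry1 = region
    kept = []
    for span in ET.fromstring(hocr).iter("span"):
        if span.get("class") != "ocrx_word":
            continue
        box = word_bbox(span.get("title"))
        if box is None:
            continue
        x0, y0, x1, y1 = box
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            kept.append(span.text)
    return kept


# Assumed text area in pixel coordinates (hypothetical values).
TEXT_BLOCK = (100, 150, 900, 1300)
print(words_in_region(HOCR, TEXT_BLOCK))
```

With this sample, the two words inside the assumed text block are kept and the stray marks near the bottom-left corner are dropped. Whether a fixed rectangle is good enough depends on how consistent the page layout is across the collection.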
>>>>>>
>>>>>> All the best,
>>>>>>
>>>>>> On Sat, Mar 9, 2024 at 12:52 PM Zdenko Podobny <zde...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> "there's no way to use an off-the-shelf text editor with a glyphless
>>>>>> font."
>>>>>>
>>>>>> I converted
>>>>>> https://github.com/tesseract-ocr/test/blob/main/testing/8087_054.3B.tif
>>>>>> to pdf:
>>>>>>
>>>>>> tesseract 8087_054.3B.tif 8087_054.3B pdf
>>>>>>
>>>>>> I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro
>>>>>> Version 2023.008.20555 64 bit (on Windows 11).
>>>>>>
>>>>>> However, it seems that it ignores the Tesseract text layer and runs
>>>>>> its own text recognition (including font identification).
>>>>>>
>>>>>> I tried to open 8087_054.3B.pdf at
>>>>>> https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and I can
>>>>>> modify the text:
>>>>>>
>>>>>> Also https://tinywow.com/pdf/edit seems to work:
>>>>>>
>>>>>> IMO if a pdf tool offers text editing, it should work with tesseract
>>>>>> output too.
>>>>>>
>>>>>> BR,
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>> On Fri, 8 Mar 2024 at 20:24, Mark Pellegrino <mar...@gmail.com> wrote:
>>>>>>
>>>>>> Thank you Merlijn, this is very helpful. I'm very interested in IA's
>>>>>> process so I'll have a deep dive through those tools. This confirms my
>>>>>> suspicions that there's no way to use an off-the-shelf text editor
>>>>>> with a glyphless font. I'll explore these hOCR editor options. All
>>>>>> the best,
>>>>>>
>>>>>> On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W.
>>>>>> Wajer <mer...@archive.org> wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> On 07/03/2024 20:53, Mark Pellegrino wrote:
>>>>>> > I found more info here:
>>>>>> > https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>>>>>> >
>>>>>> > Glyphless appears to be an 'invisible font' and all that Tesseract
>>>>>> > supports. It seems like the solution is to use Tesseract to generate
>>>>>> > hOCR, then use another tool to combine the source image with the hOCR?
>>>>>> >
>>>>>> > Does anyone have a simple workflow for editing/correcting Tesseract
>>>>>> > OCR documents that they can share?
>>>>>>
>>>>>> If you're looking to do OCR and PDF generation separately, you might
>>>>>> want to look into the Internet Archive's PDF generation tooling, which
>>>>>> is designed to do exactly this (plus some aggressive compression):
>>>>>> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm
>>>>>> the author of the tooling)
>>>>>>
>>>>>> As for viewing and editing hOCR, there are a lot of different tools
>>>>>> around, not all fully functional (I haven't tried most of these):
>>>>>>
>>>>>> * https://www.not-implemented.de/hocr-proofreader/
>>>>>> * https://github.com/kba/hocrjs
>>>>>> * https://github.com/GeReV/hocr-editor-ts /
>>>>>>   https://github.com/GeReV/HocrEditor
>>>>>>
>>>>>> There are also some GUI tools that I recall for editing hOCR, but they
>>>>>> might require you to convert to another format first.
>>>>>>
>>>>>> Regards,
>>>>>> Merlijn
>>>>>>
>>>>>> >
>>>>>> > Thanks again,
>>>>>> >
>>>>>> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>>>>>> >
>>>>>> > Hello,
>>>>>> > I'm trying to check PDFs made with Tesseract 5.2 for correctness
>>>>>> > using an OCR editor but am unable to open them in either Abbyy or
>>>>>> > Acrobat.
>>>>>> >
>>>>>> > If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
>>>>>> > the software just hangs and crashes. I can open Tesseract PDFs with
>>>>>> > Acrobat Pro, but when I enable the 'Make OCR text visible' option
>>>>>> > in Preflight, all of the text layer turns into unreadable black
>>>>>> > boxes. The font used shows as 'GlyphLessFont' and appears to be
>>>>>> > embedded in the file.
>>>>>> >
>>>>>> > It doesn't matter what training data I use, or what the source image
>>>>>> > was, I always get these results. Any other non-Tesseract made PDF
>>>>>> > works just fine. I'm guessing that the issue is a missing font? I
>>>>>> > don't have much of an understanding about how embedded PDF fonts
>>>>>> > work and I haven't found anything about this in the Tesseract docs.
>>>>>> > Can someone please point me in the right direction? Thanks.
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > You received this message because you are subscribed to the Google
>>>>>> > Groups "tesseract-ocr" group.
>>>>>> > To unsubscribe from this group and stop receiving emails from it, send
>>>>>> > an email to tesseract-oc...@googlegroups.com
>>>>>> > <mailto:tesseract-oc...@googlegroups.com>.
>>>>>> > To view this discussion on the web visit
>>>>>> > https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
>>>>>> > <https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to a topic in
>>>>>> the Google Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this topic, visit
>>>>>> https://groups.google.com/d/topic/tesseract-ocr/d6ASNhJZUtw/unsubscribe.
>>>>>> To unsubscribe from this group and all its topics, send an email to
>>>>>> tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/df501350-63de-4984-8a86-7831ccfb1477n%40googlegroups.com.