Re: [tesseract-ocr] Re: Post OCR Verification and Editing

Jeremiah Thu, 11 Apr 2024 19:38:24 -0700

Hi Mark,

Glad you found Scribe OCR <https://scribeocr.com/> useful.  Regarding 
character support, all characters in the Windows-1252 
<https://en.wikipedia.org/wiki/Windows-1252> character set should currently 
be supported.  This includes æ and œ, so if you encountered issues with 
those characters that can be replicated, please let me know and I can 
investigate.  Unfortunately, the ſ character is not included.
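
For anyone who wants to check whether a particular character falls inside Windows-1252, a minimal Python sketch follows. It relies only on Python's built-in `cp1252` codec; the helper name `in_windows_1252` is purely illustrative and is not part of Scribe OCR.

```python
def in_windows_1252(ch: str) -> bool:
    """Return True if the single character `ch` can be encoded as Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# The three characters discussed in this thread.
for ch in "æœſ":
    print(f"U+{ord(ch):04X} {ch!r}: {in_windows_1252(ch)}")
```

This reports æ and œ as encodable and ſ (U+017F) as not, which matches the behaviour described above.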


Including characters outside of this set is actually fairly involved, as it 
requires switching to a different encoding and embedded font format when 
writing the PDF (from simple [type 1] to composite [type 0]).  However, I 
am already working on implementing this as it is required to support 
non-Latin languages, so it will probably be possible to add characters 
outside of the Windows-1252 set at some point in the next month. 
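
For readers unfamiliar with the distinction, the rough difference can be sketched as follows. A simple font addresses glyphs with single-byte codes (approximated here with Python's `cp1252` codec, which is close to but not identical to PDF's WinAnsiEncoding), whereas a composite font is commonly used with two-byte codes. The `glyph_ids` table and its values are made up for illustration; a real PDF writer would take them from the embedded font, and this is not Scribe OCR's actual implementation.

```python
def simple_font_hex(text: str) -> str:
    """PDF hex string for a simple font: one byte per character (cp1252 here)."""
    return "<" + text.encode("cp1252").hex().upper() + ">"


def composite_font_hex(text: str, glyph_ids: dict[str, int]) -> str:
    """PDF hex string for a composite font: one two-byte code per character."""
    return "<" + "".join(f"{glyph_ids[ch]:04X}" for ch in text) + ">"


# Hypothetical glyph IDs; real values depend on the embedded font.
glyph_ids = {"ſ": 0x01A3, "æ": 0x0075}

print(simple_font_hex("æ"))                 # one byte: E6
print(composite_font_hex("ſæ", glyph_ids))  # two bytes per character
```

Calling `simple_font_hex("ſ")` raises `UnicodeEncodeError`, which is the limitation in a nutshell: a single-byte encoding has no slot for that character.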

-Jeremiah

On Wednesday, April 10, 2024 at 12:03:19 PM UTC-7 mar...@gmail.com wrote:

> Hi Jeremiah,
>
> Thanks so much, this is a fantastic tool. I just tried using the Scribe 
> OCR website to edit an hocr file that was generated with Tesseract against 
> its source image, and it worked perfectly. I was also able to make some 
> edits then successfully generate and download a PDF containing the image 
> and edited text. This is great, just what I needed.
>
> The only issue that I ran into was that it doesn't seem to support the 
> Latin characters and ligatures that I need, like æ, œ, ſ, etc. That's 
> probably not a complicated fix on my end; I'll just have to dig around in 
> the source code. If you could point me in the right direction, it would be 
> greatly appreciated.
>
> Thanks again for your hard work on this, I'll certainly be in touch with 
> more questions about Scribe. 
>
>  Mark
> On Sunday 31 March 2024 at 04:18:09 UTC-4 Jeremiah wrote:
>
>> There currently is no desktop application, so running requires either (1) 
>> using the public site on scribeocr.com or (2) serving the files on your 
>> local system using an HTTP server.  I added instructions to the README 
>> <https://github.com/scribeocr/scribeocr?tab=readme-ov-file#running> for 
>> running locally, which I will also paste below.
>>
>>     git clone --recursive https://github.com/scribeocr/scribeocr.git
>>     cd scribeocr
>>     npm i
>>     npx http-server
>>
>> The site can then be visited from a browser at the location printed by 
>> `npx http-server`.
>>
>> On Saturday, March 30, 2024 at 12:25:34 PM UTC-7 zdenop wrote:
>>
>>> Hello Jeremiah,
>>>
>>> this looks like a very interesting and nice app. Are there any 
>>> instructions for installation?
>>>
>>> I just downloaded the code from GitHub, but text recognition doesn't work 
>>> for me:
>>>
>>> [image: image.png]
>>>
>>> BR,
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 30 Mar 2024 at 8:41, Jeremiah <jeremia...@gmail.com> wrote:
>>>
>>>> You can proofread and correct .hocr files made by Tesseract using 
>>>> scribeocr.com, which is an open source program I wrote to address 
>>>> difficulties proofreading OCR data.  A video demo can be seen here 
>>>> <https://www.youtube.com/watch?v=aWDiq3t1EeA>, and the GitHub repo is 
>>>> here <https://github.com/scribeocr/scribeocr>.  The program positions 
>>>> the glyphs precisely over the source image, which (in my experience) 
>>>> reduces the time spent proofreading by 90% versus other methods.  A 
>>>> screenshot is below.
>>>>
>>>> [image: scribe_screenshot.PNG]
>>>>
>>>>
>>>> Proofreading .pdfs created by Tesseract is unfortunately not possible, 
>>>> given that (as you experienced personally), the precise glyph 
>>>> metrics/positioning data is lost when exporting to .pdf.  However, if you 
>>>> upload the source image alongside a .hocr file from Tesseract (with 
>>>> `hocr_char_boxes: '1'` to include glyph-level data), it should have much 
>>>> more information to position glyphs with.  After proofreading is done, a 
>>>> .pdf can be exported using the site.  Alternatively, you can run 
>>>> recognition directly in the browser using a bundled build of Tesseract, 
>>>> which will produce the most accurate overlay due to several changes made 
>>>> to Tesseract to improve positioning.  The site is still under active 
>>>> development, so if you try it and experience any issues, please let me 
>>>> know via a GitHub issue or email to ad...@scribeocr.com. 
>>>>
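
As a rough illustration of producing such a .hocr file outside the browser, the sketch below builds a Tesseract command line with character boxes enabled. It assumes that the `hocr_char_boxes: '1'` setting mentioned above corresponds to `-c hocr_char_boxes=1` on the command line, that a reasonably recent `tesseract` binary is on the PATH, and that the file names are placeholders. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def hocr_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract CLI call that writes <output_base>.hocr with character boxes."""
    return [
        "tesseract", str(image), str(output_base),
        "-l", lang,
        "-c", "hocr_char_boxes=1",  # include glyph-level boxes in the hOCR
        "hocr",                     # config file selecting hOCR output
    ]


def run_hocr(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract and return the path of the .hocr file it should have written."""
    subprocess.run(hocr_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".hocr")


# Print the command for placeholder file names without running OCR.
print(" ".join(hocr_command(Path("page.png"), Path("page"))))
```

The resulting .hocr file and the original image are the two inputs the upload workflow above expects.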
>>>> On Friday, March 15, 2024 at 12:12:39 PM UTC-7 Mark Pellegrino wrote:
>>>>
>>>>> Hi Art,
>>>>>
>>>>> Thanks so much for this. These are very intriguing tools. I'll 
>>>>> definitely give Aletheia a try. It seems more suited to my needs than 
>>>>> Abbyy. I'll report back once I've done some experimentation.
>>>>>
>>>>> Best,
>>>>> Mark
>>>>>
>>>>> On Wed, Mar 13, 2024 at 3:00 PM Art Rhyno <artr...@uwindsor.ca> wrote:
>>>>>
>>>>>> In addition to hOCR, Tesseract can produce the ALTO format, and this 
>>>>>> allows the use of the Aletheia editor [1] from the PRImA folks. I haven’t 
>>>>>> done much correction of handwritten materials, but Aletheia seems 
>>>>>> flexible for a Windows environment and exports the PAGE format. You can 
>>>>>> also start with hOCR and/or round-trip between ALTO, hOCR, PAGE, and 
>>>>>> other XML formats with the ocr-fileformat project [2], which includes 
>>>>>> some PRImA plumbing. Merlijn and the IA folks have great tools for 
>>>>>> combining hOCR and images to make a lightweight PDF if that’s your end 
>>>>>> goal [3].
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>  
>>>>>>
>>>>>> art
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> 1. https://www.primaresearch.org/tools/Aletheia
>>>>>>
>>>>>> 2. https://github.com/UB-Mannheim/ocr-fileformat
>>>>>>
>>>>>> 3. https://git.archive.org/merlijn/archive-pdf-tools
>>>>>>
>>>>>>  
>>>>>>
>>>>>> *From:* tesser...@googlegroups.com <tesser...@googlegroups.com> *On 
>>>>>> Behalf Of *Mark Pellegrino
>>>>>> *Sent:* Wednesday, March 13, 2024 11:25 AM
>>>>>> *To:* tesser...@googlegroups.com
>>>>>> *Subject:* Re: [tesseract-ocr] Re: Post OCR Verification and Editing
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Hi Zdenko, 
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Thank you so much for your continued interest. I'll provide a little 
>>>>>> more context: I work for a rare book library in Canada and I have around 
>>>>>> 10,000 pages of digitized, handwritten Latin manuscripts that I'm trying 
>>>>>> to OCR.
>>>>>>
>>>>>>  
>>>>>>
>>>>>> I normally use Abbyy OCR Editor, which has good recognition but 
>>>>>> struggles with Latin, particularly with ligatures or antiquated 
>>>>>> characters like the long s. Tesseract used with the training data 
>>>>>> available from latirocr.org <http://latirocr.org/> has much better 
>>>>>> recognition, near perfect. However, my issue with Tesseract is that I am 
>>>>>> unable to define a recognition area in the image, and therefore many 
>>>>>> unwanted elements on the page (smudges, pen marks, tears, decorative 
>>>>>> elements, etc.) are also recognized as jumbled characters. I understand 
>>>>>> that I can preprocess the image in Photoshop to remove these unwanted 
>>>>>> elements, then generate hOCR with Tesseract, then merge the hOCR with the 
>>>>>> original unprocessed image, but at my scale that's particularly 
>>>>>> laborious. I was hoping to OCR all of the images and then use an OCR 
>>>>>> editor like Acrobat or Abbyy to edit out any unwanted characters or 
>>>>>> inspect the OCR for accuracy, but it appears that Tesseract's use of a 
>>>>>> glyphless font makes that impossible.
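
One possible way to approximate a recognition area after the fact is to filter the hOCR by bounding box rather than retouching each image. The sketch below is only an illustration under stated assumptions: it relies on Tesseract's hOCR marking words as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, the class and function names are hypothetical, and the sample markup and coordinates are invented. It lists the words inside a region; it does not rewrite the hOCR file.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []   # list of (text, (x0, y0, x1, y1))
        self._depth = 0   # span nesting depth inside the current word
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:  # nested span, e.g. per-character ocrx_cinfo
            self._depth += 1
        elif "ocrx_word" in (attrs.get("class") or ""):
            m = BBOX.search(attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._depth, self._text = 1, []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))


def words_inside(hocr: str, region):
    """Return words whose bbox lies entirely within region = (x0, y0, x1, y1)."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    rx0, ry0, rx1, ry1 = region
    return [
        (text, box) for text, box in parser.words
        if box[0] >= rx0 and box[1] >= ry0 and box[2] <= rx1 and box[3] <= ry1
    ]


# Invented sample: one word in the text block, one stray mark in the margin.
sample = (
    "<span class='ocrx_word' title='bbox 100 100 200 130; x_wconf 96'>Gallia</span>"
    "<span class='ocrx_word' title='bbox 5 5 20 18; x_wconf 31'>~;</span>"
)
print(words_inside(sample, (50, 50, 1000, 1000)))
```

Words falling outside the chosen region (the stray mark in the sample) are the candidates to review or drop before merging the hOCR with the original image.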
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Here's what happens if I try to open a Tesseract-made PDF in Acrobat. 
>>>>>> Like you mentioned, it opens just fine, but when the 'Make OCR Visible' 
>>>>>> option is enabled, all of the text turns into black boxes (it's not an 
>>>>>> issue of redaction). My understanding is that because of the lack of any 
>>>>>> embedded font information in the file, Acrobat can't make sense of the 
>>>>>> text layer because there are no associated glyphs to present on screen. 
>>>>>> Tesseract PDFs won't open in Abbyy OCR Editor or FineReader at all, I'm 
>>>>>> guessing for the same reason.
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Thanks for reading. I'll look further into hocr editing tools. I'm 
>>>>>> hoping other institutions can share their procedures for similar 
>>>>>> projects.
>>>>>>
>>>>>>  
>>>>>>
>>>>>> All the best,
>>>>>>
>>>>>>  
>>>>>>
>>>>>>  
>>>>>>
>>>>>> On Sat, Mar 9, 2024 at 12:52 PM Zdenko Podobny <zde...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>> " there's no way to use an off-the-shelf text editor with a glyphless 
>>>>>> font."
>>>>>>
>>>>>> I converted  
>>>>>> https://github.com/tesseract-ocr/test/blob/main/testing/8087_054.3B.tif 
>>>>>> to pdf
>>>>>>
>>>>>> tesseract 8087_054.3B.tif 8087_054.3B pdf
>>>>>>
>>>>>>  
>>>>>>
>>>>>> I could open 8087_054.3B.pdf without a problem in Adobe Acrobat Pro 
>>>>>> Version 2023.008.20555 64 bit (on Windows 11)
>>>>>>
>>>>>> However, it seems that it ignored the Tesseract text layer and ran its 
>>>>>> own text recognition (including font identification).
>>>>>>
>>>>>>  
>>>>>>
>>>>>> I tried opening 8087_054.3B.pdf at 
>>>>>> https://www.pdffiller.com/jsfiller-desk14/?flat_pdf_quality and I can 
>>>>>> modify the text:
>>>>>>
>>>>>>  
>>>>>>
>>>>>>  
>>>>>>
>>>>>> Also https://tinywow.com/pdf/edit seems to work:
>>>>>>
>>>>>>  
>>>>>>
>>>>>>  
>>>>>>
>>>>>> IMO, if a PDF tool offers text editing, it should work with Tesseract 
>>>>>> output too.
>>>>>>
>>>>>>  
>>>>>>
>>>>>> BR,
>>>>>>
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>  
>>>>>>
>>>>>>  
>>>>>>
>>>>>> On Fri, 8 Mar 2024 at 20:24, Mark Pellegrino <mar...@gmail.com> wrote:
>>>>>>
>>>>>> Thank you Merlijn, this is very helpful.  I'm very interested in IA's 
>>>>>> process so I'll have a deep dive through those tools.  This confirms my 
>>>>>> suspicions that there's no way to use an off-the-shelf text editor with 
>>>>>> a glyphless font. I'll explore these hOCR editor options. All the best,
>>>>>>
>>>>>>  
>>>>>>
>>>>>> On Fri, Mar 8, 2024 at 7:03 AM Merlijn B.W. Wajer <mer...@archive.org> 
>>>>>> wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> On 07/03/2024 20:53, Mark Pellegrino wrote:
>>>>>> > I found more info here:
>>>>>> > 
>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1769#issuecomment-509490277
>>>>>> > 
>>>>>> > Glyphless appears to be an 'invisible font' and all that Tesseract 
>>>>>> > supports. It seems like the solution is to use Tesseract to generate 
>>>>>> > hOCR, then use another tool to combine the source image with the hOCR?
>>>>>> > 
>>>>>> > Does anyone have a simple workflow for editing/correcting Tesseract 
>>>>>> > OCR documents that they can share?
>>>>>>
>>>>>> If you're looking to do OCR and PDF generation separately, you might 
>>>>>> want to look into the Internet Archive's PDF generation tooling, which 
>>>>>> is designed to do exactly this (plus some aggressive compression): 
>>>>>> https://github.com/internetarchive/archive-pdf-tools (disclaimer: I'm 
>>>>>> the author of the tooling)
>>>>>>
>>>>>> As for viewing and editing hOCR, there are a lot of different tools 
>>>>>> around, not all fully functional (I haven't tried most of these):
>>>>>>
>>>>>> * https://www.not-implemented.de/hocr-proofreader/
>>>>>> * https://github.com/kba/hocrjs
>>>>>> * https://github.com/GeReV/hocr-editor-ts / 
>>>>>> https://github.com/GeReV/HocrEditor
>>>>>>
>>>>>> There are also some GUI tools that I recall for editing hOCR, but they 
>>>>>> might require you to convert to another format first.
>>>>>>
>>>>>> Regards,
>>>>>> Merlijn
>>>>>>
>>>>>>
>>>>>> > 
>>>>>> > Thanks again,
>>>>>> > 
>>>>>> > On Thursday 7 March 2024 at 14:17:28 UTC-5 Mark Pellegrino wrote:
>>>>>> > 
>>>>>> >     Hello,
>>>>>> >     I'm trying to check PDFs made with Tesseract 5.2 for correctness
>>>>>> >     using an OCR editor but am unable to open them in either Abbyy or
>>>>>> >     Acrobat.
>>>>>> > 
>>>>>> >     If I try to open a Tesseract PDF with Abbyy FineReader/OCR Editor,
>>>>>> >     the software just hangs and crashes. I can open Tesseract PDFs with
>>>>>> >     Acrobat Pro, but when I enable the 'Make OCR text visible' option
>>>>>> >     in Preflight, all of the text layer turns into unreadable black
>>>>>> >     boxes. The font used shows as 'GlyphLessFont' and appears to be
>>>>>> >     embedded in the file.
>>>>>> > 
>>>>>> >     It doesn't matter what training data I use, or what the source image
>>>>>> >     was, I always get these results. Any other non-Tesseract-made PDF
>>>>>> >     works just fine. I'm guessing that the issue is a missing font? I
>>>>>> >     don't have much of an understanding of how embedded PDF fonts
>>>>>> >     work and I haven't found anything about this in the Tesseract docs.
>>>>>> >     Can someone please point me in the right direction? Thanks.
>>>>>> > 
>>>>>> > 
>>>>>> > -- 
>>>>>> > You received this message because you are subscribed to the Google 
>>>>>> > Groups "tesseract-ocr" group.
>>>>>> > To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send 
>>>>>> > an email to tesseract-oc...@googlegroups.com 
>>>>>> > <mailto:tesseract-oc...@googlegroups.com>.
>>>>>> > To view this discussion on the web visit 
>>>>>> > 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com
>>>>>>  
>>>>>> <
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b43c0ea6-fd81-49af-b74f-e93b0a682574n%40googlegroups.com?utm_medium=email&utm_source=footer
>>>>>> >.
>>>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/df501350-63de-4984-8a86-7831ccfb1477n%40googlegroups.com.
