[tesseract-ocr] Using Tesseract 5.5.0 to recognize source code, but need a way to maintain original indentation.

Jay S Tue, 22 Apr 2025 20:34:32 -0700

I'm using PowerShell 7 right now to automate Tesseract. 

This is my input image:

[image: Code.png]

I'm able to accurately get the code recognized using this:

& tesseract.exe "D:\Dev\OCR\Images\Code.png" stdout --psm 3 --oem 1 -l eng

*The above command outputs:*

[image: Code_2ABLEus0Tk.png]
That is great, but I have no idea how to reconstruct the original
indentation of the code from the input image.

Questions:
1. Is there a known and simple process that I can follow to reconstruct the
indentation?
2. Adding -c preserve_interword_spaces=1 doesn't seem to change the output.
3. Would tsv or hocr output be applicable here? And if so, which format
would be the best for this task?
4. It seems that hocr is essentially HTML with bounding box information. Is
there a way to convert that into the original indentation from the image?
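
A possible starting point for questions 3 and 4, as a minimal sketch rather than a known-good solution: the TSV output lists each word with its left, top, and width in pixels, so the left edge of the first word on each line can be converted into a count of leading spaces. The Python below assumes a monospaced font in the image, estimates the character width as the median of word width divided by word length, and treats the leftmost line as column 0. The function name reconstruct_indentation is hypothetical, and the heuristic is an assumption, not how Tesseract or Pieces does it.

```python
import csv
import io
from statistics import median


def reconstruct_indentation(tsv_text: str) -> str:
    """Rebuild leading whitespace from Tesseract TSV word boxes (heuristic)."""
    # QUOTE_NONE keeps quote characters inside recognized code intact.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )

    # Group word rows (level 5) by the text line they belong to.
    lines = {}
    for row in reader:
        text = row.get("text") or ""
        if row.get("level") != "5" or not text.strip():
            continue
        key = (
            int(row["page_num"]),
            int(row["block_num"]),
            int(row["par_num"]),
            int(row["line_num"]),
        )
        lines.setdefault(key, []).append(
            (int(row["left"]), int(row["top"]), int(row["width"]), text)
        )
    if not lines:
        return ""

    # Assumption: monospaced font, so width / len(text) approximates one column.
    char_width = median(
        width / len(text)
        for words in lines.values()
        for (_, _, width, text) in words
    )

    # Order lines top to bottom within each page (assumes a single column of code).
    ordered = sorted(
        (sorted(words) for words in lines.values()),
        key=lambda words: min(top for (_, top, _, _) in words),
    )
    ordered_keys = sorted(
        lines, key=lambda k: (k[0], min(top for (_, top, _, _) in lines[k]))
    )
    ordered = [sorted(lines[k]) for k in ordered_keys]

    # The leftmost line start is treated as column 0.
    min_left = min(words[0][0] for words in ordered)

    out = []
    for words in ordered:
        indent = round((words[0][0] - min_left) / char_width)
        out.append(" " * indent + " ".join(text for (_, _, _, text) in words))
    return "\n".join(out)
```

The TSV text can be produced by appending the tsv config name to the command above (for example, ending it with -l eng tsv). The result depends on the word bounding boxes being reasonably tight, so the recovered indentation should be treated as approximate.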

Has anyone here arrived at a workable solution for extracting code from an
image and keeping its alignment?

There is an AI-powered app called Pieces (https://pieces.app/) that seems
to do this perfectly. I dug into their source code and found references to
Tesseract, so I think they are using it under the hood for OCR, but I have
no clue how they reconstruct the indentation.

Any help or direction would be greatly appreciated.

