I'm using PowerShell 7 to automate Tesseract.

This is my input image:

[image: Code.png]

I'm able to accurately get the code recognized using this:

& tesseract.exe "D:\Dev\OCR\Images\Code.png" stdout --psm 3 --oem 1 -l eng

*The above command outputs:*

[image: Code_2ABLEus0Tk.png]
That is great, but I have no idea how to reconstruct the original 
indentation of the code from the input image.

Questions:
1. Is there a known and simple process that I can follow to reconstruct the 
indentation?
2. Adding -c preserve_interword_spaces=1 doesn't seem to change the output. 
Is it expected to help here?
3. Would tsv or hocr output be applicable here? And if so, which format 
would be the best for this task?
4. It seems that hocr is generally HTML with bounding box information... is 
there a way to convert this to the original indentation from the image 
somehow?
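
To make question 4 concrete, here is a rough sketch of the kind of conversion I have in mind, written in Python rather than PowerShell for brevity. It assumes the TSV output (for example from adding tsv to the end of the tesseract command), whose columns include level, line_num, left, width and text, and it assumes the code in the image is set in a roughly monospaced font. The function name reconstruct_indentation and the median-based character-width estimate are my own guesses, not anything Tesseract provides:

```python
import csv
import io
from statistics import median


def reconstruct_indentation(tsv_text):
    """Heuristically rebuild leading spaces from Tesseract TSV word boxes."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}  # insertion-ordered: one entry per recognized text line
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels carry no text.
        if row.get("level") != "5" or not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append((int(row["left"]), int(row["width"]), text))
    if not lines:
        return []

    for words in lines.values():
        words.sort()  # order the words of each line left to right

    # Assumption: monospaced font, so box width / word length ~ one character.
    widths = [w / len(t) for words in lines.values() for _, w, t in words if w > 0]
    char_width = median(widths) if widths else 1.0

    # The leftmost line start is treated as column 0.
    left_margin = min(words[0][0] for words in lines.values())

    result = []
    for words in lines.values():
        indent = round((words[0][0] - left_margin) / char_width)
        result.append(" " * indent + " ".join(t for _, _, t in words))
    return result
```

This only recovers the leading indentation of each line; spacing between words is collapsed to a single space, and blank lines in the image are not preserved. I don't know whether this is anywhere near what a robust solution would do.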

Has anyone here arrived at a workable solution for extracting code from an 
image and keeping its alignment?

There is an AI-powered app called Pieces (https://pieces.app/) that seems 
to do this perfectly.
I dug into their source code and found references to tesseract, so I think 
they are using it under the hood for OCR. But I have no clue how they are 
reconstructing the indentation.

Any help or direction would be greatly appreciated.
