I'm using PowerShell 7 right now to automate Tesseract. This is my input image:
[image: Code.png]

I'm able to get the code recognized accurately using this:

    & tesseract.exe "D:\Dev\OCR\Images\Code.png" stdout --psm 3 --oem 1 -l eng

*The above command outputs:*

[image: Code_2ABLEus0Tk.png]

Which is great. But I have no idea how to reconstruct the original indentation of the code from the input image.

Questions:

1. Is there a known, simple process that I can follow to reconstruct the indentation?
2. -c preserve_interword_spaces=1 doesn't seem to do anything.
3. Would tsv or hocr output be applicable here? If so, which format would be best for this task?
4. It seems that hocr is generally HTML with bounding box information. Is there a way to convert this back to the original indentation from the image?

Has anyone here arrived at a workable solution for extracting code from an image and keeping its alignment?

There is an AI-powered app called Pieces that seems to do this perfectly (https://pieces.app/). I dug into their source code and found references to tesseract, so I think they are using it under the hood for OCR. But I have no clue how they are reconstructing the indentation.

Any help or direction would be greatly appreciated.
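
For what it's worth, here is a rough sketch of one way the tsv route from question 3 might be used. It is an illustration under assumptions, not a known recipe and not how Pieces does it. It assumes the TSV text was captured by appending the `tsv` config name to the command above, that the image shows a monospaced font on a single page, and that the median of (word width / character count) is a usable estimate of one character cell. The function name `reconstruct_layout` and the sample data are made up for the example.

```python
import csv
import io
from statistics import median


def reconstruct_layout(tsv_text: str) -> str:
    """Rebuild column alignment from Tesseract TSV word boxes (heuristic)."""
    # QUOTE_NONE: OCR'd code often contains quote characters in the text column.
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                          quoting=csv.QUOTE_NONE)
    words = []
    for r in rows:
        text = (r["text"] or "").strip()
        if r["level"] != "5" or not text:  # level 5 rows are individual words
            continue
        words.append({
            "line": (int(r["page_num"]), int(r["block_num"]),
                     int(r["par_num"]), int(r["line_num"])),
            "left": int(r["left"]),
            "top": int(r["top"]),
            "width": int(r["width"]),
            "text": text,
        })
    if not words:
        return ""

    # Assumption: monospaced font, so width / length approximates one cell.
    char_w = median(w["width"] / len(w["text"]) for w in words)
    origin = min(w["left"] for w in words)  # leftmost word defines column 0

    lines = {}
    for w in words:
        lines.setdefault(w["line"], []).append(w)

    out = []
    # Order lines by page, then by vertical position on the page.
    for ws in sorted(lines.values(),
                     key=lambda ws: (ws[0]["line"][0], min(w["top"] for w in ws))):
        buf = ""
        for w in sorted(ws, key=lambda w: w["left"]):
            col = round((w["left"] - origin) / char_w)
            if buf and col <= len(buf):
                col = len(buf) + 1  # always keep at least one separating space
            buf = buf.ljust(col) + w["text"]
        out.append(buf)
    return "\n".join(out)


# Hand-made sample in Tesseract's TSV column layout (10 px per character).
HEADER = ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"]
SAMPLE_ROWS = [
    [4, 1, 1, 1, 1, 0, 100, 10, 100, 20, -1, ""],
    [5, 1, 1, 1, 1, 1, 100, 10, 30, 20, 96, "def"],
    [5, 1, 1, 1, 1, 2, 140, 10, 60, 20, 95, "foo():"],
    [4, 1, 1, 1, 2, 0, 140, 40, 80, 20, -1, ""],
    [5, 1, 1, 1, 2, 1, 140, 40, 60, 20, 94, "return"],
    [5, 1, 1, 1, 2, 2, 210, 40, 10, 20, 93, "1"],
]
SAMPLE_TSV = "\n".join("\t".join(str(c) for c in row)
                       for row in [HEADER] + SAMPLE_ROWS)

if __name__ == "__main__":
    print(reconstruct_layout(SAMPLE_TSV))
```

On the hand-made sample, the second line comes back indented by four spaces because its first word starts 40 px to the right of the leftmost word and one character is estimated at 10 px. Real screenshots will be noisier, so snapping the computed columns to a multiple of the expected indent width may be needed.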