OCR of source code with tesseract is a problem:
- tesseract is not focused on keeping spaces/indentation - you have to
reconstruct it by yourself (e.g. by parsing hOCR output)
- tesseract is focused more on "real" text, while source code is more
symbolic, with a lot of extra characters.
Here is some simple code that works for me (with tesseract 5 and leptonica
1.82):
#include
#include
#include
#include
int main() {
const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
std::string language_ = "eng";
std::string inputFile_ = "input.png";
const char*
This is my old snippet, so part of the code (opening the input image as PIX)
is useless for PDF rendering.
Zdenko
On Mon, 22 Nov 2021 at 14:28, Zdenko Podobny wrote:
> Here is a simple code, that works for me (with tesseract 5 and leptonica
> 1.82)
>
> #include
> #include
> #include
> #inclu
Can this code read text?
On Mon, 22 Nov 2021, 21:28 Zdenko Podobny wrote:
> Here is a simple code, that works for me (with tesseract 5 and leptonica
> 1.82)
>
> #include
> #include
> #include
> #include
>
> int main() {
> const char* datapath = "f:/Project-Personal/tessdata_best/tessdat
I do not understand your question: how is it related to the topic under
discussion?
Zdenko
On Mon, 22 Nov 2021 at 14:34, Sarah Jane CHANNEL wrote:
> this code can read text?
>
> On Mon, 22 Nov 2021, 21:28 Zdenko Podobny, wrote:
>
>> Here is a simple code, that works for me (with tesseract 5 and lepto
It works!
I tried tesseract-ocr 5.0.0 RC2 + leptonica 1.8.2, and with that combination
both my code and yours worked flawlessly. It seems like 4.1.3 has a bug in it
that has been fixed in 4.1.3. I hadn't tested 5.0 before, because I thought it
would be more unstable.
I additionally tested 4.1.3 + leptonica 1.8.2 (was on 1.
Hey zdenop,
it turns out I can't rely on 5.0.0, because OpenCV seems to be compatible
only with 4.x so far. (OpenCV is another requirement of my project.)
Does your script from above work on tesseract 4.x for you?
On Monday, 22 November 2021 at 18:51:38 UTC+1, blaumedia wrote:
> It works!
>
> I tr
Thanks a lot Zdenko, I am disappointed but that's life :-(
On Monday, 22 November 2021 at 12:42:23 UTC+1, zdenop wrote:
> OCR of source code with tesseract is a problem:
>
>- tesseract is not focused on keeping spaces/indentation - you have to
>reconstruct it by yourself (e.g. by parsin
Hello,
yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR
there has been no change in the behaviour of the renderers (including
TessPDFRenderer) since the 4.0-beta version.
Also, I do not understand your problem with OpenCV: AFAIK tesseract is only
an optional dependency and it uses only very