[tesseract-ocr] Re: Chinese characters.

ziyan xu Thu, 18 Jul 2024 10:32:22 -0700

Hello, may I ask which version you are using? Would you mind sharing your chi_sim and chi_sim_vert files?


On Sunday, March 17, 2024 at 00:41:13 UTC+8, <j.w.p...@gmail.com> wrote:

> Hello, 
>
> I am making a transcript of YouTube videos using Tesseract. 
> The images I feed to Tesseract look like this:
> [image: aftercut29.0.jpg]
>
> The output is mostly correct, but sometimes the same character is 
> recognized differently from one image to the next.
> Example: 
> Input:
> [image: aftercut3.0.jpg]
> Output: 大*叔*中文 - CORRECT
>
> Input:
> [image: aftercut10.5.jpg] 
> Output: 今天不是3位 大*档* - INCORRECT
>
> To prepare the images I use:
>
>    - *dilation*,
>    - *cropping* to the area of the image containing characters,
>    - adding *borders*.
>
>  For dilation I use a 2x2 kernel, and the border is 2 px thick.
>  For the page segmentation mode I am currently experimenting with *--psm 7* 
> and *--psm 13*. --psm 7 seems to give slightly better results. The 
> language is, of course, *lang='chi_sim'*.
>
> Could you give me any advice on how to improve the robustness of the output?
>
> Thank you in advance,
> Jan
>
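
The three preparation steps described in the quoted message (dilation with a 2x2 kernel, cropping to the characters, a 2 px border) can be sketched roughly as follows. This is only an illustration in pure Python on a tiny binary image where 1 means ink and 0 means background. The function names are made up for this sketch and are not part of Tesseract. A real pipeline would more likely use OpenCV (cv2.dilate, cv2.copyMakeBorder). One thing worth checking: dilation grows the bright pixels, so on dark text over a light background it thins the strokes unless the image is inverted first.

```python
# Illustrative sketch only: images are lists of rows, 1 = ink, 0 = background.
# All function names are hypothetical helpers, not a library API.

def dilate_2x2(img):
    """Grow ink with a 2x2 kernel: a pixel becomes ink if it, or its
    upper, left, or upper-left neighbour, is ink (one possible anchoring)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (0, -1):
                for dx in (0, -1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w and img[yy][xx]:
                        out[y][x] = 1
    return out


def crop_to_ink(img):
    """Crop to the bounding box of all ink pixels (no-op on a blank image)."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return img
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def add_border(img, width=2):
    """Pad the image with `width` background pixels on every side."""
    full_width = len(img[0]) + 2 * width
    top = [[0] * full_width for _ in range(width)]
    bottom = [[0] * full_width for _ in range(width)]
    body = [[0] * width + list(row) + [0] * width for row in img]
    return top + body + bottom


def preprocess(img):
    """Apply the steps in the order listed: dilate, crop, add a 2 px border."""
    return add_border(crop_to_ink(dilate_2x2(img)), width=2)
```

Calling preprocess() on a 4x4 image with a single ink pixel gives a 6x6 image with a 2x2 block of ink in the middle, which makes it easy to see what each step contributes before trying it on real frames.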

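To compare the two page segmentation modes mentioned in the quoted message on the same image, something like the sketch below might help. It assumes the tesseract command-line tool is installed and on PATH with the chi_sim language data available. The helper names and the file name frame.jpg are placeholders invented for this example. Mode 7 treats the image as a single text line, and mode 13 treats it as a raw line, bypassing some Tesseract-specific processing.

```python
import subprocess


def tesseract_cmd(image_path, psm, lang="chi_sim"):
    """Build the CLI call: tesseract <image> stdout -l <lang> --psm <N>."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, psm, lang="chi_sim"):
    """Run Tesseract (assumed to be installed) and return the recognized text."""
    result = subprocess.run(
        tesseract_cmd(image_path, psm, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()


def compare_modes(image_path, modes=(7, 13), ocr=run_ocr):
    """Return {psm: text} so disagreements between modes are easy to spot."""
    return {psm: ocr(image_path, psm) for psm in modes}


def modes_agree(results):
    """True if every segmentation mode produced the same text."""
    return len(set(results.values())) == 1
```

For example, compare_modes("frame.jpg") returns a dictionary keyed by mode number. Frames where modes_agree() is False could be flagged for manual review, although agreement between the two modes does not guarantee the text is correct.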