I used Ghostview to convert the PDF to TIFF or PNG.
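
For anyone who wants to script the PDF-to-image step, here is a minimal sketch that builds a Ghostscript command line from Python. It assumes the `gs` executable is on the PATH (on Windows the console binary is usually named `gswin64c`). The file names and the 300 dpi resolution are illustrative choices, not the settings used in this thread.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern, dpi=300, device="png16m", gs="gs"):
    """Build a Ghostscript command that renders each PDF page to an image.

    device: "png16m" (24-bit PNG), "pnggray" (grayscale PNG) and
    "tiff24nc" (24-bit TIFF) are standard Ghostscript output devices.
    out_pattern: e.g. "page-%03d.png"; %03d is replaced by the page number.
    """
    return [
        gs,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run non-interactively
        f"-sDEVICE={device}",                # output format
        f"-r{dpi}",                          # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def pdf_to_images(pdf_path, out_pattern, **kwargs):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_command(pdf_path, out_pattern, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(ghostscript_command("sample.pdf", "page-%03d.png")))
```

Scanned book pages are often rendered at 300 dpi or more for OCR, but the best value depends on the source, so treat the default above as a starting point.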

You can OCR the PDF directly with gImageReader, using the traineddata file I
sent.
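
If you prefer the command line to gImageReader, the same traineddata file can be passed to tesseract directly (on page images, since the tesseract command itself takes images rather than PDFs). A rough sketch, assuming tesseract 4 is installed and on the PATH. The language name `iast` and the folder `./tessdata` are placeholders for wherever you saved the attached file; the language name is the file name without the `.traineddata` extension.

```python
import subprocess


def tesseract_command(image_path, output_base, lang, tessdata_dir, psm=3):
    """Build a tesseract command line that uses a custom traineddata file.

    lang: traineddata file name without the ".traineddata" extension.
    tessdata_dir: folder that contains that traineddata file.
    psm: page segmentation mode; 3 is tesseract's default (automatic).
    """
    return [
        "tesseract",
        image_path,
        output_base,                    # tesseract writes output_base + ".txt"
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--psm", str(psm),
    ]


def ocr_image(image_path, output_base, lang, tessdata_dir, psm=3):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    cmd = tesseract_command(image_path, output_base, lang, tessdata_dir, psm)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names: adjust to your own image and traineddata location.
    print(" ".join(tesseract_command("page-001.png", "page-001", "iast", "./tessdata")))
```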

See the links to the new Windows binaries in the message below.


At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd also be interested in testing of the tessdata manager, which should now
also handle script tessdata files properly.

On Tue 26 Jun, 2018, 10:59 PM yajva, <nsvnarasi...@gmail.com> wrote:

> This doc is a different version of the same text. Here's the doc used for the
> first PNG. It is slightly darker, but the one sent earlier is cleaner. Let me
> know which is more amenable to OCR. I use PDF Shaper to extract the images
> and convert them to PNG using XnView.
>
> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>
>> The traineddata file is attached, for use with Tesseract 4.0.0-beta.
>>
>> How did you create the test PNG from the PDF? I am not getting quality that
>> good, even after trying various settings with IrfanView.
>>
>>
>>
>> On Tue, Jun 26, 2018 at 4:58 PM yajva <nsvnar...@gmail.com> wrote:
>>
>>> Sorry for the delay, my system was down.
>>>
>>> I am getting "Page not Found" for the link given. Can you please re-check?
>>>
>>> Here's the doc I am trying to OCR.
>>>
>>>
>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>
>>>> Please test with traineddata file from
>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>
>>>> I need to check that it is not overfitted.
>>>>
>>>> Please share a couple more images which I can use for testing.
>>>>
>>>>
>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <nsvnar...@gmail.com> wrote:
>>>>
>>>>> one more correction.
>>>>>
>>>>>
>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>
>>>>>> done
>>>>>>
>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> I am attaching the OCRed text. Please correct it so that I can use it
>>>>>>> as ground truth for further training and testing.
>>>>>>>
>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <shree...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I had done a training for Sanskrit for both Devanagari and IAST, but
>>>>>>>> it does not include the cedilla for Sh.
>>>>>>>>
>>>>>>>> I will add it and let you know.
>>>>>>>>
>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <nsvnar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman
>>>>>>>>> script with diacritics (IAST). It recognizes the macron above but
>>>>>>>>> not the dots below, nor the joining grave and accent marks. Is there
>>>>>>>>> any traineddata available for tesseract that can do this with good
>>>>>>>>> accuracy? I have attached a sample page that I am interested in.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>
