On Apr 19, 2009, at 5:16 AM, Debayan Banerjee wrote:

> I take the liberty of top posting since I copied the mail's contents
> from the archives, and bottom posting would require messing with the
> text below too much. In reply to this particular line:
> " It takes the old "matra removal" approach, and he's
> facing the same problems I did (notice in his first example that গ is
> segmented into 2 parts, and শু is not)."
>
> Kindly see
> http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690
>
> Below is the original conversation.
>
> On 7/2/08, Golam Mortuza Hossain <[EMAIL PROTECTED]> wrote:
>> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta <[EMAIL PROTECTED]>
>>
>>> This guy seems to be making some interesting progress on a Bangla OCR
>>> - or more precisely, enabling Bangla in Tesseract.
>>> http://debayanin.googlepages.com/hackingtesseract
>
> Cool. I had some interaction with the tesseract/ocropus folks, and it
> sounded like a good base. It's nice that someone's actually doing
> something with it. It takes the old "matra removal" approach, and he's
> facing the same problems I did (notice in his first example that গ is
> segmented into 2 parts, and শু is not). On the other hand, having
> something that works even partly is a good start.
>
>> Yes, it definitely looks interesting.
>>
>>> Looks like he needs some more training data - can we provide him
>>> with some?
>>
>> If I remember correctly, there was a sample file for testing the
>> completeness of Bengali fonts. Since it has all letters and conjuncts
>> typed in, the file might be useful for training Tesseract as well.
>>
>> Deepayan should be able to give some input here. He has working
>> experience with R and may have some training samples as well.
>
> Well, we have a bunch of Unicode documents. For some of them, I have
> print versions too, and can scan them if needed. A simpler approach
> would be to render them using different fonts and take screenshots.
>
> Apparently he also needs some box-files, whatever they are, which need
> to be produced using tesseract. I haven't installed tesseract yet, and
> will try, but let me know if anyone else manages.
>
> -Deepayan
>
Dear all,

I have been working on OCR for my university. I took most of the ideas
from bocra.sourceforge.net. It is written using the GraphicsMagick
library and C++. Any suggestions from you about matching the alphabet
would be welcome.

Here is my progress:
http://picasaweb.google.com/salahuddin66/OCR#

Regards,
salahuddin
salahuddin66.blogspot.com

> --
> Be Intelligent, Use GNU/Linux
>
> http://debayanin.googlepages.com/
> http://debayan.wordpress.com
> http://lug.nitdgp.ac.in

_______________________________________________
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core