In the context of cTAKES, I am not sure. But recently I have been using Google's text recognition services, and the results (on printed text) are really good. Maybe you could try that.

Melvin
On Wed, Oct 18, 2017 at 10:19 PM, <abilash.mat...@cognizant.com> wrote:

> Sean,
>
> What accuracy do you get from OCR? We are at 60-70% accuracy.
> Most of the documents are 200 DPI ones. Also, are you using any other
> software, like Matlab, for OCR pre- or post-processing?
>
> Thanks,
> Abilash Mathew
>
> -----Original Message-----
> From: Mathew, Abilash (Cognizant)
> Sent: Monday, October 16, 2017 8:37 PM
> To: dev@ctakes.apache.org
> Subject: RE: OCR engine used [EXTERNAL]
>
> Thanks Sean for the quick reply and for providing the valuable information.
>
> Regards,
> Abilash Mathew
>
> -----Original Message-----
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Monday, October 16, 2017 8:17 PM
> To: dev@ctakes.apache.org
> Subject: RE: OCR engine used [EXTERNAL]
>
> Hi Abilash Mathew,
>
> I have only used Tesseract. Unfortunately, no OCR is perfect.
> I am by no means an expert on Tesseract, but perhaps I can help to get you
> started ...
>
> There are tricks that you can use to get it to work better with medical
> notes (besides training on fonts). Possibly the most effective is using a
> whitelist of desired characters using tessedit_char_whitelist and a series
> of characters that doesn't include things like hash, dollar, bar ...
> Another is to add a wordlist that contains words pertinent to your domain.
> See:
> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns
> https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data
> https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract
> https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg10100.html
>
> Good luck,
> Sean
>
> -----Original Message-----
> From: abilash.mat...@cognizant.com [mailto:abilash.mat...@cognizant.com]
> Sent: Monday, October 16, 2017 10:13 AM
> To: dev@ctakes.apache.org
> Subject: OCR engine used [EXTERNAL]
>
> Hi All,
>
> Can you suggest some OCR engines used for medical record text
> extraction from images? I am currently using Tesseract and seeing some
> text extraction quality issues.
>
> Thanks,
> Abilash Mathew
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient(s), please reply to the
> sender and destroy all copies of the original message. Any unauthorized
> review, use, disclosure, dissemination, forwarding, printing or copying of
> this email, and/or any action taken in reliance on the contents of this
> e-mail is strictly prohibited and may be unlawful. Where permitted by
> applicable law, this e-mail and other e-mail communications sent to and
> from Cognizant e-mail addresses may be monitored.
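
For anyone who wants to try the whitelist and wordlist tips from Sean's quoted message, here is a minimal sketch of how the Tesseract command line might be assembled from Python. It only builds the argument list around tessedit_char_whitelist and the --user-words option. The character set, the file names, and the helper names (build_tesseract_command, run_ocr) are assumptions made for illustration and do not come from the thread. How strictly the whitelist and word list are honoured may depend on the Tesseract version and engine mode, so treat this as a starting point to test against your own scans.

```python
import string
import subprocess

# Characters we expect in the notes. This exact set is an assumption;
# it deliberately leaves out symbols such as '#', '$' and '|'.
WHITELIST = string.ascii_letters + string.digits + ".,:;/()%+-'"


def build_tesseract_command(image_path, whitelist=WHITELIST, user_words=None):
    """Return an argv list for the tesseract CLI. Nothing is executed here."""
    # "stdout" as the output base asks tesseract to print the text
    # instead of writing it to a file.
    cmd = ["tesseract", image_path, "stdout"]
    if user_words:
        # Path to a plain-text file with one domain word per line.
        cmd += ["--user-words", user_words]
    # -c sets a config variable; here it restricts the recognized characters.
    cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract (it must be installed and on PATH) and return the text."""
    result = subprocess.run(
        build_tesseract_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A call such as run_ocr("note.png", user_words="medical_terms.txt") would then return the recognized text, where both file names are placeholders for your own scan and word list.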