Hi Djibril,

I am afraid this is an old thread, and Vinay may no longer be working with invoices. I am also interested in extracting information from invoices.

Have you tried using Tesseract with a user dictionary to improve accuracy? Invoices contain a fairly fixed set of field labels, so a custom word list may help. The manual describes this here: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data

Let me know if you get better results. I will do the same if I make progress.

Best,
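P.S. In case it helps, here is a minimal sketch of the dictionary idea in Python. Treat it as a starting point only: the word list is a hypothetical one built from the field names in Vinay's message, the helper names are mine, and it assumes a Tesseract build that accepts the `--user-words` option documented in the manual linked above (older releases such as 3.02 may need the config-file route described in the same manual). I have not verified that it improves accuracy on invoices.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical vocabulary: field labels mentioned in the original question.
INVOICE_WORDS = ["Advisor", "Invoice", "Number", "Date", "License", "Mileage"]


def write_user_words(words, path):
    """Write one word per line, the plain-text layout used for user word lists."""
    path = Path(path)
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


def build_command(image, output_base, user_words):
    """Build the tesseract command line without running it."""
    return [
        "tesseract", str(image), str(output_base),
        "--user-words", str(user_words),  # assumes a build that supports this option
    ]


def ocr_with_dictionary(image, words=INVOICE_WORDS):
    """Run tesseract on `image` with a user word list and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        words_file = write_user_words(words, Path(tmp) / "invoice.user-words")
        output_base = Path(tmp) / "out"
        subprocess.run(build_command(image, output_base, words_file), check=True)
        # tesseract appends ".txt" to the output base name
        return output_base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling `ocr_with_dictionary("invoice.png")` (the file name is just a placeholder) would return the recognized text, which you could then compare against a run without the word list.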
On Wednesday, December 6, 2017 at 8:32:50 PM UTC+1, Djibril Kaba wrote:
>
> Hi Vinay,
>
> I am trying to solve the same problem here. Have you managed to find a
> solution to your problem? Your help would be greatly appreciated. Looking
> forward to hearing from you.
>
> Many thanks!!
>
> On Tuesday, November 18, 2014 at 8:53:08 PM UTC+1, Vinay Matam wrote:
>>
>> Hi All,
>>
>> I really need your help with one of the projects that I am working on. I
>> am using Tesseract 3.02 on an Ubuntu machine.
>>
>> I have an invoice (please see the attached file). I want to extract some
>> information from that invoice, such as Advisor Name, Invoice Number,
>> Invoice Date, License No, Mileage, etc.
>>
>> I have tried to extract all the data from the image to a text file. By
>> pre-processing the image with ImageMagick, I was able to extract the
>> information to some extent. However, I am not fully satisfied with the
>> output.
>>
>> I need your input on how I should extract the information. Should I first
>> crop specific portions of the image into separate rectangles and then OCR
>> them individually? I tried this and got great results. But the images are
>> not all the same size or resolution, so the rectangle coordinates will not
>> work in every case. I do not think this method will work on all images
>> (scanned, taken with a mobile phone, or PDF files).
>>
>> Then I thought of running regular expressions over the extracted text and
>> picking out the data I need from the whole text file. But this method also
>> does not seem to be working.
>>
>> I am totally confused now. Any help or input is much appreciated. :)
>> I have attached a sample image and the extracted output.
>>
>> Thanks,
>> Vinay.
>>
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.