Dear NGuyenQ,

> From the page http://www.pixel-technology.com/freeware/tessnet2/
> tessnet2.Tesseract ocr = new tessnet2.Tesseract();
> ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only

This is brilliant advice you just gave him. It is very effective: I just tested it on a document with only digits and a few special characters. Since I'm working with C++ only (no .NET wrapper), here is what I recommend:
    // Init your tess API.
    _tessApi = new tesseract::TessBaseAPI();

    // Set up the current directory and language prefix.
    _tessApi->Init("./", "cst");

    // This is only important if you'll be parsing pictures with only one
    // line of text (which is my case).
    _tessApi->SetPageSegMode(tesseract::PSM_SINGLE_LINE);

    // Here is the trick as explained and pointed out by NGuyenQ:
    _tessApi->SetVariable("tessedit_char_whitelist", "<0123456789");

    // Then, in a loop for each of my documents, here is the idea:
    PIX *pix = pixReadMemTiff((const l_uint8*)buffer.buffer().constData(), buffer.size(), 0);
    _tessApi->SetImage(pix);
    // Run recognition; the returned buffer must be freed with delete [].
    char *text = _tessApi->GetUTF8Text();
    doc.setRecognizedData("OCRLine", QString(text).trimmed());
    delete [] text;
    // pixDestroy() frees the image and nulls the pointer, so no "delete pix".
    pixDestroy(&pix);

    // Release everything.
    _tessApi->Clear();
    _tessApi->End();
    delete _tessApi;

The very interesting part is that before, I was getting "D" and "O" instead of zeros, sometimes even "A" for "4", and "[]" and "[)" instead of zeros, despite my disambiguation file. Now I'm getting everything correct, which means the whitelist / blacklist are not just post-processing filters, but real "recognition clues".

I recommend everyone take note (well... I'm discovering this feature and its real consequences, maybe you're not :D).

Pierre.
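
To make the "recognition clue" versus "post-processing filter" distinction concrete, here is a toy sketch. It is not Tesseract's actual classifier: the Candidate struct, the scores, and the function names postFilter, constrainedPick, sampleGlyphs and demo are all made up for illustration. It only shows why restricting the candidates before picking a winner can turn "A" and "O" into "4" and "0", whereas filtering afterwards could only drop them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One possible reading of a glyph, with a hypothetical classifier score.
struct Candidate {
    char symbol;
    double score;
};

using Glyphs = std::vector<std::vector<Candidate>>;

// Post-processing filter: pick the top-scoring candidate first, then drop it
// if it is not in the whitelist.
std::string postFilter(const Glyphs& glyphs, const std::string& whitelist) {
    std::string out;
    for (const auto& candidates : glyphs) {
        const Candidate* best = nullptr;
        for (const auto& c : candidates)
            if (!best || c.score > best->score) best = &c;
        if (best && whitelist.find(best->symbol) != std::string::npos)
            out += best->symbol;
    }
    return out;
}

// Recognition clue: only whitelisted candidates are allowed to compete.
std::string constrainedPick(const Glyphs& glyphs, const std::string& whitelist) {
    std::string out;
    for (const auto& candidates : glyphs) {
        const Candidate* best = nullptr;
        for (const auto& c : candidates) {
            if (whitelist.find(c.symbol) == std::string::npos) continue;
            if (!best || c.score > best->score) best = &c;
        }
        if (best) out += best->symbol;
    }
    return out;
}

// Invented scores for an image that really reads "404".
Glyphs sampleGlyphs() {
    return {
        {{'A', 0.55}, {'4', 0.45}},
        {{'O', 0.50}, {'0', 0.40}, {'D', 0.10}},
        {{'4', 0.90}, {'A', 0.10}},
    };
}

// Usage: the two strategies disagree on the same input.
void demo() {
    const std::string whitelist = "<0123456789";
    assert(postFilter(sampleGlyphs(), whitelist) == "4");
    assert(constrainedPick(sampleGlyphs(), whitelist) == "404");
}
```

With these invented scores, filtering after the fact keeps only the last "4", while constraining the choice recovers all three characters. That matches what I saw once the whitelist was set.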