Hi:

This is slightly long. I am posting this question on this list as there seem to be people of diverse backgrounds here, and hopefully someone can come up with an idea, or a solution!

This is for a legislative automation project. We are in the process of converting a corpus of legislative acts and bills into digital format.

We want to scan all the documents into TIFF format. The TIFF image will then be OCRed, and after a series of verifications aimed at increasing OCR accuracy, we will end up with a bi-tonal (JBIG2) searchable-image PDF. Hopefully at the end of that, the PDF will show and print the text of the document as a 100% true copy of the paper document.

The main problem I can't get a handle on is ascertaining the accuracy of the OCR data:

Manual verification seems to be what many of the outsourcing companies do, but we don't want to go this way as it is quite expensive (and I am talking of digitizing legislative acts for over 5 parliaments). To get better accuracy you need to do a triple-compare (i.e., keying in the same information three times). I have been trying to look for a "brighter" solution along the lines of:

a) We have a TIFF image of the document, scanned and virtually re-scanned (a technology that cleans up crimps, creases and ink blotches on the scanned image). This is a true 100% copy of the document.

b) Then, after the OCR process on the TIFF, we have a text-over-image PDF, which is searchable for the text in the document and also highlights the text (like Google search highlights) within the PDF document.

The question is: how do we verify the accuracy of this text-over-image PDF (b) against the 100% true TIFF created in (a)?

Clearly a plain image comparison will generate a huge number of differences, so is there a way to do it more smartly?

One possible solution is to use the JBIG2 format (used by Adobe to compress and store text images inside a PDF). It seems that this format actually builds a table of letter shapes, storing each shape only once per page. So we would like to build a verification process that takes the original data image (a) and compares it to the optically recognized data that has been transformed back into an image (b). The application should somehow compare the "patterns"/"shapes" (thereby eliminating the "noise" of a pixel-by-pixel kind of comparison), pointing out the parts that show a mismatch, for visual interpretation and verification by experienced operators.

Does this make any sense?

If someone has domain experience in doing something like this, or has the fundamentals of software which can do this, please get in touch with me!


Thanks,

Ashok
