Hi,
 
I am involved in a project which is trying to provide searching and hit 
highlighting on the scanned image of historical newspapers.  We have an XML 
based OCR format.  A sample is below.  We need to index the CONTENT attribute 
of the String element which is the easy part.  We would like to be able find 
the "hits" within this XML document in order to use the positioning information 
to draw the highlight boxes on the image.  It doesn't make a lot of sense to 
just extract the CONTENT and index that because we loose the positioning 
information.  My second thought was to make a custom analyzer which dropped 
everything except for the content element and then used the highlighting class 
in the sandbox to reanalyze the XML document and mark the hits.  With the 
marked hits in the XML we could find the position information and draw on the 
image.  Has anyone else worked with OCR information and lucene.  What was your 
approach?  Does this approach seem sound?  Any recommendations?
 
Thanks, Corey
 
     <TextLine HEIGHT="2307.0" WIDTH="2284.0" HPOS="1316.0" VPOS="123644.0">
      <String STYLEREFS="ID4" HEIGHT="1922.0" WIDTH="244.0" HPOS="1316.0" 
VPOS="123644.0" CONTENT="The" WC="1.0"/>
      <SP WIDTH="-244.0" HPOS="1560.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="1914.0" WIDTH="424.0" HPOS="1664.0" 
VPOS="123711.0" CONTENT="female" WC="1.0"/>
      <SP WIDTH="184.0" HPOS="1480.0" VPOS="123644.0"/>
      <String STYLEREFS="ID4" HEIGHT="2174.0" WIDTH="240.0" HPOS="2192.0" 
VPOS="123711.0" CONTENT="lays" WC="1.0"/>
      <SP WIDTH="104.0" HPOS="2088.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1981.0" WIDTH="360.0" HPOS="2528.0" 
VPOS="123711.0" CONTENT="about" WC="1.0"/>
      <SP WIDTH="236.0" HPOS="2292.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1855.0" WIDTH="216.0" HPOS="3000.0" 
VPOS="123770.0" CONTENT="140" WC="1.0"/>
      <SP WIDTH="112.0" HPOS="2888.0" VPOS="123711.0"/>
      <String STYLEREFS="ID4" HEIGHT="1729.0" WIDTH="284.0" HPOS="3316.0" 
VPOS="124223.0" CONTENT="eggs" WC="1.0"/>
      <SP WIDTH="100.0" HPOS="3216.0" VPOS="123770.0"/>
     </TextLine>


Reply via email to