I need to index documents whose Metadata fields do not contain all the information I need. I would like to extract the document title from the document body when the Title Metadata field is empty. In addition, many of the documents contain a table with information on the document subject. One of the cells in the table is labelled 'Abstract:', which indicates that the topic is given in the neighbouring cell.
I would store the title and abstract in separate fields, both so that they are available when presenting the search results and so that I can boost them to make the results more relevant. First of all, I would like to ask whether that is a good idea, especially since I do not yet know exactly how I would extract this information.

As it is now, the title is in the first line of the parsed text, followed by _space_ and the contents of the next row. The same goes for the abstract: it is separated by _space_ from the contents of the next row. That is, the stream looks like this:

    Let's say this is the title _space_ New Line text Abstract: This would be the paper subject _space_ new column

I suppose that I should write a custom ContentHandler, or modify the existing BodyContentHandler that receives the SAX events? If so, a couple of lines of code showing the direction to go would be of immense help.
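One possible direction, as a minimal sketch rather than a definitive implementation: a custom SAX handler that watches the element events instead of the flattened text. It assumes the parser reports paragraphs as `p` elements and table cells as `td` elements (as in XHTML output), that the first non-empty text block is the title, and that the label cell contains exactly `Abstract:`. The class name `TitleAbstractHandler` and its methods are hypothetical, and the demo uses the JDK's own SAX parser on a hand-written XHTML string so that it runs without any extra libraries.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the first text block as the title and the
// cell that follows an "Abstract:" label cell as the abstract.
class TitleAbstractHandler extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    private boolean capturing;          // true while inside a <p> or <td>
    private boolean nextBlockIsAbstract; // set after the "Abstract:" label
    private String title;
    private String abstractText;

    // Assumption: text blocks arrive as <p> or <td> elements.
    private static boolean isBlock(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "p".equals(name) || "td".equals(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isBlock(localName, qName)) {
            current.setLength(0);
            capturing = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (capturing) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isBlock(localName, qName)) {
            return;
        }
        capturing = false;
        String text = current.toString().trim();
        current.setLength(0); // avoids handling <td><p>..</p></td> twice
        if (text.isEmpty()) {
            return;
        }
        if (title == null) {
            title = text; // assumption: first non-empty block is the title
        }
        if (nextBlockIsAbstract) {
            abstractText = text;
            nextBlockIsAbstract = false;
        } else if (abstractText == null && text.equals("Abstract:")) {
            nextBlockIsAbstract = true;
        }
    }

    String getTitle() { return title; }

    String getAbstract() { return abstractText; }

    // Demo helper: runs the handler over an XHTML string with the JDK parser.
    static TitleAbstractHandler extract(String xhtml) {
        TitleAbstractHandler handler = new TitleAbstractHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TitleAbstractHandler h = extract(
                "<html><body><p>Let's say this is the title</p>"
                + "<table><tr><td>Abstract:</td>"
                + "<td>This would be the paper subject</td></tr></table>"
                + "</body></html>");
        System.out.println("title: " + h.getTitle());
        System.out.println("abstract: " + h.getAbstract());
    }
}
```

With Tika, the same handler could presumably be passed to `Parser.parse(stream, handler, metadata, context)` in place of, or alongside, BodyContentHandler, and the extracted title would only be used when the Title Metadata field is empty. The two values can then be added to the Lucene document as separate stored fields. Whether the element names match what the parser really emits for these documents would need to be checked against the actual output.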