On Wednesday 04 January 2006 14:14, Erik Hatcher wrote:
>
> On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
>
> > On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> >> Hi,
> >> ...
> >>
> >> These additional information will enable Lucene to perform additional
> >> post-processing on retrieved documents for various purposes such as
> >> information extraction, summarization, question answering, etc...
> >> Is there any available api? If not, I would appreciate any
> >> suggestions and tips on how such information can best be stored in
> >> a Lucene document.
> >
> > Basically, the index information available in Lucene is the Term,
> > which is a combination of a field name and a token. For these Lucene
> > indexes document presence and all positions within a document.
> > Lucene also indexes the field length as a norm.
> > By using one or more extra fields the tags and sentence boundary
> > markers can be easily indexed at their positions. To search these
> > have a look at the span package.
> > In case you want to search for tokens combined with some (part of
> > speech) tag, and the tokens and their tags are in different fields,
> > the span package is not sufficient, because it does not allow
> > position search over different fields.
>
> Paul - I'm interested in this topic myself. Suppose the "text" field
> is indexed but also entities are detected like names and places.
> Suppose I'd like a query that was "all names that have the initials
> EH in the text field" (where we could identify EH names by doing a
> SpanRegexQuery for "E.* H.*").
>
> I've been pondering whether it makes sense for Lucene to be enhanced
> to carry over a Token's type into the index such that it could factor
> into the query also.
>
> Thoughts?

When the token type is a (small) integer, it could be stored alongside
the indexed position as additional information in the proximity index.
The document level index could also be extended so that it can find the
documents containing this typed token. Part of speech tags would map
nicely to this token type.

In the (recent) past there has been talk of structured fields and
sentence/paragraph level fields, but I don't recall details of intended
implementations. Sentence/paragraph level fields are probably better
handled by extra proximity indexes for the same field, i.e. without
repeating the term index. For this the document level index would need
a pointer into each proximity index; currently it has only one pointer
into the only proximity index. When there are only a few types/tags,
each of them could be mapped to a level like this.

So there are (at least) two independent possibilities:
- store type/tag info in the proximity index, and maybe allow
  searching this on document level.
- add more proximity indexes for the same field.

Regards,
Paul Elschot
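
P.S. To make the first possibility a bit more concrete, here is a toy sketch in plain Python. It is not Lucene code and says nothing about Lucene's actual file formats; the class TypedPositionIndex, its methods, and the integer type codes are all made up for illustration. Each occurrence keeps its position together with a small integer type, and a separate document level map answers "which documents contain this term with this type".

```python
from collections import defaultdict


class TypedPositionIndex:
    """Toy in-memory index (hypothetical, not a Lucene API).

    postings:            term -> doc id -> [(position, token type), ...]
    docs_by_typed_term:  (term, token type) -> {doc id, ...}
    """

    def __init__(self):
        # "Proximity" data: every occurrence carries a position and a type.
        self.postings = defaultdict(lambda: defaultdict(list))
        # "Document level" data: lets us find documents by typed token.
        self.docs_by_typed_term = defaultdict(set)

    def add_document(self, doc_id, typed_tokens):
        """Index a list of (token, type) pairs given in text order."""
        for position, (token, token_type) in enumerate(typed_tokens):
            self.postings[token][doc_id].append((position, token_type))
            self.docs_by_typed_term[(token, token_type)].add(doc_id)

    def docs_with(self, token, token_type):
        """Sorted doc ids containing the token with the given type."""
        return sorted(self.docs_by_typed_term.get((token, token_type), set()))

    def positions(self, token, doc_id, token_type=None):
        """Positions of the token in one document, optionally filtered by type."""
        occurrences = self.postings.get(token, {}).get(doc_id, [])
        return [p for p, t in occurrences if token_type is None or t == token_type]


# Assumed type codes, e.g. produced by a part of speech tagger.
OTHER, NOUN, VERB = 0, 1, 2

index = TypedPositionIndex()
index.add_document(1, [("time", NOUN), ("flies", VERB), ("like", OTHER),
                       ("an", OTHER), ("arrow", NOUN)])
index.add_document(2, [("fruit", NOUN), ("flies", NOUN), ("like", VERB),
                       ("a", OTHER), ("banana", NOUN)])

print(index.docs_with("flies", VERB))    # [1]
print(index.positions("like", 2, VERB))  # [2]
```

Because the type sits next to the position, a position based query could filter on it without a second field, which is the case the span package cannot handle when tokens and tags live in different fields.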