On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:
On Wednesday 04 January 2006 07:34, Dave Kor wrote:
Hi,
I would like to associate information (or labels) with each word or a
range of words in a document. For example: this word is a noun, that
word is a verb, this period marks the end of a sentence, "kick the
bucket" is a contiguous phrase, "white house" is a location, and so on.
I am seeking a good representation for such information so that it can
be easily stored as additional fields in a Lucene document and easily
recovered after a search. For the more technically inclined, this would
allow me to store part-of-speech tags, chunk tags, sentence boundary
markers and parse trees for every indexed document.
This additional information will enable additional post-processing on
retrieved documents for various purposes such as information
extraction, summarization, question answering, etc. Is there any
available API? If not, I would appreciate any suggestions and tips on
how such information can best be stored in a Lucene document.
Basically, the index information available in Lucene is the Term, which
is a combination of a field name and a token. For each Term, Lucene
indexes document presence and all positions within a document. Lucene
also indexes the field length as a norm.
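
To make that description concrete, here is a rough sketch of such a
structure in plain Java. It is a toy model, not Lucene's API: the class
name ToyPositionalIndex, the "field:token" key and the plain token count
standing in for the norm are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a positional inverted index. NOT Lucene's API; all names
// here are hypothetical and chosen only to mirror the description above.
class ToyPositionalIndex {
    // term ("field:token") -> docId -> positions of that token in the field
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    // "field#docId" -> token count, a stand-in for the field length norm
    private final Map<String, Integer> fieldLengths = new HashMap<>();

    // Index one field of one document from an already tokenized array.
    void addField(int docId, String field, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(field + ":" + tokens[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
        fieldLengths.put(field + "#" + docId, tokens.length);
    }

    // Positions of a term within one document; empty if the term is absent.
    List<Integer> positions(String field, String token, int docId) {
        Map<Integer, List<Integer>> byDoc =
                postings.getOrDefault(field + ":" + token, Collections.emptyMap());
        return byDoc.getOrDefault(docId, Collections.emptyList());
    }

    // Number of tokens indexed for the field, or 0 if nothing was indexed.
    int fieldLength(String field, int docId) {
        return fieldLengths.getOrDefault(field + "#" + docId, 0);
    }
}
```

The same token in two different fields yields two different terms here,
which is why tags kept in a separate field get their own postings.
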
By using one or more extra fields, the tags and sentence boundary
markers can easily be indexed at their positions. To search these, have
a look at the span package.
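
As a sketch of what "indexed at their positions" could look like before
anything reaches Lucene, the snippet below splits tagged text into two
position-aligned token lists, one per field. The "word/TAG" input
format, the field names "text" and "pos", the "UNK" fallback tag and the
class name are assumptions for illustration, not anything Lucene
prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Splits "word/TAG" input into two parallel token lists so that the word
// at position i in "text" has its tag at position i in "pos".
// Hypothetical helper; the input format is an assumption.
class AlignedTagFields {
    static Map<String, List<String>> split(String tagged) {
        List<String> words = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        String trimmed = tagged.trim();
        if (!trimmed.isEmpty()) {
            for (String item : trimmed.split("\\s+")) {
                // Use the last slash so words containing '/' stay intact.
                int slash = item.lastIndexOf('/');
                if (slash < 0) {
                    words.add(item);
                    tags.add("UNK"); // untagged token: keep positions aligned
                } else {
                    words.add(item.substring(0, slash));
                    tags.add(item.substring(slash + 1));
                }
            }
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("text", words);
        fields.put("pos", tags);
        return fields;
    }
}
```

Each list would then be fed to its own field, one token per position, so
that a tag and its word share the same position number.
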
In case you want to search for tokens combined with some
(part-of-speech) tag, and the tokens and their tags are in different
fields, the span package is not sufficient, because it does not allow
position search over different fields.
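
One way to picture the missing cross-field step: if the positions of a
token in one field and of a tag in another field have already been
obtained for the same document, the matches are simply the positions the
two sorted lists share. The sketch below does that merge in plain Java,
outside Lucene; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Intersects two ascending position lists, e.g. positions of "house" in a
// "text" field and positions of "NN" in a "pos" field of the same document.
// Hypothetical helper, not part of the span package.
class CrossFieldPositions {
    static List<Integer> samePositions(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Classic two-pointer merge over sorted lists.
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

This only works if both fields were indexed with aligned positions, and
both input lists must be sorted in ascending order.
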
Paul - I'm interested in this topic myself. Suppose the "text" field
is indexed, and entities such as names and places are also detected.
Suppose I'd like a query that was "all names that have the initials
EH in the text field" (where we could identify EH names by doing a
SpanRegexQuery for "E.* H.*").
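
To illustrate the matching idea only (this is plain java.util.regex over
a token array, not SpanRegexQuery), the sketch below finds adjacent
token pairs where the first matches one pattern and the second matches
another. The class name, method name and sample tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Finds start positions of adjacent token pairs whose first token matches
// firstRegex and whose second token matches secondRegex.
// Illustration only; this does not use or emulate Lucene's query classes.
class InitialsMatcher {
    static List<Integer> findPairs(String[] tokens, String firstRegex, String secondRegex) {
        Pattern first = Pattern.compile(firstRegex);
        Pattern second = Pattern.compile(secondRegex);
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            // matches() requires the whole token to match the pattern.
            if (first.matcher(tokens[i]).matches()
                    && second.matcher(tokens[i + 1]).matches()) {
                starts.add(i);
            }
        }
        return starts;
    }
}
```

Called as findPairs(tokens, "E.*", "H.*"), this returns every position
where an E-word is directly followed by an H-word. On its own it cannot
tell names from other words, which is where the detected entities would
have to come in.
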
I've been pondering whether it makes sense for Lucene to be enhanced
to carry over a Token's type into the index such that it could factor
into the query also.
Thoughts?
Erik