Re: Good representation for part-of-speech, chunk, sentence boundary tags?

Erik Hatcher Wed, 04 Jan 2006 05:15:04 -0800


On Jan 4, 2006, at 7:53 AM, Paul Elschot wrote:

On Wednesday 04 January 2006 07:34, Dave Kor wrote:
Hi,
I would like to associate information (or labels) with each word or a range of words in a document. Information such as this word is a noun, that word is a verb, this period marks the end of a sentence, "kick the bucket" is a contiguous phrase, "white house" is a location and so on. I am seeking a good representation for such information so that it can be easily stored as additional fields in a Lucene document, and easily recovered after a search. For the more technically inclined, this would allow me to store part-of-speech tags, chunk tags, sentence boundary markers and parse trees for every indexed document.
This additional information will enable Lucene to perform additional post-processing on retrieved documents for various purposes such as information extraction, summarization, question answering, etc. Is there any available API? If not, I would appreciate any suggestions and tips on how such information can best be stored in a Lucene document.

Basically, the index information available in Lucene is the Term, which is a combination of a field name and a token. For these, Lucene indexes document presence and all positions within a document. Lucene also indexes the field length as a norm.
By using one or more extra fields, the tags and sentence boundary markers can be easily indexed at their positions. To search these, have a look at the span package.
In case you want to search for tokens combined with some (part-of-speech) tag, and the tokens and their tags are in different fields, the span package is not sufficient, because it does not allow position search over different fields.
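
As a rough illustration of the position alignment described above, here is a minimal plain-Java sketch. It does not use any Lucene API. It keeps the tokens and their tags as two parallel "fields" whose entries share positions, then intersects positions across the two fields, which is the cross-field step the span package is said not to provide. The class name, method names, and tag labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AlignedFields {
    // Builds a tiny positional index for one "field": term -> positions where it occurs.
    static Map<String, List<Integer>> index(String[] terms) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // Returns the positions where `token` occurs in the text field AND
    // `tag` occurs at the same position in the tag field.
    static List<Integer> positionsOf(Map<String, List<Integer>> text, String token,
                                     Map<String, List<Integer>> tags, String tag) {
        List<Integer> empty = Collections.emptyList();
        List<Integer> hits = new ArrayList<>(text.getOrDefault(token, empty));
        hits.retainAll(tags.getOrDefault(tag, empty));
        return hits;
    }
}
```

With the tokens "the white house can house guests" and the (assumed) tags DT JJ NN MD VB NNS, asking for "house" tagged VB yields position 4 only, while "house" tagged NN yields position 2 only.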

Paul - I'm interested in this topic myself. Suppose the "text" field is indexed but also entities are detected like names and places. Suppose I'd like a query that was "all names that have the initials EH in the text field" (where we could identify EH names by doing a SpanRegexQuery for "E.* H.*").

I've been pondering whether it makes sense for Lucene to be enhanced to carry over a Token's type into the index such that it could factor into the query also.
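
To make the idea concrete, here is a small plain-Java sketch of what a query might look like if each indexed token carried its type. This is not Lucene code and not a proposal for the actual index format; the `Posting` class, the "NAME" and "WORD" type labels, and the `adjacentPairs` method are all hypothetical. It assumes the postings are already sorted by position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TypedTokens {
    // One indexed token: its text, its position, and a type label (hypothetical).
    static final class Posting {
        final String term;
        final int position;
        final String type;

        Posting(String term, int position, String type) {
            this.term = term;
            this.position = position;
            this.type = type;
        }
    }

    // Finds adjacent token pairs that both carry `type` and whose terms
    // fully match firstRegex and secondRegex respectively.
    static List<String> adjacentPairs(List<Posting> postings, String type,
                                      String firstRegex, String secondRegex) {
        Pattern first = Pattern.compile(firstRegex);
        Pattern second = Pattern.compile(secondRegex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i + 1 < postings.size(); i++) {
            Posting a = postings.get(i);
            Posting b = postings.get(i + 1);
            boolean adjacent = b.position == a.position + 1;
            boolean typed = type.equals(a.type) && type.equals(b.type);
            if (adjacent && typed
                    && first.matcher(a.term).matches()
                    && second.matcher(b.term).matches()) {
                hits.add(a.term + " " + b.term);
            }
        }
        return hits;
    }
}
```

Calling `adjacentPairs(postings, "NAME", "E.*", "H.*")` on a stream containing "Erik Hatcher" typed as NAME and "Every Human" typed as WORD would keep only the first pair, since the second matches the regular expressions but not the type.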


Thoughts?

        Erik


