Re: Indexing/Querying Annotations and Fields for a document

lucene-seme1 s Tue, 18 Mar 2008 12:09:46 -0700

Can you please share the custom Analyzer you have? In particular, I am
interested in knowing how to get access to the position and offset values
for each token.


Regards,
JK

On Tue, Mar 18, 2008 at 10:48 AM, mark harwood <[EMAIL PROTECTED]>
wrote:

> I've used a custom analyzer before now to "blend in" GATE annotations as
> tokens at the same position as the words they relate to.
>
> E.g.
>    Fred Smith works for Microsoft
>
> would be tokenized ordinarily as the following tokens:
>
>    position    offset    text
>    ========    ======    ====
>    1           0         fred
>    2           5         smith
>    3           11        works
>    ....
> But in a custom analyzer you would know the offsets of all these normal
> tokens and also have visibility of the GATE annotations, including their
> offsets. Your custom analyzer can blend the two to produce the following:
>
>    position    offset    text
>    ========    ======    ====
>    1           0         fred
>    1           0         GATE_PERSON
>    2           5         smith
>    3           11        works
>
> The trick to adding "GATE_PERSON" at the same position as "fred" is to set
> the "position increment" of this token to zero.
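
A minimal sketch of that blending step, in plain Java rather than against the Lucene API, in case it helps to see where the position and offset values come from. Everything here is hypothetical and for illustration only: the Tok and Annotation classes, the blend and positions methods, and the whitespace tokenization are assumptions of this sketch, not Lucene or GATE classes. A real analyzer would set the same two values (start offset and position increment) on Lucene's own token objects.

```java
import java.util.ArrayList;
import java.util.List;

class AnnotationBlender {
    /** Hypothetical token model: text, start offset, position increment. */
    static final class Tok {
        final String text;
        final int startOffset;
        final int positionIncrement;

        Tok(String text, int startOffset, int positionIncrement) {
            this.text = text;
            this.startOffset = startOffset;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Hypothetical annotation model: a label covering text[start, end). */
    static final class Annotation {
        final String label;
        final int start;
        final int end;

        Annotation(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace, lowercases words, and injects labels with increment 0. */
    static List<Tok> blend(String text, List<Annotation> annotations) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            // An ordinary word advances the position by one.
            out.add(new Tok(text.substring(start, i).toLowerCase(), start, 1));
            for (Annotation a : annotations) {
                // An annotation that starts at this word shares the word's position.
                if (a.start == start) {
                    out.add(new Tok(a.label, start, 0));
                }
            }
        }
        return out;
    }

    /** Absolute positions are the running sum of the position increments. */
    static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        for (Tok t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Annotation> annotations = new ArrayList<>();
        annotations.add(new Annotation("GATE_PERSON", 0, 4));
        List<Tok> tokens = blend("Fred Smith works for Microsoft", annotations);
        List<Integer> positions = positions(tokens);
        for (int k = 0; k < tokens.size(); k++) {
            Tok t = tokens.get(k);
            System.out.println(positions.get(k) + "\t" + t.startOffset + "\t" + t.text);
        }
    }
}
```

Summing the increments gives each token's position, so a token with increment zero lands on the same position as the token emitted just before it.
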
>
> Now you can construct a Lucene query that makes use of this position
> information, i.e. instead of searching for the specific:
>
>    "Fred works for Microsoft"~5
>
> you can now search for the more general:
>
>    "GATE_PERSON works for microsoft"~5
>
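
To illustrate why the shared position makes the more general query match, here is a rough sketch in plain Java. It is not Lucene's implementation: the SloppyPhraseSketch class and its index and matches methods are hypothetical, and the matching rule (each term's displacement from its place in the exact phrase may vary by at most the slop) is a simplifying assumption that only approximates how Lucene treats sloppy phrases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SloppyPhraseSketch {
    /** Builds term -> positions for one document; several terms may share a position. */
    static Map<String, List<Integer>> index(String[][] termsByPosition) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < termsByPosition.length; pos++) {
            for (String term : termsByPosition[pos]) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
            }
        }
        return postings;
    }

    /** True if every phrase term occurs and the displacements differ by at most slop. */
    static boolean matches(Map<String, List<Integer>> postings, String[] phrase, int slop) {
        return search(postings, phrase, slop, 0, Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    // Tries every occurrence of phrase[i], tracking the smallest and largest
    // displacement (document position minus index within the phrase) so far.
    private static boolean search(Map<String, List<Integer>> postings, String[] phrase,
                                  int slop, int i, int min, int max) {
        if (i == phrase.length) {
            return true;
        }
        List<Integer> occurrences = postings.getOrDefault(phrase[i], Collections.<Integer>emptyList());
        for (int pos : occurrences) {
            int shift = pos - i;
            int newMin = Math.min(min, shift);
            int newMax = Math.max(max, shift);
            if (newMax - newMin <= slop && search(postings, phrase, slop, i + 1, newMin, newMax)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "fred" and "GATE_PERSON" share the first position, as in the blended table.
        String[][] doc = {
            {"fred", "GATE_PERSON"}, {"smith"}, {"works"}, {"for"}, {"microsoft"}
        };
        Map<String, List<Integer>> postings = index(doc);
        String[] general = {"GATE_PERSON", "works", "for", "microsoft"};
        System.out.println(matches(postings, general, 5));
        System.out.println(matches(postings, general, 0));
    }
}
```

Under this rule the skipped word "smith" costs one unit of slop, so the general phrase is found with a slop of 5 but not with a slop of 0.
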
> The GATE tokens, e.g. "GATE_PERSON", would have to be terms you wouldn't
> expect to find in normal text, so that they don't clash with real words.
> Another way of doing this, which avoids the problem, might be to look at
> the new payloads API.
> Would anyone care to weigh in on whether this is feasible, and on the
> state of play with payloads?
>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: Grant Ingersoll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 18 March, 2008 12:24:02 AM
> Subject: Re: Indexing/Querying Annotations and Fields for a document
>
> You would parse the XML (or whatever) into separate strings, and put
> each piece into its own Field in a Lucene Document.
>
> For instance:
>
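> // Uses org.apache.lucene.document.Document and Field (Lucene 2.x API)
> Document doc = new Document();
> String body = getBody(input);
> String people = getPeople(input);
> // Field arguments: name, value, store option, index option
> doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("people", people, Field.Store.YES, Field.Index.TOKENIZED));
>
> // writer is an already-open IndexWriter
> writer.addDocument(doc);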
>
>
> Essentially, you just need to implement the getPeople and getBody
> methods to extract the appropriate content from your text.
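
For what it's worth, a minimal sketch of what getBody and getPeople could look like, assuming the annotations are inline tags of the form <Person>John</Person> as in the original question. The AnnotationExtractor class, the regular expressions, and the choice to join names with a space are assumptions of this sketch; it is not a general XML parser and does not handle nested or attributed tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnnotationExtractor {
    // Captures the text between <Person> and </Person>, ignoring padding spaces.
    private static final Pattern PERSON = Pattern.compile("<Person>\\s*(.*?)\\s*</Person>");
    // Matches any simple opening or closing tag such as <Person> or </Person>.
    private static final Pattern ANY_TAG = Pattern.compile("</?\\w+>");

    /** Returns all Person annotations, separated by single spaces. */
    static String getPeople(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = PERSON.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return String.join(" ", names);
    }

    /** Returns the original text with the tags removed and whitespace collapsed. */
    static String getBody(String input) {
        String withoutTags = ANY_TAG.matcher(input).replaceAll(" ");
        return withoutTags.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "<Person> John </Person> loves fishing";
        System.out.println(getBody(input));
        System.out.println(getPeople(input));
    }
}
```

The two strings would then go into the "body" and "people" fields of the Document. Joining names with a space means a phrase query on the people field could match across two adjacent names, so a different separator or one field value per name may be preferable.
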
>
>
> On Mar 17, 2008, at 5:05 PM, lucene-seme1 s wrote:
>
> > I already have the document preprocessed, and the annotations (i.e.
> > <Person>John</Person>) are already stored in an array, with features
> > attached to some annotations (such as the root and lemma of the word).
> > Can you please elaborate some more on how to "index them as normally
> > would"?
> >
> > Regards,
> > JK
> >
> >
> > On Mon, Mar 17, 2008 at 4:33 PM, Grant Ingersoll <[EMAIL PROTECTED]>
> > wrote:
> >
> >> I think there are a couple of ways you can approach this, although I
> >> have never used GATE.
> >>
> >> If these annotations are marked in line in your content, then you can
> >> either preprocess the files to have them separately and index as you
> >> normally would, or you can use the relatively new TeeTokenFilter and
> >> SinkTokenizer to extract them as you go for use in other fields.  I
> >> have done this successfully for some apps that I have worked on and I
> >> think it works quite nicely and beats preprocessing IMO.  Essentially,
> >> you set up a TeeTokenFilter that recognizes your Person and then set
> >> that token aside in the Sink.  Then, when you construct the Person
> >> field, you use the SinkTokenizer.
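
The tee-and-sink idea can be sketched in plain Java as follows. This does not use Lucene's TeeTokenFilter or SinkTokenizer; the TeeSketch and Tee classes, and the set of known person names used to recognize a Person token, are hypothetical stand-ins meant only to show the flow of tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class TeeSketch {
    /** Passes every token through unchanged and copies the wanted ones into a sink. */
    static final class Tee implements Iterator<String> {
        private final Iterator<String> input;
        private final Predicate<String> wanted;
        private final List<String> sink;

        Tee(Iterator<String> input, Predicate<String> wanted, List<String> sink) {
            this.input = input;
            this.wanted = wanted;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            if (wanted.test(token)) {
                sink.add(token); // set the token aside for the second field
            }
            return token;        // the main field still sees every token
        }
    }

    public static void main(String[] args) {
        // Assumption: person names are already known, e.g. from the annotations.
        Set<String> people = new HashSet<>(Arrays.asList("John"));
        List<String> sink = new ArrayList<>();
        Iterator<String> body =
            new Tee(Arrays.asList("John", "loves", "fishing").iterator(), people::contains, sink);

        List<String> bodyTokens = new ArrayList<>();
        while (body.hasNext()) {
            bodyTokens.add(body.next());
        }
        System.out.println("body field tokens:   " + bodyTokens);
        System.out.println("people field tokens: " + sink);
    }
}
```

The sink only fills up as the main stream is consumed, so the field that reads from the sink has to be processed after the field that feeds it.
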
> >>
> >> HTH,
> >> Grant
> >>
> >> On Mar 17, 2008, at 8:54 AM, lucene-seme1 s wrote:
> >>
> >>> Hello,
> >>>
> >>> I am a newbie here and still experimenting with Lucene. I have
> >>> annotations and features generated by GATE for many documents and
> >>> would like to index the original content of the documents in addition
> >>> to the generated annotations. The annotations are in the form of
> >>> [<Person> John </Person> loves fishing]. I would like to be able to
> >>> search using the Person attribute.
> >>>
> >>> Any hint or suggestion is highly appreciated.
> >>>
> >>> regards,
> >>> JK
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucenebootcamp.com
> >> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
>
