Re: Storing Part of Speech information in Lucene Indices

mark harwood Wed, 12 Jul 2006 08:55:35 -0700

>>Appending POS to the terms will create post processing nightmare

I think you may have missed the subtle distinction between Grant's suggestion 
and mine.


His suggestion was to append your POS info to the source token - creating a 
single token which combined both the original content and your POS info.
My suggestion was to generate *two* tokens - one is the original source token, 
untouched, the other is the POS token but the trick is to set the "position 
increment" to 0 for the POS token. This ensures that the POS token is 
considered to occur at exactly the same location as the source token. The 
original source token is untampered with and has all the usual stats in the 
index.The POS token and/or source token can be used interchangeably in queries 
on the same field so you can use position info eg phrase or span queries.

Cheers
Mark



----- Original Message ----
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 July, 2006 4:15:34 PM
Subject: Re: Storing Part of Speech information in Lucene Indices

We need to be able to search by word and POS and also have POS  
available for each occurrence.  Appending POS to the terms will  
create post processing nightmare to retrieve
term frequencies right? (I would have to add all the foo_NN and  
foo_ADJ etc.).

I can store the POS in a parallel field and access it via term  
vectors, but that wouldn't allow any kind of search on POS related  
fields right?  For example if I wanted to search for any
adjective with in 3 words of say a term or say If I wanted to get all  
the patterns that follow the sequence ADJ NN ADJ.

Let me look in the developer archives for the payload discussions,  
perhaps implementing that might satisfy my use cases.

Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:

> Hi Amit,
>
> This is definitely something you can do.   What are your goals for  
> it?  Do you want to search by word and POS or do you just want POS  
> available for post processing?
>
> You could just append the POS tag onto the end of your token as it  
> gets indexed, something like foo_NN or foo_ADJ.  This approach may  
> mean you have to use prefix query when you want to search against  
> just "foo".    You could also have a parallel field to your main  
> field that stores the POS.  Then you could access it via the term  
> vectors array.
>
> Also, we have been discussing on the developers list on how to add  
> payloads to a posting (i.e. store related information at a position  
> in the index) similar to what Google discusses in their original  
> paper.  Unfortunately, this isn't implemented yet, but if you feel  
> like helping out, check out the discussion on the developer's list  
> (see Flexible Indexing).
>
> -Grant
>
> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>
>> Hi,
>>
>> A new project that I am investigating lucene for needs the  Parts  
>> of speech information for the tokens. I can  get that
>> information using NLP techniques  (GATE etc.), by pre processing  
>> the documents but I would like to  store that
>> information in the Indices. Something along the lines of
>>
>> TermVectorOffsetInfo[?].getPartofSpeech();
>>
>> I am writing to ask for your advice, you can tell me I am b o n k  
>> e r s  or let me know where I should start digging :).
>> Is that a good idea? Or would it be just less trouble for me to  
>> store the offset information along with parts of speech
>> outside Lucene.
>>
>> Has anyone else done that?
>>
>> Best,
>> Amit
>>
>>
>> ps: Thank you for putting the LuceneInAction source online, it was  
>> a great help to see the CategorizerTest.java.
>> I am ordering my copy of the book tomorrow :)
>>
>> ---------------------------------------------------------
>> Amit Kumar
>> Research Programmer
>> The Graduate School of Library and Information Science
>> University of Illinois, Urbana Champaign IL, 61820
>> phone: 217-333-4118 fax: 217-244-3302
>> ---------------------------------------------------------
>>
>>
>>
>>
>>
>>
>
>
>
> --------------------------
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> 335 Hinds Hall
> Syracuse, NY 13244
> http://www.cnlp.org
>
> Voice: 315-443-5484
> Fax: 315-443-6886
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------








---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Storing Part of Speech information in Lucene Indices

Reply via email to