Re: token type question

Pierrick Brihaye Fri, 22 Apr 2005 00:34:33 -0700

Hi,

[EMAIL PROTECTED] a écrit :

Thanks Pierrick.
Are you say that I should construct Token in analyzer like
new Token ("chem_H2O", 100, 103, "chem");
note that chem_ is added prefix to H2O, and 100 to 103 is length of H2O rather than chem_H2O?

Well... 100 to 103 are offsets provided by the reader (an are thus usually offsets in the source file). These offsets may help you to make some computations but they are lost when the token is indexed.

I want to index H2O in a compound, say H2O-CH2. say I want a query to find out H2O in a compound. How can I do that?


Dirty solution : use wildcard queries (chem_H20-*).

Smart solution :

consider that you have "words", i.e. chem_H2O followed by chem_CH2 (strange enough ;-)... and make some phrase queries or MultiPhraseQueries. See :

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestPhraseQuery.java?rev=150740&view=markup
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestMultiPhraseQuery.java?rev=150733&view=markup

Setting the phrase slop may help you.

You may also want to play with the position of the tokens to allow/prevent hits from your PhraseQuery. Give your Tokens a relevant positionIncrement. See : http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestPositionIncrement.java?rev=150585&view=markup

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: token type question

Reply via email to