So I've been thinking about this more, and what seems most plausible is to just use Store.NO for the delimiters, so they will have payloads encoded correctly but not affect the stored data, and to store separate instances of the subId information which can be retrieved at the boundaries. There should be about three subIds on average, so I don't think extracting the separate field instances would be too big a performance hit. Something like the sketch below is what I have in mind.
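Concretely, a minimal sketch of the document layout (the field names "content" and "subId" are just placeholders, not anything from the actual schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SubIdDocumentBuilder {
    public static Document build() {
        Document doc = new Document();

        // Indexed but NOT stored: the analyzed stream keeps the DELIMITER
        // tokens (and their payloads) out of the stored data entirely.
        doc.add(new Field("content",
                "subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1",
                Field.Store.NO, Field.Index.ANALYZED));

        // Stored but NOT indexed: one field instance per subId, so the
        // display values can be matched back to the right subId.
        doc.add(new Field("subId", "subId0Value0 subId0Value1",
                Field.Store.YES, Field.Index.NO));
        doc.add(new Field("subId", "subId1Value0 subId1Value1",
                Field.Store.YES, Field.Index.NO));

        return doc;
    }
}

At display time, searcher.doc(docId).getValues("subId") should hand back the per-subId strings in the order they were added, so with roughly three subIds per document the extra retrieval cost looks small.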
What I find myself wanting is to be able to pass metadata along with the Field() when it is added to the document, and then retrieve that metadata from inside the TokenFilter. I guess this would be similar to adding column stride fields, but with multiple ones at different positions in the document. A sketch of the filter itself follows the quoted message below.

-----Original Message-----
From: Beard, Brian [mailto:brian.be...@mybir.com]
Sent: Thursday, August 19, 2010 2:02 PM
To: java-user@lucene.apache.org
Subject: Tokenization / Analyzer question

I'm using Lucene 2.9.1. I'm indexing documents which correspond to an ID. Each field in the ID document is made up of data from all subIds (it's a requirement that searches must work across all subIds within an ID). They will be indexed and stored in a format similar to:

subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1 DELIMITER ...

This is so the stored value strings can be matched up with the correct subId for display purposes. In addition, there may be more than one type of delimiter to denote the type of data within a subId's data. Moreover, for some of the fields it is desirable to be able to tell which subId the search terms corresponded to.

What I'd like to do is use payloads to encode the delimiter type into the payload, but use the same text string for all delimiter types. The delimiter tokens shouldn't affect searches, because they will be filtered out by the analyzer. This way, during a post-processing step on the returned results, all payloads can be extracted at once by a single term, instead of having to loop through them term by term (I don't see a way to get them on a per-document basis - otherwise I could add a payload to the first term when the subId changes during indexing). The search term positions can then be checked against the delimiter positions to determine the record and the type of data.

I've got an example performing the tokenization and payload encoding correctly, but the stored values are not the ones I want. What I've done is write a BoundaryTokenFilter class which is always instantiated *after* the StandardAnalyzer, WhitespaceAnalyzer, etc. - new BoundaryTokenFilter(new StandardAnalyzer().tokenStream(...)). In order to get the delimiter tokens down to BoundaryTokenFilter intact through the other tokenizers/TokenFilters higher up the chain, I'm using character-based delimiters (a different one for each delimiter type, which also carries some additional encoded information for the payload). Inside BoundaryTokenFilter, the character-based delimiter is decoded, the token is changed over to the single delimiter value, and the payload is added.

That part works, but the problem I'm having is that I would like the delimiter values stored in the index to be the same as the single tokenized delimiter values. Is there any way to do this - transform the stored values into the tokenized values, but only for select tokens?

The other alternative I can think of, if that isn't possible, is to modify StandardAnalyzer to have indexing and search modes, so that during indexing it would let the delimiter tokens pass through and flag them with a TypeAttribute, while in search mode it would not. For this, though, I would most likely end up having to use different delimiter values.
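Here's a stripped-down sketch of the filter against the 2.9 attribute API. The marker prefix ("zzdelim") and the normalized DELIMITER token are made-up placeholders for the real character-based encoding, just to show the mechanics:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class BoundaryTokenFilter extends TokenFilter {
    // Hypothetical stand-ins: markers like "zzdelim0" survive the upstream
    // tokenizers, and all of them normalize to the same DELIMITER token.
    private static final String DELIM_PREFIX = "zzdelim";
    private static final String DELIM_TOKEN = "DELIMITER";

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public BoundaryTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (term.startsWith(DELIM_PREFIX)) {
            // Decode the delimiter type that was encoded in the marker and
            // move it into the payload; the token text itself is rewritten
            // to the single shared delimiter value.
            byte delimType = (byte) (term.charAt(DELIM_PREFIX.length()) - '0');
            payloadAtt.setPayload(new Payload(new byte[] { delimType }));
            termAtt.setTermBuffer(DELIM_TOKEN);
        }
        return true;
    }
}

It would be wired up the way described above, e.g. new BoundaryTokenFilter(analyzer.tokenStream("content", reader)), so the delimiter normalization and payload encoding happen last in the chain.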
Any help is appreciated,
Brian Beard