So I've been thinking about this more, and what seems most plausible is to just use Store.NO for the delimiters, so they will have payloads encoded correctly but not affect the stored data, and to store separate instances of the subId information which can be retrieved at the boundaries. There should be about three subIds on average, so I don't think extracting the separate field instances would be too big a performance hit. Something like the sketch below is what I have in mind.
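Concretely, a minimal sketch of the document layout (the field names "content" and "subId" are just placeholders, not anything from the actual schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SubIdDocumentBuilder {
    public static Document build() {
        Document doc = new Document();

        // Indexed but NOT stored: the analyzed stream keeps the DELIMITER
        // tokens (and their payloads) out of the stored data entirely.
        doc.add(new Field("content",
                "subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1",
                Field.Store.NO, Field.Index.ANALYZED));

        // Stored but NOT indexed: one field instance per subId, so the
        // display values can be matched back to the right subId.
        doc.add(new Field("subId", "subId0Value0 subId0Value1",
                Field.Store.YES, Field.Index.NO));
        doc.add(new Field("subId", "subId1Value0 subId1Value1",
                Field.Store.YES, Field.Index.NO));

        return doc;
    }
}

At display time, searcher.doc(docId).getValues("subId") should hand back the per-subId strings in the order they were added, so with roughly three subIds per document the extra retrieval cost looks small.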
What I find myself wanting is to be able to pass metadata along with the Field() when it is added to the document, and then retrieve that metadata from inside the TokenFilter. I guess this would be similar to adding column stride fields, but with multiple ones at different positions in the document. A sketch of the filter itself follows the quoted message below.

-----Original Message-----
From: Beard, Brian [mailto:brian.be...@mybir.com]
Sent: Thursday, August 19, 2010 2:02 PM
To: java-user@lucene.apache.org
Subject: Tokenization / Analyzer question

I'm using Lucene 2.9.1. I'm indexing documents which correspond to an ID. Each field in the ID document is made up of data from all subIds (it's a requirement that searches must work across all subIds within an ID). They will be indexed and stored in a format similar to:

subId0Value0 subId0Value1 DELIMITER subId1Value0 subId1Value1 DELIMITER ...

This is so the stored value strings can be matched up with the correct subId for display purposes. In addition, there may be more than one type of delimiter to denote the type of data within a subId's data. Moreover, for some of the fields it is desirable to be able to tell which subId the search terms corresponded to.

What I'd like to do is use payloads to encode the delimiter type into the payload, but use the same text string for all delimiter types. The delimiter tokens shouldn't affect searches, because they will be filtered out by the analyzer. This way, during a post-processing step on the returned results, all payloads can be extracted at once by a single term, instead of having to loop through them term by term (I don't see a way to get them on a per-document basis - otherwise I could add a payload to the first term when the subId changes during indexing). The search term positions can then be checked against the delimiter positions to determine the record and the type of data.

I've got an example performing the tokenization and payload encoding correctly, but the stored values are not the ones I want. What I've done is write a BoundaryTokenFilter class which is always instantiated *after* the StandardAnalyzer, WhitespaceAnalyzer, etc. - new BoundaryTokenFilter(new StandardAnalyzer().tokenStream(...)). In order to get the delimiter tokens down to BoundaryTokenFilter intact through the other tokenizers/TokenFilters higher up the chain, I'm using character-based delimiters (a different one for each delimiter type, which also carries some additional encoded information for the payload). Inside BoundaryTokenFilter, the character-based delimiter is decoded, the token is changed over to the single delimiter value, and the payload is added.

That part works, but the problem I'm having is that I would like the delimiter values stored in the index to be the same as the single tokenized delimiter values. Is there any way to do this - transform the stored values into the tokenized values, but only for select tokens?

The other alternative I can think of, if that isn't possible, is to modify StandardAnalyzer to have indexing and search modes, so that during indexing it would let the delimiter tokens pass through and flag them with a TypeAttribute, while in search mode it would not. For this, though, I would most likely end up having to use different delimiter values.
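Here's a stripped-down sketch of the filter against the 2.9 attribute API. The marker prefix ("zzdelim") and the normalized DELIMITER token are made-up placeholders for the real character-based encoding, just to show the mechanics:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class BoundaryTokenFilter extends TokenFilter {
    // Hypothetical stand-ins: markers like "zzdelim0" survive the upstream
    // tokenizers, and all of them normalize to the same DELIMITER token.
    private static final String DELIM_PREFIX = "zzdelim";
    private static final String DELIM_TOKEN = "DELIMITER";

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public BoundaryTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (term.startsWith(DELIM_PREFIX)) {
            // Decode the delimiter type that was encoded in the marker and
            // move it into the payload; the token text itself is rewritten
            // to the single shared delimiter value.
            byte delimType = (byte) (term.charAt(DELIM_PREFIX.length()) - '0');
            payloadAtt.setPayload(new Payload(new byte[] { delimType }));
            termAtt.setTermBuffer(DELIM_TOKEN);
        }
        return true;
    }
}

It would be wired up the way described above, e.g. new BoundaryTokenFilter(analyzer.tokenStream("content", reader)), so the delimiter normalization and payload encoding happen last in the chain.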
Any help is appreciated,
Brian Beard