On 18/10/2011 15:25, Steven A Rowe wrote:
Hi Paul,
On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
On 18/10/2011 06:19, Steven A Rowe wrote:
Another option is to create a char filter that substitutes
PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
etc.,
Yes, that is how I first did it.
No, I don't think you did. When I say "char filter" I'm referring to
CharFilter<http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html>
- this is a different kind of thing from the token filter approach you described taking
previously.
If you look at the code you can see I do use a CharFilter:
NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
specialcharConvertMap.add("!", "Exclamation");
specialcharConvertMap.add("?", "QuestionMark");
...............
public TokenStream tokenStream(String fieldName, Reader reader) {
    CharFilter specialCharFilter = new MappingCharFilter(specialcharConvertMap, reader);
    StandardTokenizer tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
    try
    {
        // Only fall back to the char filter if plain tokenizing yields no tokens
        if(tokenStream.incrementToken()==false)
        {
            tokenStream = new StandardTokenizer(LuceneVersion.LUCENE_VERSION, specialCharFilter);
        }
        else
        {
            //TODO **************** set tokenstream back as it was before increment token
        }
    }
    catch(IOException ioe)
    {
    }
    TokenStream result = new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
    return result;
}
If you go with a CharFilter, you can give it access to the entire input at
once, and use a regular expression (or something like it) to assess the input
and then behave accordingly.
Steve
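To make the suggestion above concrete, here is a minimal sketch in plain java.io and java.util.regex, not tied to Lucene's CharFilter API. It reads the whole input, uses a regular expression to check whether any letter or digit is present, and substitutes punctuation names only when none is. The class name PunctuationOnlyRewriter, its method names, and the choice of "contains a letter or digit" as the test are assumptions for illustration; the two mappings mirror the "!" and "?" entries quoted earlier in the thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class PunctuationOnlyRewriter {
    // Illustrative mapping, mirroring the entries shown in the thread.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('!', "Exclamation");
        REPLACEMENTS.put('?', "QuestionMark");
    }

    // Assumption: input with at least one letter or digit needs no rewriting.
    private static final Pattern HAS_WORD_CHAR = Pattern.compile("[\\p{L}\\p{N}]");

    // Reads the entire input so it can be assessed as a whole.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Leaves ordinary text untouched; rewrites punctuation-only input.
    static String rewrite(String input) {
        if (HAS_WORD_CHAR.matcher(input).find()) {
            return input;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Returns a fresh Reader over the (possibly rewritten) text.
    static Reader wrap(Reader reader) throws IOException {
        return new StringReader(rewrite(readAll(reader)));
    }
}
```

The Reader returned by wrap could then be handed to a tokenizer in place of the original one. Because the whole input is buffered first, nothing has to be "set back" after the check.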
Well, this is the problem: you can't use a regular expression, and even if
you could, wouldn't that really slow things down, seeing as 99% of cases
don't need the transformation?
Paul
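On the cost concern, one possible way to keep the common case cheap is a single-pass scan that stops at the first letter or digit, so ordinary text is usually dismissed after looking at one or two characters and no regular expression is involved. This is only a sketch: the class WordCharCheck and its method name are hypothetical, it assumes the text is already available as a CharSequence, and whether it is fast enough would have to be measured.

```java
// Hypothetical helper, not part of Lucene.
final class WordCharCheck {
    // Returns true as soon as any letter or digit is seen (early exit).
    static boolean hasLetterOrDigit(CharSequence input) {
        int i = 0;
        while (i < input.length()) {
            int codePoint = Character.codePointAt(input, i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

Only input for which hasLetterOrDigit returns false would then need to go through the substitution step.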