Lucene Analyzer that can handle C++ vs C#

2009-12-11 Thread maxSchlein

Can someone please point me in the right direction?

We are creating an application that needs to be able to search on C++ and get
back docs that contain C++.  The StandardAnalyzer does not seem to index
the "+", so a search for "C++" will bring back docs that contain C++, C,
C#, etc.  The WhitespaceAnalyzer will index the "+", but if C++ comes at the
end of a sentence, it will index the term as "C++." so a search for "C++"
will not return the doc.  I have heard that a CustomAnalyzer might help;
however, it seems like there would actually need to be a
CustomFilter/CustomTokenizer.  I looked at:
 - StandardAnalyzer.java
 - StandardFilter.java
 - StandardTokenizer.java
 - StandardTokenizerImpl.java
 - StandardTokenizerImpl.jflex

I would guess that the StandardTokenizer is where the changes would need to
be made to allow the "+" character, but I am unclear as to how.

Any and all help is greatly appreciated.

Going through all the documents and replacing "+" with the word "plus" is
not really an option for us.
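
To make the desired behaviour concrete outside of Lucene, here is a minimal plain-Java sketch: split on whitespace, lowercase, and trim punctuation from the ends of each token while keeping "+" and "#". The class and method names (SymbolAwareTokens, normalize, tokenize) are hypothetical and not part of Lucene; this only illustrates the rule a custom filter would need to apply, not how any Lucene analyzer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: shows the token rule only.
final class SymbolAwareTokens {

    // Lowercase the raw token, then trim punctuation from both ends.
    // '+' and '#' are kept at the end so "C++." becomes "c++" and "C#," becomes "c#".
    static String normalize(String raw) {
        String s = raw.toLowerCase(Locale.ROOT);
        s = s.replaceAll("^[^\\p{L}\\p{N}]+", "");    // leading non-letter/non-digit characters
        s = s.replaceAll("[^\\p{L}\\p{N}+#]+$", "");  // trailing characters other than letters, digits, '+', '#'
        return s;
    }

    // Split on runs of whitespace and drop tokens that normalize to nothing.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String t = normalize(raw);
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```

With this rule, "C++." and "C++" produce the same token, while "C", "C#" and "C++" stay distinct. One limitation of the sketch: leading symbols are always trimmed, so a term like ".NET" would lose its dot.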
-- 
View this message in context: 
http://old.nabble.com/Lucene-Analyzer-that-can-handle-C%2B%2B-vs-C--tp26748041p26748041.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Analyzer that can handle C++ vs C#

2009-12-24 Thread maxSchlein

Here is the solution.  I used a CustomAnalyzer that calls a CustomFilter.

Easy enough, but now I want to use the current version of Lucene, 3.0, and
TokenStream.next() and TokenStream.next(Token) are no longer there.  In 2.9.0
these methods were deprecated, as were Token.setTermText() and
Token.termText().  The newer versions say to use incrementToken() and the
AttributeSource APIs, but I cannot find much help on using them this way.
Any help is again appreciated.

Merry Christmas too.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.*;

public class CustomAnalyzer extends Analyzer
{
    @Override
    public TokenStream tokenStream(final String fieldName, final Reader reader)
    {
        // Split on whitespace only, so "+" and "#" stay inside the tokens.
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new StopFilter(false, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        ts = new LowerCaseFilter(ts);
        ts = new CustomFilter(ts);

        return ts;
    }
}

public class CustomFilter extends TokenFilter
{
    protected CustomFilter(TokenStream tokenStream)
    {
        super(tokenStream);
    }

    @Override
    public Token next(final Token reusableToken) throws IOException
    {
        Token nextToken = input.next(reusableToken);

        if (nextToken != null)
        {
            // Remove the listed punctuation characters wherever they occur in the term.
            nextToken.setTermText(nextToken.termText().replaceAll(":|,|\\(|\\)|“|~|;|&|\\.", ""));
        }
        return nextToken;
    }
}
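
The effect of the regular expression in CustomFilter can be checked without Lucene. The following is a small sketch, assuming a hypothetical helper class PunctuationStripper; it uses a precompiled character class that covers the same nine characters as the posted alternation (the "“" is written as \u201C).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or of the posted filter.
final class PunctuationStripper {

    // Same characters as ":|,|\\(|\\)|“|~|;|&|\\." written as one character class.
    private static final Pattern PUNCT = Pattern.compile("[,:;()&~.\u201C]");

    // Removes every occurrence of the listed characters, not only trailing ones.
    static String strip(String term) {
        return PUNCT.matcher(term).replaceAll("");
    }
}
```

Because the characters are removed anywhere in the term, "c++." becomes "c++" as intended, but a term such as "node.js" also collapses to "nodejs". If that matters, the pattern could be anchored to the start and end of the term instead.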






help customfilter with incrementToken() and AttributeSource APIs

2009-12-24 Thread maxSchlein

In the current version of Lucene, 3.0, the following methods are no longer
available:
   - TokenStream.next()
   - TokenStream.next(Token)
   - Token.setTermText()
   - Token.termText()

The newer versions say to use incrementToken() and the AttributeSource APIs,
but I cannot find much help on using them this way.  Any help is again
appreciated.  If anyone has a basic example, or can point me to something
useful, that would be awesome.  Thanks.

Merry Christmas too.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.*;

public class CustomAnalyzer extends Analyzer
{
    @Override
    public TokenStream tokenStream(final String fieldName, final Reader reader)
    {
        // Split on whitespace only, so "+" and "#" stay inside the tokens.
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new StopFilter(false, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        ts = new LowerCaseFilter(ts);
        ts = new CustomFilter(ts);

        return ts;
    }
}

public class CustomFilter extends TokenFilter
{
    protected CustomFilter(TokenStream tokenStream)
    {
        super(tokenStream);
    }

    @Override
    public Token next(final Token reusableToken) throws IOException
    {
        Token nextToken = input.next(reusableToken);

        if (nextToken != null)
        {
            // Remove the listed punctuation characters wherever they occur in the term.
            nextToken.setTermText(nextToken.termText().replaceAll(":|,|\\(|\\)|“|~|;|&|\\.", ""));
        }
        return nextToken;
    }
}



Re: Lucene Analyzer that can handle C++ vs C#

2009-12-24 Thread maxSchlein

That is awesome.  Just one thing, and forgive me if I sound ignorant: what is
"FastZemberek zemberek"?

Ahmet Arslan wrote:
> 
> Here is the one that uses the new token stream API:
> 
> public final class CustomFilter extends TokenFilter {
> 
>     private final TermAttribute termAtt;
> 
>     public CustomFilter(TokenStream in, FastZemberek zemberek) {
>         super(in);
>         termAtt = (TermAttribute) addAttribute(TermAttribute.class);
>     }
> 
>     public final boolean incrementToken() throws IOException {
>         if (input.incrementToken()) {
>             String term = termAtt.term();
>             String s = term.replaceAll(":|,|\\(|\\)|“|~|;|&|\\.","");
>             if (s != null && !s.equals(term))
>                 termAtt.setTermBuffer(s);
>             return true;
>         } else {
>             return false;
>         }
>     }
> }



Text extraction from ms word doc

2010-01-11 Thread maxSchlein

I am looking for an option for text extraction from a Word doc.

Currently I am using POI; however, when there is a table in the doc, POI
brings back a non-printable character for each column.  The
WhitespaceAnalyzer is not filtering out this character, so whatever word or
phrase comes last within a table column is not found during searching.  That
is, if the word dog is the only word in a column, a search for the word dog
returns nothing, because the term that was indexed was "dog" followed by
that character.

I can create a filter to fix this, using Apache's
StringUtils.isAsciiPrintable, but I would rather not.

Any and all help is welcome and thanked.
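
One option that avoids a custom filter is to clean the extracted text before it reaches the analyzer. The sketch below assumes the stray character is a control character (for example U+0007); the post does not say which character it is, so that is an assumption. ControlCharCleaner and clean are hypothetical names, and only the standard Java library is used.

```java
// Hypothetical pre-processing helper, not part of POI or Lucene.
final class ControlCharCleaner {

    // Replace every control character except tab, line feed and carriage
    // return with a space, so a whitespace-based tokenizer splits on it.
    static String clean(String extracted) {
        return extracted.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", " ");
    }
}
```

The cleaned string can then be indexed as usual. Replacing with a space rather than deleting keeps the last word of one column from running into the first word of the next.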



Controlling what is indexed / normalizing our index

2010-02-15 Thread maxSchlein

We have a list of keywords with aliases (Example: keyword = "ms access",
aliases = "microsoft access", "msaccess", "m.s. access").

We would like to intercept the aliases before they are indexed, and have
the keyword indexed instead.  We can do this with a CustomFilter for
single-word aliases.  (Example: in the filter, token = "access", we change
the value to "msaccess".)  Our problem is when the token equals "microsoft":
we need to find out whether the next token is "access" or not, that is,
whether the pair matches one of our aliases.

Has anyone had an issue like this?  Any and all help is appreciated.  Thanks.
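
The look-ahead itself can be sketched in plain Java, independent of Lucene: at each position, try the longest alias that starts there and fall back to shorter ones. AliasNormalizer, addAlias and normalize are hypothetical names for illustration. Inside a TokenFilter the same idea would need the filter to buffer upcoming tokens, which this sketch does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: replaces single- or multi-word aliases with a keyword.
final class AliasNormalizer {

    private final Map<List<String>, String> aliases = new HashMap<>();
    private int maxAliasLength = 1;

    // Register an alias (words separated by single spaces) for a keyword.
    void addAlias(String alias, String keyword) {
        List<String> words = Arrays.asList(alias.split(" "));
        aliases.put(words, keyword);
        maxAliasLength = Math.max(maxAliasLength, words.size());
    }

    // Scan left to right; at each position prefer the longest matching alias.
    List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            int longest = Math.min(maxAliasLength, tokens.size() - i);
            for (int len = longest; len >= 1 && !matched; len--) {
                String keyword = aliases.get(tokens.subList(i, i + len));
                if (keyword != null) {
                    out.add(keyword);   // emit the keyword in place of the alias
                    i += len;           // skip all tokens the alias covered
                    matched = true;
                }
            }
            if (!matched) {
                out.add(tokens.get(i)); // no alias starts here; keep the token
                i++;
            }
        }
        return out;
    }
}
```

For example, after addAlias("microsoft access", "ms access"), the tokens "microsoft", "access" are replaced by the single entry "ms access", while "microsoft", "word" pass through unchanged.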