RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
> > - > > Uwe Schindler > > H.-H.-Meier-Allee 63, D-28213 Bremen > > http://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > > > -Original Message- > > > From: Vihari Piratla [mailto:viharipira...@gmail.com] > > >

Re: Custom tokenizer

2015-01-12 Thread Vihari Piratla
.com] > > Sent: Monday, January 12, 2015 8:51 AM > > To: java-user@lucene.apache.org > > Subject: Custom tokenizer > > > > Hi, > > I am trying to implement a custom tokenizer for my application and I have > > few queries regarding the same. > > 1. Is

RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
iratla [mailto:viharipira...@gmail.com] > Sent: Monday, January 12, 2015 8:51 AM > To: java-user@lucene.apache.org > Subject: Custom tokenizer > > Hi, > I am trying to implement a custom tokenizer for my application and I have > few queries regarding the same. > 1. Is ther

Custom tokenizer

2015-01-11 Thread Vihari Piratla
Hi, I am trying to implement a custom tokenizer for my application and I have a few queries regarding the same. 1. Is there a way to provide an existing analyzer (say EnglishAnalyzer) the custom tokenizer and make it use this tokenizer instead of, say, StandardTokenizer? 2. Why are analyzers such as

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
wrote: Hi, I have a requirement to write a custom tokenizer using the Lucene framework. My requirement is that it should be able to match multiple words as one token. For example, when a user passes a String such as International Business machine logo or IBM logo, it should return International Business

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Michael McCandless
If you already know the set of phrases you need to detect then you can use Lucene's SynonymFilter to spot them and insert a new token. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies wrote: > It sounds like you've been asked to implement Named E
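Mike's suggestion relies on Lucene's SynonymFilter, whose builder API varies across Lucene versions. The core idea the filter applies, greedy longest-match phrase spotting over a token stream, can be sketched in plain Java without any Lucene dependency (class and method names here are illustrative, not Lucene's):

```java
import java.util.*;

// A minimal sketch of the phrase-spotting idea behind SynonymFilter:
// scan a whitespace token stream and, when a known multi-word phrase
// starts at the current token, emit the whole phrase as a single token.
// Longest match wins, as with overlapping synonym entries.
public class PhraseSpotter {
    private final Set<List<String>> phrases = new HashSet<>();
    private int maxLen = 1;

    public PhraseSpotter(Collection<List<String>> knownPhrases) {
        for (List<String> p : knownPhrases) {
            phrases.add(p);
            maxLen = Math.max(maxLen, p.size());
        }
    }

    public List<String> tokenize(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            // Try the longest possible phrase first, then shorter ones.
            for (int len = Math.min(maxLen, words.length - i); len >= 2; len--) {
                if (phrases.contains(Arrays.asList(words).subList(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add(String.join(" ", Arrays.asList(words).subList(i, i + matched)));
                i += matched;
            } else {
                out.add(words[i++]);
            }
        }
        return out;
    }
}
```

A real SynonymFilter would also maintain position and offset attributes on the inserted token; this sketch only shows the token text.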

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote: > On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar > wrote: > > Hi, > > > My requirement

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Yann-Erwan Perio
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote: Hi, > My requirement is it should have capabilities to match multiple words as > one token. for example. When user passes String as International Business > machine logo or IBM logo it should return International Business Machine as > one tok

Custom Tokenizer/Analyzer

2014-02-20 Thread Geet Gangwar
Hi, I have a requirement to write a custom tokenizer using the Lucene framework. My requirement is that it should be able to match multiple words as one token. For example, when a user passes a String such as International Business machine logo or IBM logo, it should return International Business Machine

Re: Custom Tokenizer

2013-12-05 Thread Erick Erickson
You can also string together one of a myriad of TokenFilters, see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters I'd recommend spending some time on the admin/analysis page to understand what all the combinations do. I'd also recommend against dealing with punctuation etc by using wi

Re: Custom Tokenizer

2013-12-05 Thread Furkan KAMACI
Hi; StandardAnalyzer includes these by default: StandardFilter, LowerCaseFilter and StopFilter. You can also consider char filters. Did you read here: https://cwiki.apache.org/confluence/display/solr/CharFilterFactories Thanks; Furkan KAMACI 2013/12/5 > Hi, > > I have used StandardAnalyzer in

Custom Tokenizer

2013-12-05 Thread raghavendra.k.rao
Hi, I have used StandardAnalyzer in my code and it is working fine. One of the challenges I face is that this Analyzer by default tokenizes on some special characters, such as the hyphen, in addition to the SPACE character. I want to tokenize only on the SPACE character. Could you please
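Lucene ships a whitespace-only alternative (WhitespaceAnalyzer / WhitespaceTokenizer), which is the usual answer to this question. The behavioral difference can be sketched in plain Java, independent of Lucene (the class name here is illustrative):

```java
import java.util.*;

// Splitting only on runs of whitespace keeps hyphenated terms intact,
// whereas StandardTokenizer-style rules also break mixed alphanumeric
// tokens at the hyphens.
public class SpaceOnlyTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }
}
```

With this rule a reference like 310N-P-Q (from the 2006 thread below) survives as a single token; StandardAnalyzer would split it into several.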

TermPositions with custom Tokenizer

2009-10-01 Thread Christopher Tignor
Hello, I have created a custom Tokenizer and am trying to set and extract my own positions for each Token using: reusableToken.reinit(word.getWord(),tokenStart,tokenEnd); later when querying my index using a SpanTermQuery the start() and end() tags don't correspond to these values but se
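A frequent source of the mismatch described here is that span queries report token *positions* (indices in the token stream), while the values set at analysis time are character *offsets*; the two live in separate attributes and are not interchangeable. The distinction, sketched in plain Java (names are illustrative, not Lucene's):

```java
import java.util.*;

// A token carries both a position (its index in the stream) and character
// offsets (where it begins and ends in the original text). In Lucene these
// are distinct: span queries report positions, while character offsets are
// stored in a token's offset attribute.
public class OffsetTokenizer {
    public record Token(String term, int position, int startOffset, int endOffset) {}

    public static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0, i = 0;
        while (i < text.length()) {
            // Skip whitespace, then consume one token and record its offsets.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) tokens.add(new Token(text.substring(start, i), pos++, start, i));
        }
        return tokens;
    }
}
```

For the input "my custom token", the term "token" has position 2 but start/end offsets 10 and 15; comparing a span's start()/end() against offsets like these will never match.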

RE: Token offset values for custom Tokenizer

2007-07-16 Thread Ard Schrijvers
Hello, The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the issue is already addressed and solved... Regards Ard > > Thank you for the reply Ard, > > The tokens exist in the index and are returned accurately, except for > the offsets. In this case I am not dealing with

Re: Token offset values for custom Tokenizer

2007-07-16 Thread Shahan Khatchadourian
The issue continues to exist with nightly 146 from Jul 10, 2007. http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/ Ard Schrijvers wrote: Hello, The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the issue is already addressed and solved... Regards Ard

Re: Token offset values for custom Tokenizer

2007-07-16 Thread Shahan Khatchadourian
Thank you for the reply Ard, The tokens exist in the index and are returned accurately, except for the offsets. In this case I am not dealing with the positions, so the termvector is specified as using 'with_offsets'. I have left the term position increment at its default. Looking at the exist

RE: Token offset values for custom Tokenizer

2007-07-16 Thread Ard Schrijvers
Hello, > Hi, > I am storing custom values in the Tokens provided by a Tokenizer but > when retrieving them from the index the values don't match. What do you mean by retrieving? Do you mean retrieving terms, or do you mean doing a search with words you know that should be in, but you do not fi

Token offset values for custom Tokenizer

2007-07-13 Thread Shahan Khatchadourian
Hi, I am storing custom values in the Tokens provided by a Tokenizer but when retrieving them from the index the values don't match. I've looked in the LIA book but it's not current since it mentioned term vectors aren't stored. I'm using Lucene Nightly 146 but the same thing has happened with

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
Because I wanted to use the javaCC input code from Lucene. 99.99% of what the standard parser did was VERY GOOD. Having worked with computer-generated compilers in the past, I realized that if I were to modify the parser itself, I would eventually get into real trouble. So I took the time to

Re: Installing a custom tokenizer

2006-08-29 Thread yueyu lin
Your problem is that StandardTokenizer doesn't fit your requirements. Since you know how to implement a new one, just do it. If you just want to modify StandardTokenizer, you can get the code and rename it to your class, then modify whatever you dislike. I think it's such a simple task, why

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote: 2. The ParseException that is generated when making the StandardAnalyzer must be killed because there is another ParseException class (maybe in queryparser?) that must be used instead. The lucene build file excludes the StandardAnalyzer Parse

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream(

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I have copied Lucene's StandardTokenizer.jj into my directory, renamed it, and did a global change of the names to my class name, LogTokenizer. The issue is that the generated LogTokenizer.java does not compile for 2 reasons: 1) in the constructor, this(new FastCharStream(reader)); fails bec

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
Tucked away in the contrib section of Lucene (I'm using 2.0) there is org.apache.lucene.index.memory.PatternAnalyzer, which takes a regular expression and tokenizes with it. Would that help? Word of warning... the regex determines what is NOT a token, not what IS a token (as I remember),
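Erick's warning is worth making concrete: a pattern can describe either the separators (what is NOT a token, as PatternAnalyzer does) or the tokens themselves (what IS a token). The two interpretations, sketched in plain Java with java.util.regex (class and method names are illustrative):

```java
import java.util.*;
import java.util.regex.*;

// Two ways to drive a tokenizer with a regex. Split mode treats the
// pattern as the separator (PatternAnalyzer-style); match mode treats
// it as the token itself. The same input can tokenize very differently
// under the two interpretations.
public class RegexTokenizer {
    // Pattern describes separators: everything between matches is a token.
    public static List<String> splitMode(String text, String sepPattern) {
        List<String> out = new ArrayList<>();
        for (String t : Pattern.compile(sepPattern).split(text)) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Pattern describes tokens: every match is a token.
    public static List<String> matchMode(String text, String tokenPattern) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }
}
```

For the government-document case below, splitting on `\s+` (or matching `[A-Za-z0-9-]+`) keeps a reference like 310N-P-Q whole, while splitting on `[^A-Za-z0-9]+` breaks it apart.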

Re: Installing a custom tokenizer

2006-08-29 Thread Mark Miller
Bill Taylor wrote: On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you ri

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote: : Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/ PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. As I understand

Re: Installing a custom tokenizer

2006-08-29 Thread Chris Hostetter
: Have a look at PerFieldAnalyzerWrapper: : http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html ...which can be specified in the constructors for IndexWriter and QueryParser. -Hoss --
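What PerFieldAnalyzerWrapper contributes is simply per-field dispatch: a default analyzer plus a map of field-name overrides, consulted by field name at index and query time. The mechanism can be sketched in plain Java, with `String -> List<String>` tokenizer functions standing in for analyzers (names here are illustrative, not Lucene's):

```java
import java.util.*;
import java.util.function.Function;

// The dispatch idea behind PerFieldAnalyzerWrapper: keep a default
// tokenizer and a per-field override map, and pick the tokenizer by
// field name.
public class PerFieldTokenizer {
    private final Function<String, List<String>> defaultTokenizer;
    private final Map<String, Function<String, List<String>>> overrides;

    public PerFieldTokenizer(Function<String, List<String>> defaultTokenizer,
                             Map<String, Function<String, List<String>>> overrides) {
        this.defaultTokenizer = defaultTokenizer;
        this.overrides = overrides;
    }

    public List<String> tokenize(String field, String text) {
        // Fall back to the default when the field has no override.
        return overrides.getOrDefault(field, defaultTokenizer).apply(text);
    }
}
```

This is why the wrapper answers Bill's problem: a hypothetical "docref" field can keep 310N-P-Q whole while every other field keeps the standard analysis.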

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote: I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. that almos

Re: Installing a custom tokenizer

2006-08-29 Thread Erick Erickson
I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up. Same kind of thing for a Query. Erick On 8/29/06, Bill Taylor <[EM

Re: Installing a custom tokenizer

2006-08-29 Thread Ronnie Kolehmainen
ss this tokenstream through other filters you are > > interested in */ > > } > > } > > > > Krovi. > > > > -Original Message- > > From: Bill Taylor [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, August 29, 2006 8:10 PM > > To:

Re: Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
interested in */ } } Krovi. -Original Message- From: Bill Taylor [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 29, 2006 8:10 PM To: java-user@lucene.apache.org Subject: Installing a custom tokenizer I am indexing documents which are filled with government jargon. As one would expect

RE: Installing a custom tokenizer

2006-08-29 Thread Krovi, DVSR_Sarma
ubject: Installing a custom tokenizer I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmenteese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer break

Installing a custom tokenizer

2006-08-29 Thread Bill Taylor
I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmenteese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks this "word" at the dashes so