> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
-----Original Message-----
From: Vihari Piratla [mailto:viharipira...@gmail.com]
Sent: Monday, January 12, 2015 8:51 AM
To: java-user@lucene.apache.org
Subject: Custom tokenizer
Hi,
I am trying to implement a custom tokenizer for my application and I have
a few queries regarding the same.
1. Is there a way to provide an existing analyzer (say EnglishAnalyzer)
with the custom tokenizer and make it use this tokenizer instead of, say,
StandardTokenizer?
2. Why are analyzers such as
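On question 1: the stock analyzers hard-wire their tokenizer, so the usual answer is to write your own Analyzer that builds your tokenizer followed by the same filter chain (in recent Lucene via Analyzer#createComponents). The following self-contained sketch, with all class and method names invented for illustration (no Lucene dependency), shows the composition pattern:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class PipelineSketch {
    // A "tokenizer" turns raw text into tokens; "filters" transform the token list.
    interface Tokenizer extends Function<String, List<String>> {}
    interface TokenFilter extends Function<List<String>, List<String>> {}

    // Mimics Analyzer#createComponents: a fixed filter chain over a pluggable tokenizer.
    static List<String> analyze(String text, Tokenizer tokenizer, List<TokenFilter> filters) {
        List<String> tokens = tokenizer.apply(text);
        for (TokenFilter f : filters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Tokenizer whitespace = s -> Arrays.asList(s.trim().split("\\s+"));
        TokenFilter lowercase = ts -> {
            List<String> out = new ArrayList<>();
            for (String t : ts) out.add(t.toLowerCase(Locale.ROOT));
            return out;
        };
        // Swapping in a different Tokenizer leaves the filter chain untouched.
        System.out.println(analyze("Custom Tokenizer DEMO", whitespace, List.of(lowercase)));
        // [custom, tokenizer, demo]
    }
}
```

In real Lucene you would subclass Analyzer, construct your Tokenizer in createComponents, and wrap it with the same filters the stock analyzer uses.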
If you already know the set of phrases you need to detect then you can
use Lucene's SynonymFilter to spot them and insert a new token.
Mike McCandless
http://blog.mikemccandless.com
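Mike's SynonymFilter suggestion amounts to spotting known word sequences in the token stream and emitting a single token for each. Here is a dependency-free Java sketch of that idea (the class name, phrase table, and greedy longest-match strategy are illustrative, not SynonymFilter's actual internals):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PhraseSpotter {
    // Known multi-word phrases mapped to the single token to emit.
    private final Map<List<String>, String> phrases;
    private final int maxLen;

    PhraseSpotter(Map<List<String>, String> phrases) {
        this.phrases = phrases;
        int m = 1;
        for (List<String> p : phrases.keySet()) m = Math.max(m, p.size());
        this.maxLen = m;
    }

    List<String> tokenize(String text) {
        List<String> words = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            boolean matched = false;
            // Greedy longest match starting at position i.
            for (int len = Math.min(maxLen, words.size() - i); len >= 2 && !matched; len--) {
                String mapped = phrases.get(words.subList(i, i + len));
                if (mapped != null) {
                    out.add(mapped);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) out.add(words.get(i++));
        }
        return out;
    }

    public static void main(String[] args) {
        PhraseSpotter spotter = new PhraseSpotter(Map.of(
            List.of("international", "business", "machine"), "international business machine"));
        System.out.println(spotter.tokenize("International Business Machine logo"));
        // [international business machine, logo]
    }
}
```

In a real Lucene chain you would build a SynonymMap and let SynonymFilter do the matching rather than hand-rolling it.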
On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies wrote:
It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote:
Hi,
I have a requirement to write a custom tokenizer using the Lucene framework.
It should have the capability to match multiple words as one token. For
example, when the user passes a string such as "International Business
Machine logo" or "IBM logo", it should return "International Business
Machine" as one token.
You can also string together one of a myriad of TokenFilters; see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
I'd recommend spending some time on the admin/analysis page
to understand what all the combinations do. I'd also recommend
against dealing with punctuation etc. by using wi
Hi;
StandardAnalyzer includes those by default:
StandardFilter, LowerCaseFilter and StopFilter.
You can also consider char filters. Did you read here:
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
Thanks;
Furkan KAMACI
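Char filters, as Furkan notes, rewrite the raw character stream before the tokenizer ever sees it. A minimal dependency-free sketch of the idea follows; the specific mappings are made up for illustration, and Lucene's MappingCharFilter does this properly, including offset correction:

```java
import java.util.Arrays;

public class CharFilterSketch {
    // A char filter runs BEFORE tokenization: it rewrites characters so the
    // tokenizer sees normalized input. Example mappings: typographic
    // apostrophe -> ASCII apostrophe, '&' -> " and ".
    static String charFilter(String raw) {
        return raw.replace('\u2019', '\'').replace("&", " and ");
    }

    public static void main(String[] args) {
        String filtered = charFilter("AT&T\u2019s filings");
        // The tokenizer then splits the already-normalized text.
        System.out.println(Arrays.asList(filtered.trim().split("\\s+")));
        // [AT, and, T's, filings]
    }
}
```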
2013/12/5
Hi,
I have used StandardAnalyzer in my code and it is working fine. One of the
challenges I face is that this Analyzer by default tokenizes on some special
characters such as the hyphen, apart from the SPACE character.
I want to tokenize only on the SPACE character. Could you please
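Splitting only on spaces is essentially what Lucene's WhitespaceAnalyzer/WhitespaceTokenizer do, rather than StandardAnalyzer. A dependency-free sketch of that behavior (class name illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceOnlyTokenizer {
    // Split ONLY on runs of spaces, so hyphenated terms survive intact.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split(" +")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("state-of-the-art search engine"));
        // [state-of-the-art, search, engine]
    }
}
```

StandardTokenizer would break "state-of-the-art" at the hyphens; a whitespace-only split keeps it as one token.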
Hello,
I have created a custom Tokenizer and am trying to set and extract my own
positions for each Token using:
reusableToken.reinit(word.getWord(), tokenStart, tokenEnd);
later, when querying my index using a SpanTermQuery, the start() and end()
values don't correspond to these values but se
Hello,
The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the
issue is already addressed and solved...
Regards Ard
The issue continues to exist with nightly 146 from Jul 10, 2007.
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/
Thank you for the reply Ard,
The tokens exist in the index and are returned accurately, except for
the offsets. In this case I am not dealing with the positions, so the
term vector is specified as using 'with_offsets'. I have left the term
position increment at its default. Looking at the exist
Hello,
> Hi,
> I am storing custom values in the Tokens provided by a Tokenizer but
> when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving terms, or do you
mean doing a search with words you know should be in, but you do not find?
Hi,
I am storing custom values in the Tokens provided by a Tokenizer but
when retrieving them from the index the values don't match. I've looked
in the LIA book but it's not current since it mentioned term vectors
aren't stored. I'm using Lucene Nightly 146 but the same thing has
happened with
Because I wanted to use the JavaCC input code from Lucene. 99.99% of
what the standard parser did was VERY GOOD. Having worked with
computer-generated compilers in the past, I realized that if I were to
modify the parser itself, I would eventually get into real trouble. So
I took the time to
Your problem is that StandardTokenizer doesn't fit your requirements.
Since you know how to implement a new one, just do it.
If you just want to modify StandardTokenizer, you can get the code and
rename it to your class, then modify whatever you dislike. I think
it's such simple stuff, why
On Aug 29, 2006, at 7:12 PM, Mark Miller wrote:
2. The ParseException that is generated when making the
StandardAnalyzer must be killed because there is another
ParseException class (maybe in queryparser?) that must be used
instead. The lucene build file excludes the StandardAnalyzer
Parse
Bill Taylor wrote:
I have copied Lucene's StandardTokenizer.jj into my directory, renamed
it, and did a global change of the names to my class name,
LogTokenizer.
The issue is that the generated LogTokenizer.java does not compile for
2 reasons:
1) in the constructor, this(new FastCharStream(reader)); fails bec
Tucked away in the contrib section of Lucene (I'm using 2.0) there is
org.apache.lucene.index.memory.PatternAnalyzer
which takes a regular expression and tokenizes with it. Would that help?
Word of warning... the regex determines what is NOT a token, not what IS a
token (as I remember).
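That warning is worth a concrete illustration: with split-style APIs the regex describes the separators between tokens, not the tokens themselves. A plain-Java demonstration (no Lucene involved):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternSplitDemo {
    // The pattern matches the SEPARATORS (what is NOT a token): everything
    // between matches becomes a token, as with String.split / Pattern.split.
    private static final Pattern NON_TOKEN = Pattern.compile("[^A-Za-z0-9]+");

    static String[] tokens(String text) {
        return NON_TOKEN.split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.asList(tokens("foo, bar!! baz")));
        // [foo, bar, baz]
    }
}
```

So to keep hyphens inside tokens, the separator pattern must simply not match '-'.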
On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
: Have a look at PerFieldAnalyzerWrapper:
:
http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
...which can be specified in the constructors for IndexWriter and
QueryParser.
-Hoss
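PerFieldAnalyzerWrapper is essentially a default analyzer plus a map of per-field overrides. A dependency-free sketch of that dispatch (the class name and the list-of-strings analyzer representation are illustrative, not Lucene's API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    // Mimics PerFieldAnalyzerWrapper: a default analyzer plus per-field overrides.
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    List<String> analyze(String field, String text) {
        // Look up the field's analyzer, falling back to the default.
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        Function<String, List<String>> lowercasing =
            s -> Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\s+"));
        Function<String, List<String>> verbatim = List::of; // whole value as one token
        PerFieldSketch wrapper = new PerFieldSketch(lowercasing, Map.of("id", verbatim));
        System.out.println(wrapper.analyze("body", "Hello World")); // [hello, world]
        System.out.println(wrapper.analyze("id", "DOC-42"));        // [DOC-42]
    }
}
```

Passing the real PerFieldAnalyzerWrapper to IndexWriter and QueryParser, as Hoss says, gives exactly this per-field behavior at index and query time.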
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.
Same kind of thing for a Query.
Erick
On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:
/* pass this tokenstream through other filters you are
interested in */
}
}
Krovi.
-----Original Message-----
From: Bill Taylor [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer
I am indexing documents which are filled with government jargon. As
one would expect, the standard tokenizer has problems with
governmenteese.
In particular, the documents use words such as 310N-P-Q as references
to other documents. The standard tokenizer breaks this "word" at the
dashes so
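One way to keep reference codes like 310N-P-Q intact is to define tokens positively, as runs of letters and digits joined by internal dashes. A self-contained sketch of such a tokenizer follows (the class name and pattern are illustrative, not Lucene's StandardTokenizer grammar):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JargonTokenizer {
    // A token is a run of letters/digits, optionally joined by internal
    // dashes, so a reference such as 310N-P-Q survives as one token while
    // surrounding punctuation is still dropped.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("See document 310N-P-Q, section 4."));
        // [See, document, 310N-P-Q, section, 4]
    }
}
```

Wrapped in a Lucene Tokenizer subclass, the same pattern-driven scan would replace StandardTokenizer for these documents.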