Re: AlphaNumeric analyzer/tokenizer

2019-08-19 Thread Martin Grigorov
Hi, On Mon, Aug 19, 2019 at 9:31 AM Uwe Schindler wrote: > You already got many responses. Check your inbox. > "many" made me think that I've also missed something. https://markmail.org/message/ohv5qcvxilj3n3fb > > Uwe > > Am August 19, 2019 6:23:20 AM UTC schrieb Abhishek Chauhan < > abhishe

Re: AlphaNumeric analyzer/tokenizer

2019-08-18 Thread Uwe Schindler
You already got many responses. Check your inbox. Uwe Am August 19, 2019 6:23:20 AM UTC schrieb Abhishek Chauhan : >Hi, > >Can someone please check the above mail and provide some feedback? > >Thanks and Regards, >Abhishek > >On Fri, Aug 16, 2019 at 2:52 PM Abhishek Chauhan < >abhishek.chauhan...

Re: AlphaNumeric analyzer/tokenizer

2019-08-18 Thread Abhishek Chauhan
Hi, Can someone please check the above mail and provide some feedback? Thanks and Regards, Abhishek On Fri, Aug 16, 2019 at 2:52 PM Abhishek Chauhan < abhishek.chauhan...@gmail.com> wrote: > Hi, > > We have been using SimpleAnalyzer which keeps only letters in its tokens. > This limits us to se

RE: AlphaNumeric analyzer/tokenizer

2019-08-16 Thread Uwe Schindler
analysis chain. Use PatternTokenizerFactory as the tokenizer and add stuff like LowerCaseFilterFactory and you are done. No need for any new components in Lucene. It's all there, RTFM 😊 https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
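
A minimal sketch of the chain Uwe describes, assuming Lucene 8.x with the analyzers-common module on the classpath; the regex and factory settings here are illustrative, not part of the original message (imports omitted, as in the other snippets in this listing):

    // Keep runs of letters and digits as tokens (so "axt1234" survives as one token),
    // then lower-case them. "pattern" and "lowercase" are the SPI names of
    // PatternTokenizerFactory and LowerCaseFilterFactory.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("pattern", "pattern", "[A-Za-z0-9]+", "group", "0")
        .addTokenFilter("lowercase")
        .build();   // build() declares IOException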

AlphaNumeric analyzer/tokenizer

2019-08-16 Thread Abhishek Chauhan
Hi, We have been using SimpleAnalyzer which keeps only letters in its tokens. This limits us when searching in strings that contain both letters and numbers, e.g. "axt1234". SimpleAnalyzer would only enable us to search for "axt" successfully, but search strings like "axt1", "axt123" etc would giv

Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
Hi, I am looking for a tokenizer, where I could specify a delimiter by which the words are tokenized, for example if I choose the delimiters as ' ' and '_' the following string: "foo__bar doo" would be tokenized into: "foo", "", "bar",

Re: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
dler > Achterdiek 19, D-28357 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Armins Stepanjans [mailto:armins.bagr...@gmail.com] > > Sent: Monday, January 8, 2018 2:09 PM > > To: java-user@lucene.apache.org

RE: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Uwe Schindler
9, D-28357 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Armins Stepanjans [mailto:armins.bagr...@gmail.com] > Sent: Monday, January 8, 2018 2:09 PM > To: java-user@lucene.apache.org > Subject: Re: Looking For Tokenizer With Custom Deli

Re: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
ucene-queryparser 7.1.0 Regards, Armīns On Mon, Jan 8, 2018 at 12:53 PM, Uwe Schindler wrote: > Moin, > > Plain easy to do customize with lambdas! E.g., an elegant way to create a > tokenizer which behaves exactly as WhitespaceTokenizer and LowerCaseFilter > i

RE: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Uwe Schindler
Moin, Plain easy to do customize with lambdas! E.g., an elegant way to create a tokenizer which behaves exactly as WhitespaceTokenizer and LowerCaseFilter is: Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace, Character::toLowerCase); Adjust with Lambdas and
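
Applied to the delimiters from the original question (' ' and '_'), the same idea might look like the lines below. The two-argument form with a normalizer matches the 7.x API Uwe quotes; later releases changed these factory methods, so check the CharTokenizer javadocs for your version:

    // Tokens are maximal runs of characters that are neither a space nor an underscore;
    // each kept character is lower-cased on the way through.
    Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
        c -> c == ' ' || c == '_',
        Character::toLowerCase);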

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-21 Thread Trejkaz
On Wed, Dec 21, 2016 at 11:23 PM, suriya prakash wrote: > Hi, > > Thanks for your reply. > > I might have one or more emailds in a single record. Just so you know, you can add the same field more than once with the field analysed by KeywordAnalyzer, and it will still become multiple tokens. This
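
A sketch of that indexing-side idea, with hypothetical field names and placeholder addresses: one field analyzed with KeywordAnalyzer keeps each whole address as a single token, another field splits it, and both are simply added once per address:

    Map<String, Analyzer> perField = new HashMap<>();
    perField.put("email_exact", new KeywordAnalyzer());   // whole address = one token per value
    Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

    Document doc = new Document();
    for (String addr : Arrays.asList("user1@example.com", "user2@example.com")) {
        doc.add(new TextField("email_exact", addr, Field.Store.NO));
        doc.add(new TextField("email_parts", addr, Field.Store.NO));
    }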

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-21 Thread suriya prakash
Hi, Thanks for your reply. I might have one or more email ids in a single record. So I have to index it with a whitespace analyser after filtering the emailid alone (maybe using an email id tokenizer). Tokenization will happen twice (for normal indexing and for the special emailid field indexing) which is

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Trejkaz
On Wed, Dec 21, 2016 at 1:21 AM, Ahmet Arslan wrote: > Hi, > > You can index whole address in a separate field. > Otherwise, how would you handle positions of the split tokens? > > By the way, speed of phrase search may be just fine, so consider trying first. Speed aside, phrase search is difficu

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Ahmet Arslan
Hi, You can index whole address in a separate field. Otherwise, how would you handle positions of the split tokens? By the way, speed of phrase search may be just fine, so consider trying first. Ahmet On Tuesday, December 20, 2016 5:15 PM, suriya prakash wrote: Hi, I am using standard anal

Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread suriya prakash
Hi, I am using standard analyzer and want to split token for email_id " luc...@gmail.com" as "lucene", "gmail","com","luc...@gmail.com" in a single pass. I have already changed jflex to split email id as separate words(lucene, gmail, com). But we need to do phrase search which will not be efficie

Re: Exclusion List for standard tokenizer

2016-11-18 Thread lukes
Actually ClassicTokenizer seems to do the job. Any side effects of using ClassicTokenizer rather than StandardTokenizer? Regards.

Exclusion List for standard tokenizer

2016-11-18 Thread lukes
Is there a way I can provide that input to StandardTokenizer? I tried to look into the source code, but seem to have got lost. Any pointer is really appreciated. Regards.

help camelcase tokenizer

2016-11-16 Thread Andres Fernando Wilches Riano
Hello I am indexing java source code files. I need to know how to index or tokenize camel case words in identifiers, method names, classes, etc., e.g. getSystemRequirements. I am using lucene 3.0.1. Thank you, -- Sincerely, *Andrés Fernando Wilches Riaño* Systems and Computing Engineer

migrating custom analyzer/tokenizer (3.6-> 6.x)

2016-09-08 Thread Dirk Rothe
g approach with some ugly indirections: Capture the active reader in Analyzer.initReader() and access it via callback in the Tokenizer. class Tokenizer6(PythonTokenizer): def __init__(self, getReader): # callable for retrieving current reader self.getReader = getReader sel

identifier n-gram tokenizer

2016-01-11 Thread Michal Hlavac
Hello, I published some token filters that can be used to tokenize some kinds of identifiers into punctuation-delimited n-grams (e.g. an IP address). I think it needs some optimization, but it works for now. https://github.com/hlavki/lucene-analyzers You can find an example of usage in the unit test: https

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Koji Sekiguchi
hub.com/INL/BlackLab/wiki/Blacklab-query-tool -- Jack Krupansky On Tue, Feb 24, 2015 at 1:40 AM, Koji Sekiguchi wrote: Hello, Doesn't Lucene have a Tokenizer/Analyzer for Brown Corpus? There doesn't seem to be such tokenizers/analyzers in Lucene. As I didn't want re-inventing th

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Jack Krupansky
0 AM, Koji Sekiguchi wrote: > Hello, > > Doesn't Lucene have a Tokenizer/Analyzer for Brown Corpus? > There doesn't seem to be such tokenizers/analyzers in Lucene. > > As I didn't want re-inventing the wheel, so I googled, I got > the list of sn

Tokenizer for Brown Corpus?

2015-02-23 Thread Koji Sekiguchi
Hello, Doesn't Lucene have a Tokenizer/Analyzer for the Brown Corpus? There don't seem to be such tokenizers/analyzers in Lucene. As I didn't want to re-invent the wheel, I googled and got a list of snippets that include "the quick br

Re: URL/Email tokenizer

2015-02-17 Thread Ian Lea
>> > We have a requirement in that E-mail addresses need to be added in a >> > tokenized form to one field while untokenized form is added to another >> field >> > >> > Ex: >> > >> > "I have mailed a...@xyz.com" . It should tokenize

Re: URL/Email tokenizer

2015-02-17 Thread Ravikumar Govindarajan
> > tokenized form to one field while untokenized form is added to another > field > > > > Ex: > > > > "I have mailed a...@xyz.com" . It should tokenize as below > > > > body = {"I", "have", "mailed", "abc", "

Re: URL/Email tokenizer

2015-02-17 Thread Ian Lea
added to another field > > Ex: > > "I have mailed a...@xyz.com" . It should tokenize as below > > body = {"I", "have", "mailed", "abc", "xyz", "com"}; > > I also have a body-addr field. Tokenizer need

URL/Email tokenizer

2015-02-17 Thread Ravikumar Govindarajan
ot;, "xyz", "com"}; I also have a body-addr field. Tokenizer needs to extract e-mail addresses from body field and add them as below body-addr = {"a...@xyz.com"} How to achieve this via tokenizer chain? -- Ravi

RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
> Extending an existing Analyzer is not useful, because it is just a > > factory that returns a TokenStream instance to consumers. If you want > > to change the Tokenizer of an existing Analyzer, just clone it and > > rewrite its > > createComponents() method, see the examp

Re: Custom tokenizer

2015-01-12 Thread Vihari Piratla
mers. If you want to change the > Tokenizer of an existing Analyzer, just clone it and rewrite its > createComponents() method, see the example in the Javadocs: > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/analysis/Analyzer.html > > If you want to add additional Tok

RE: Custom tokenizer

2015-01-12 Thread Uwe Schindler
Hi, Extending an existing Analyzer is not useful, because it is just a factory that returns a TokenStream instance to consumers. If you want to change the Tokenizer of an existing Analyzer, just clone it and rewrite its createComponents() method, see the example in the Javadocs: http
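
The pattern Uwe is pointing at, roughly, on the current Analyzer API; the tokenizer and filter chosen here are placeholders for whatever the cloned analyzer should use:

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer src = new WhitespaceTokenizer();      // swap in the custom Tokenizer here
            TokenStream result = new LowerCaseFilter(src);  // add or replace TokenFilters as needed
            return new TokenStreamComponents(src, result);
        }
    };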

Custom tokenizer

2015-01-11 Thread Vihari Piratla
Hi, I am trying to implement a custom tokenizer for my application and I have a few queries regarding the same. 1. Is there a way to provide an existing analyzer (say EnglishAnalyzer) the custom tokenizer and make it use this tokenizer instead of, say, StandardTokenizer? 2. Why are analyzers such as

RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Uwe Schindler
e.apache.org > Subject: Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1 > > Thanks Uwe. It worked. > > > > > On Thu, Mar 20, 2014 at 3:28 PM, Uwe Schindler wrote: > > > Hi, > > > > the IllegalStateException tells you what'

Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Joe Wong
hi.de > > > > -Original Message- > > From: Joe Wong [mailto:jw...@adacado.com] > > Sent: Thursday, March 20, 2014 11:13 PM > > To: java-user@lucene.apache.org > > Subject: Re: Possible issue with Tokenizer in > lucene-analyzers-common-4.6.1 > >

RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Uwe Schindler
ler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Joe Wong [mailto:jw...@adacado.com] > Sent: Thursday, March 20, 2014 11:13 PM > To: java-user@lucene.apache.org > Subject: Re: Possible issue with Tokenizer in

Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Joe Wong
this.stopWords = Collections.EMPTY_SET; } public StemmingAnalyzer(Set stopWords) { this.stopWords = stopWords; } public StemmingAnalyzer(String... stopWords) { this.stopWords = Sets.newHashSet(stopWords); } @Override protected TokenStreamCompo

RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Uwe Schindler
Hi Joe, in Lucene 4.6, the TokenStream/Tokenizer APIs got some additional state machine checks to ensure that consumers and subclasses of those abstract interfaces are implemented in a correct way - they are not easy to understand, because they are implemented in that way to ensure they don't

Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

2014-03-20 Thread Joe Wong
Hi We're planning to upgrade lucene-analyzers-commons 4.3.0 to 4.6.1 . While running our unit test with 4.6.1 it fails at org.apache.lucene.analysis.Tokenizer on line 88 (setReader method). There it checks if input != ILLEGAL_STATE_READER then throws IllegalStateException. Should it not be if inp

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Me
Hi everybody, UserDictionary is right. I am using the Yahoo Japanese morphological analysis API (日本語形態素解析) to build my own user dictionary. http://developer.yahoo.co.jp/webapi/jlp/ On 2014/03/11, at 8:10, Rahul Ratnakar wrote: > Worked perfectly for Japanese. > > I have the same issue with Chinese An
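
On the Lucene side, a sketch of feeding a user dictionary into the Kuromoji analyzer; the file name and entry layout below are illustrative (roughly the CSV format the Kuromoji user dictionary expects), and the 4.x line used a UserDictionary constructor rather than open(), so check the javadocs for your release:

    // userdict.txt, one entry per line: surface,space-segmented surface,readings,part-of-speech
    Reader rdr = Files.newBufferedReader(Paths.get("userdict.txt"), StandardCharsets.UTF_8);
    UserDictionary userDict = UserDictionary.open(rdr);
    Analyzer ja = new JapaneseAnalyzer(userDict,
        JapaneseTokenizer.Mode.SEARCH,
        JapaneseAnalyzer.getDefaultStopSet(),
        JapaneseAnalyzer.getDefaultStopTags());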

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Rahul Ratnakar
s()) >> > >> > new JapaneseAnalyzer(Version.LUCENE_46, null, >> JapaneseTokenizer.Mode.SEARCH, >> > JapaneseAnalyzer.getDefaultStopSet(), >> > JapaneseAnalyzer.getDefaultStopTags()) >> > >> > >> > >> > and none of them seem to tokenize the

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Rahul Ratnakar
H, > > JapaneseAnalyzer.getDefaultStopSet(), > > JapaneseAnalyzer.getDefaultStopTags()) > > > > > > > > and none of them seem to tokenize the words as I want, so was wondering > if > > there is some way for me to actually "update" the dictionary/corpus so >

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Robert Muir
update" the dictionary/corpus so that > these slangs are caught by the tokenizer as single word. > > > My example text has been scrapped from an "adult" website, so it might be > offensive and i apologize for that. A small excerpt from that website:- > > >

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Rahul Ratnakar
seem to tokenize the words as I want, so was wondering if there is some way for me to actually "update" the dictionary/corpus so that these slangs are caught by the tokenizer as a single word. My example text has been scraped from an "adult" website, so it might be offensive and

Re: Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Furkan KAMACI
Hi; Here is a page that has an online Kuromoji tokenizer and related information: http://www.atilika.org/ It may help you. Thanks; Furkan KAMACI 2014-03-10 19:57 GMT+02:00 Rahul Ratnakar : > I am trying to analyze some japanese web pages for presence of slang/adult > phrases in them

Need help "teaching" Japanese tokenizer to pick up slangs

2014-03-10 Thread Rahul Ratnakar
I am trying to analyze some japanese web pages for presence of slang/adult phrases in them using lucene-analyzers-kuromoji-4.6.0.jar. While the tokenizer breaks up the word into proper words, I am more interested in catching the slangs which seems to result from combining various "safe"

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
wrote: Hi, I have a requirement to write a custom tokenizer using Lucene framework. My requirement is it should have capabilities to match multiple words as one token. for example. When user passes String as International Business machine logo or IBM logo it should return International Business

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Michael McCandless
If you already know the set of phrases you need to detect then you can use Lucene's SynonymFilter to spot them and insert a new token. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies wrote: > It sounds like you've been asked to implement Named E
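
A rough sketch of that suggestion, using the phrase from the original question. SynonymGraphFilter is the current class name (the 4.x releases being discussed shipped SynonymFilter), and the mapping and chain below are purely illustrative:

    SynonymMap.Builder b = new SynonymMap.Builder(true);
    CharsRefBuilder scratch = new CharsRefBuilder();
    // Map the multi-word phrase onto one extra token; includeOrig = true keeps the original words too.
    b.add(SynonymMap.Builder.join(new String[] {"international", "business", "machine"}, scratch),
          new CharsRef("ibm"), true);
    SynonymMap map = b.build();   // declares IOException

    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer src = new StandardTokenizer();
            TokenStream result = new LowerCaseFilter(src);
            result = new SynonymGraphFilter(result, map, true);   // ignoreCase = true
            return new TokenStreamComponents(src, result);
        }
    };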

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote: > On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar > wrote: > > Hi, > > > My requirement

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Yann-Erwan Perio
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote: Hi, > My requirement is it should have capabilities to match multiple words as > one token. for example. When user passes String as International Business > machine logo or IBM logo it should return International Business Machine as > one tok

Custom Tokenizer/Analyzer

2014-02-20 Thread Geet Gangwar
Hi, I have a requirement to write a custom tokenizer using the Lucene framework. My requirement is that it should be able to match multiple words as one token. For example, when a user passes a String such as "International Business machine logo" or "IBM logo", it should return "International Business Machine"

Re: Custom Tokenizer

2013-12-05 Thread Erick Erickson
ncluded in your tokens etc. You could use this in conjunction with WhitespaceTokenizerFactory for instance. Or as Furkan suggests, use PatternReplaceCharFilterFactory to operate on the entire input before it's broken up by whatever tokenizer you use. Or You _really_ should make the effor

Re: Custom Tokenizer

2013-12-05 Thread Furkan KAMACI
Hi; The standard analyzer includes these by default: StandardFilter, LowerCaseFilter and StopFilter. You can consider char filters. Did you read here: https://cwiki.apache.org/confluence/display/solr/CharFilterFactories Thanks; Furkan KAMACI 2013/12/5 > Hi, > > I have used StandardAn

Custom Tokenizer

2013-12-05 Thread raghavendra.k.rao
Hi, I have used StandardAnalyzer in my code and it is working fine. One of the challenges that I face is the fact that, this Analyzer by default tokenizes on some special characters such as hyphen, apart from the SPACE character. I want to tokenize only on the SPACE character. Could you please

Re: tokenizer to strip a set of characters

2013-11-21 Thread Jack Krupansky
the start or end. -- Jack Krupansky -Original Message- From: Stephane Nicoll Sent: Thursday, November 21, 2013 9:42 AM To: java-user@lucene.apache.org Subject: tokenizer to strip a set of characters Hi, I am using lucene 3.6 and I am looking for a tokenizer that would remove certain

tokenizer to strip a set of characters

2013-11-21 Thread Stephane Nicoll
examples: - foo, -> foo (comma at the end) - foo. -> foo (period at the end) - foo -> foo - foo?! -> foo - ,foo -> foo (comma at the beginning of a word is a typo mistake but should be handled- Is there a configurable tokenizer I could use for this? Thanks, S.
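
One way to approximate the behaviour in those examples is a PatternReplaceFilter after whatever tokenizer is already in use, trimming punctuation from either end of each token. This is a sketch against a recent Lucene API; whether this exact class is available on the 3.6 line the poster uses is worth checking:

    Tokenizer tok = new WhitespaceTokenizer();   // or the tokenizer already in the chain
    TokenStream ts = new PatternReplaceFilter(tok,
        Pattern.compile("^\\p{Punct}+|\\p{Punct}+$"),   // punctuation at the start or end of a token
        "", true);                                      // replace every match with nothing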

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
: > >> Hi Benson, >> >> the base factory class and the abstract Tokenizer, TokenFilter and >> CharFilter factory classes are all in Lucene's analyzers-common module >> (since 4.0). They are no longer part of Solr. >> >> Uwe >> >> ---

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
Just how 'experimental' is the SPI system at this point, if that's a reasonable question? On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler wrote: > Hi Benson, > > the base factory class and the abstract Tokenizer, TpokenFilter and > CharFilter factory classes ar

RE: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Uwe Schindler
Hi Benson, the base factory class and the abstract Tokenizer, TokenFilter and CharFilter factory classes are all in Lucene's analyzers-common module (since 4.0). They are no longer part of Solr. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMa
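
Since the factories live in analyzers-common, they can also be looked up through SPI by name, which is what makes config-driven analyzer construction possible. A sketch on the 5.x+ API (in the 4.x line create() still took a Reader):

    // "whitespace" and "lowercase" resolve to WhitespaceTokenizerFactory and LowerCaseFilterFactory.
    TokenizerFactory tf = TokenizerFactory.forName("whitespace", new HashMap<>());
    TokenFilterFactory ff = TokenFilterFactory.forName("lowercase", new HashMap<>());
    Tokenizer tok = tf.create();
    TokenStream ts = ff.create(tok);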

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent' as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies wrote: > I'm working on tool that wants to construct analyzers 'at arms length'

Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
I'm working on tool that wants to construct analyzers 'at arms length' -- a bit like from a solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, tokenizer, and token filter.

Re: Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ian Lea
>> >> The split into block and major-57 will be because, from the javadocs >> for ClassicTokenizer, "Splits words at hyphens, unless there's a >> number in the token, in which case the whole token is interpreted as a >> product number and is not split."
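
The split is easy to reproduce by printing the tokens directly; a quick sketch against a recent release (on 4.1 the analyzer constructor also takes a Version argument):

    Analyzer analyzer = new ClassicAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("f", "block-major-57")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());   // prints "block" then "major-57"
        }
        ts.end();
    }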

Re: Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ramprakash Ramamoorthy
is not split.". So I guess it splits on the first > hyphen but not the second. > > ClassicAnalyzer/Tokenizer is general purpose and will never meet > everyone's requirement all the time. You could try a different > analyzer, or build your own. That's what the javadoc

Re: Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ian Lea
case the whole token is interpreted as a product number and is not split.". So I guess it splits on the first hyphen but not the second. ClassicAnalyzer/Tokenizer is general purpose and will never meet everyone's requirement all the time. You could try a different analyzer, or build your

Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ramprakash Ramamoorthy
Sorry, hit the send button accidentally the last time. Please read below : Hello, We're using lucene 4.1. We have the word "*block-major-57*" indexed. Using the classic analyzer, we get the following tokens : *block* and *major-57*. I search for *block-major*, *the docume

Strange behaviour of tokenizer with wildcard queries

2013-09-20 Thread Ramprakash Ramamoorthy
Hello, We're using lucene 4.1. We have the word "block-major-5" indexed. Using the classic analyzer, we get the following tokens : block and major-5. However, -- With Thanks and Regards, Ramprakash Ramamoorthy, Chennai, India.

Re: Exception while creating a Tokenizer

2013-06-12 Thread Gucko Gucko
> > -Original Message- > > From: Gucko Gucko [mailto:gucko.gu...@googlemail.com] > > Sent: Wednesday, June 12, 2013 7:48 PM > > To: java-user@lucene.apache.org > > Subject: Exception while creating a Tokenizer > > > > Hello all, > > > > I

RE: Exception while creating a Tokenizer

2013-06-12 Thread Uwe Schindler
-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Gucko Gucko [mailto:gucko.gu...@googlemail.com] > Sent: Wednesday, June 12, 2013 7:48 PM > To: java-user@lucene.apache.org > Subject: Exception while creating a Tokenizer

Exception while creating a Tokenizer

2013-06-12 Thread Gucko Gucko
n t...@test.com software technology has 4"; Tokenizer tokenizer = new UAX29URLEmailTokenizer(Version.LUCENE_43, new StringReader(text)); TokenStream stream = new LowerCaseFilter(tokenizer); CharTermAttribute term = stream.addAttribute(CharTermAttribute.class); stream.reset(); while( s

Re: problem with wikipedia tokenizer

2013-03-19 Thread Uwe Schindler
Read the documentation about TokenStream and how to consume them correctly. The same problem affecting StandardTokenizer was explained a few days before on this list, too. Sashidhar Guntury schrieb: >hi > >I'm using lucene to query from wiki dump and get the categories out. >So, I >get the r

Re: NullPointerException thrown on tokenizer in 4.1, worked okay in 3.6

2013-02-26 Thread Paul Taylor
On 26/02/2013 12:29, Paul Taylor wrote: This code worked in 3.6 but now throws nullpointer exception in 41, Im not expecting there to be a token created, but surely it shouldn't throw NullPointerException Tokenizer tokenizer = new org.apache.lucene.analysis.standard.StandardToke

Re: ArrayIndexOutOfBoundsException trying to use tokenizer in Lucene 4.1

2013-02-26 Thread Paul Taylor
On 26/02/2013 13:29, Alan Woodward wrote: Hi Paul, You need to call tokenizer.reset() before you call incrementToken() Alan Woodward www.flax.co.uk Hi, thanks that fixes it
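
For reference, the consumption sequence the TokenStream contract expects; a sketch on the 5.x+ API, where the reader is handed over via setReader (in 4.1 it is passed to the Tokenizer constructor instead):

    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("some input text"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

    tokenizer.reset();                        // mandatory before the first incrementToken()
    while (tokenizer.incrementToken()) {
        System.out.println(term.toString());
    }
    tokenizer.end();                          // records end-of-stream state (final offset)
    tokenizer.close();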

Re: ArrayIndexOutOfBoundsException trying to use tokenizer in Lucene 4.1

2013-02-26 Thread Alan Woodward
} >} > System.out.println(sb.toString()); >Tokenizer tokenizer = new > WhitespaceTokenizer(LuceneVersion.LUCENE_VERSION,new > StringReader(sb.toString())); >while(tokenizer.incrementToken()) >{ >

NullPointerException thrown on tokenizer in 4.1, worked okay in 3.6

2013-02-26 Thread Paul Taylor
This code worked in 3.6 but now throws nullpointer exception in 41, Im not expecting there to be a token created, but surely it shouldn't throw NullPointerException Tokenizer tokenizer = new org.apache.lucene.analysis.standard.StandardTokenizer(Version.LUCENE_41, new StringR

ArrayIndexOutOfBoundsException trying to use tokenizer in Lucene 4.1

2013-02-26 Thread Paul Taylor
e(c)) { sb.append(new Character(i).toString() + ' '); } } System.out.println(sb.toString()); Tokenizer tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_VERSION,new StringReader(sb.toString())); while(tokenizer.i

Re: Looking for example code: Tokenizer + Analyzer for Russian stemming

2012-12-19 Thread Steve Rowe
et to the output tokens using termAtt.buffer() and termAtt.length(), or if you're going to Stringify tokens anyway, termAtt.toString(). Steve On Dec 18, 2012, at 1:16 PM, dokondr wrote: > Hello, > I am looking for an example of using Tokenizer + Analyze

Looking for example code: Tokenizer + Analyzer for Russian stemming

2012-12-18 Thread dokondr
> Hello, > I am looking for an example of using Tokenizer + Analyzer (in particular > org.apache.lucene.analysis.ru.RussianAnalyzer) for standalone stemming. > Can't find such an example here: > > http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-s

Re: clearAttributes() not clearing in Tokenizer class

2011-04-30 Thread Ye T Thet
i.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Ye T Thet [mailto:yethura.t...@gmail.com] > > Sent: Saturday, April 30, 2011 5:28 PM > > To: java-user@lucene.apache.org > > Subject: clearAttributes() not clearing in Tokenizer class &

RE: clearAttributes() not clearing in Tokenizer class

2011-04-30 Thread Uwe Schindler
not clearing in Tokenizer class > > Hi All, > > I am using Lucene 3.0.3. I noticed when I called clearAttributes() from my > Tokenizer, the attributes in my TermAttribute object are not being cleared. > > I found the issue tracking here at > https://issues.apache.org/jira/

clearAttributes() not clearing in Tokenizer class

2011-04-30 Thread Ye T Thet
Hi All, I am using Lucene 3.0.3. I noticed when I called clearAttributes() from my Tokenizer, the attributes in my TermAttribute object are not being cleared. I found the issue tracking here at https://issues.apache.org/jira/browse/LUCENE-3042. The status is fixed. It looks like the patch would
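
For context, the usual shape of a Tokenizer's incrementToken(), where clearAttributes() is supposed to wipe the per-token attribute state before the next term is written. This is a simplified sketch with current attribute names (3.0.x used TermAttribute), not the code from the issue:

    public final class SingleChunkTokenizer extends Tokenizer {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private boolean done = false;

        @Override
        public boolean incrementToken() throws IOException {
            if (done) return false;
            clearAttributes();                 // reset term/offset/type left over from the previous token
            char[] buf = new char[64];
            int len = input.read(buf);         // toy behaviour: emit (up to) one buffer as a single token
            done = true;
            if (len <= 0) return false;
            termAtt.copyBuffer(buf, 0, len);
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            done = false;
        }
    }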

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m
Thanks Earl :) This is cool :)

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Earl Hood
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m wrote: > I used Jericho and found it extremely simple to start with ... > > Just wanted to clarify one thing though. > Is there some tool that does extract text from HTML without creating the DOM Looks like Jericho does what you want already: http://je

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Sirish Vadala
> > Consider we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? > > > How does Solr do it ? > > > -- > Regards > Shrinath.M

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-12 Thread Trejkaz
On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m wrote: > I am trying to index content within certain HTML tags, how do I index it ? > Which is the best parser/tokenizer available to do this ? This doesn't really answer the question, but I think it will help... The features you want to

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Sreejith S
've offline HTML pages, no parsing while crawling, now what ? >> Any tokenizer someone has built for this ? > > In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages > by selecting only text between certain tags, before indexing them. > These are offline Web pages, a

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Bill Janssen
shrinath.m wrote: > Consider we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages by selecting only text between certain tags, before indexing them. These

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Erick Erickson
er we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? > > > How does Solr do it ? > > > -- > Regards > Shrinath.M

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m
ML pages, no parsing while crawling, now what ? Any tokenizer someone has built for this ? How does Solr do it ? -- Regards Shrinath.M

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
rote: > > > http://java-source.net/open-source/html-parsers > > 2011/3/11 shrinath.m <[hidden email]> > > > > > > I am trying to index content within

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Ivan KriŔto
Hello! On Fri, Mar 11, 2011 at 12:03 PM, shrinath.m wrote: > I am trying to index content within certain HTML tags, how do I index it ? > Which is the best parser/tokenizer available to do this ? As a general HTML parser I would recommend "Jericho HTML Parser" - http://jerich

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread shrinath.m
I am trying to index content within certain HTML tags, how do I index it > ? > > Which is the best parser/tokenizer available to do this ?

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Li Li
http://java-source.net/open-source/html-parsers 2011/3/11 shrinath.m > I am trying to index content within certain HTML tags, how do I index it ? > Which is the best parser/tokenizer available to do this ?

Running a string through a simple Tokenizer, and then additional Tokenizers (vs. TokenFilters)

2011-02-10 Thread Tavi Nathanson
Hey everyone, I'm trying to do the following: 1. Run a string through a simple tokenizer (i.e. WhitespaceTokenizer) 2. Run the resultant tokens through my current tokenizer as well as StandardTokenizer, in order to isolate the tokens that are different between them. (Background: I wa

Re: Interaction of Tokenattributes and Tokenizer

2010-08-14 Thread Simon Willnauer
You might wanna look at the "Whats new in Lucene 2.9" Whitepaper from Lucid Imagination http://www.lucidimagination.com/developer/whitepaper/Whats-New-in-Apache-Lucene-2-9 on page 7 you find an introduction to this API. This should get you started :) simon On Sat, Aug 14, 2010 at 4:19 PM, Devsh

Interaction of Tokenattributes and Tokenizer

2010-08-14 Thread Devshree Sane
Hi, Can anyone explain to me how exactly the Tokenizers and tokenattributes interact with each other? Or perhaps point me to a link which has a the interaction/sequence diagram for the same? I want to extend the Token class to allow use of some more types of Token Attributes. Thanks -Devshree.

Re: A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Martin
Ahh, I knew I saw it somewhere, then I lost it again... :) I guess the name is not quite intuitive, but anyway thanks a lot! and I'm just wondering if there is a tokenizer that would return me the whole text. KeywordTokenizer does

Re: A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Ahmet Arslan
> and I'm just wondering if there is a tokenizer > that would return me the whole text. KeywordTokenizer does this.
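
Put together with the n-gram filter from the original question, that might look like the fragment below; the gram sizes are arbitrary, and the fourth preserveOriginal argument only exists on newer releases:

    Tokenizer tok = new KeywordTokenizer();                        // emits the whole input as one token
    TokenStream ngrams = new NGramTokenFilter(tok, 2, 4, false);   // 2-4 character grams over that token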

A full-text tokenizer for the NGramTokenFilter

2010-07-17 Thread Martin
issue being discussed a couple of years ago and the proposed solution there has been using the NGramTokenFilter. Now that filter certainly works, but it needs an underlying tokenizer to work with, and I'm just wondering if there is a tokenizer that would return me the whole text. The rea

RE: Using the new tokenizer API from a jar file

2010-01-04 Thread Uwe Schindler
2010 10:33 PM > To: java-user@lucene.apache.org > Subject: Re: Using the new tokenizer API from a jar file > > Sorry for this delay. I was having a silly problem compiling solr but I > figured it out. > I tested it and it worked correctly. Thanks > > On Wed, Dec 30, 20
