Format of Wikipedia Index

2018-01-22 Thread Armins Stepanjans
Hi, I have a question regarding the format of the Index created by DocMaker, from EnWikiContentSource. After creating the Index from a dump of all Wikipedia's articles (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2), I'm having trouble understanding th

Re: Analyzer is not called upon executing addDocument()

2018-01-09 Thread Armins Stepanjans
acets). > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Armins Stepanjans [mailto:armins.bagr...@gmail.com] > > Sent: Tuesday, January 9, 2018 2:52 PM

Re: Maven snapshots

2018-01-09 Thread Armins Stepanjans
Hi, I'm not sure I understand your question. There should be no confusion about setting a Maven snapshot dependency in the pom file, since you can specify the version as 8.0-SNAPSHOT (substituting 8.0 with the version you want). However, in case you are looking for a particular version of Lucene,
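A minimal POM fragment along the lines of what this reply describes (the version string and the snapshot-repository URL are illustrative assumptions, not taken from the thread):

```xml
<!-- Sketch: depend on a Lucene snapshot build.
     Replace 8.0.0-SNAPSHOT with the snapshot version you actually want. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>8.0.0-SNAPSHOT</version>
</dependency>

<!-- Snapshot artifacts are not on Maven Central; a snapshot repository
     must be declared. The URL below is an assumption -- check the
     project's build documentation for the current one. -->
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
  </repository>
</repositories>
```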

Analyzer is not called upon executing addDocument()

2018-01-09 Thread Armins Stepanjans
Hi, When I create a document with multiple StringFields and add it to an IndexWriter using addDocument(Document), the StringFields within the Document are neither tokenized nor filtered according to the Analyzer's specification; however, when I test my Analyzer while looping through tokens by explicitly cal
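This is expected Lucene behaviour: StringField is indexed as a single, untokenized term and bypasses the Analyzer by design, while TextField is run through the Analyzer. A minimal sketch of the distinction (assumes lucene-core on the classpath; field names and the ByteBuffersDirectory choice are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class FieldTypeSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                new ByteBuffersDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // StringField: indexed verbatim as one term -- the Analyzer never sees it.
            doc.add(new StringField("id", "Foo_Bar", Field.Store.YES));
            // TextField: passed through the Analyzer -- tokenized, lowercased, filtered.
            doc.add(new TextField("body", "Foo_Bar baz", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```

So the observation in this thread is not a bug: to have the Analyzer applied, the field must be a TextField (or another tokenized field type), not a StringField.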

Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
Hi, I am looking for a tokenizer where I could specify the delimiters by which the words are tokenized; for example, if I choose the delimiters ' ' and '_', the following string: "foo__bar doo" would be tokenized into: "foo", "", "bar", "doo" (The analyzer could further filter empty tokens, since h
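A plain-Java sketch (no Lucene dependency) of the splitting behaviour the poster describes: split on a character predicate and keep the empty token produced between adjacent delimiters. Lucene 7+ offers the same idea via CharTokenizer.fromSeparatorCharPredicate, except that CharTokenizer drops zero-length tokens; the class and method names below are otherwise my own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PredicateSplitter {
    // Splits input on every character the predicate accepts.
    // Adjacent delimiters (e.g. "__") yield an empty token between them.
    static List<String> split(String input, IntPredicate isDelimiter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (isDelimiter.test(ch)) {
                tokens.add(current.toString()); // may be "" between "__"
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "foo__bar doo" with delimiters ' ' and '_' -> [foo, , bar, doo]
        System.out.println(split("foo__bar doo", ch -> ch == ' ' || ch == '_'));
    }
}
```

As the poster notes, a downstream filter can then discard the empty tokens if they are unwanted.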

Re: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
dler > Achterdiek 19, D-28357 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Armins Stepanjans [mailto:armins.bagr...@gmail.com] > > Sent: Monday, January 8, 2018 2:09 PM > > To: java-user@lucene.apache.org

Re: Looking For Tokenizer With Custom Delimeter

2018-01-08 Thread Armins Stepanjans
k = CharTokenizer.fromSeparatorCharPredicate(ch -> > Character.isWhitespace(ch) || ch == '_'); > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From:
