Re: Document serializable representation

2017-03-30 Thread Denis Bazhenov
Hi. Thanks for the reply. Of course each document goes into exactly one shard. > On Mar 31, 2017, at 15:01, Erick Erickson wrote: > > I don't believe addIndexes does much except rewrite the > segments file (i.e. the file that tells Lucene what > the current segments are). > > That said, if you'r

Re: Document serializable representation

2017-03-30 Thread Erick Erickson
I don't believe addIndexes does much except rewrite the segments file (i.e. the file that tells Lucene what the current segments are). That said, if you're desperate you can optimize/force-merge. Do note, though, that no deduplication is done. So if the indexes you're merging have docs with the s
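A minimal sketch of what such a merge could look like on Lucene 6.x; the directory paths ("shard1", "shard2", "merged-index") and the analyzer are placeholders for illustration, not from the original thread:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory target = FSDirectory.open(Paths.get("merged-index"));
    Directory shard1 = FSDirectory.open(Paths.get("shard1"));
    Directory shard2 = FSDirectory.open(Paths.get("shard2"));
    try (IndexWriter writer = new IndexWriter(target,
            new IndexWriterConfig(new StandardAnalyzer()))) {
        // Registers the source segments in the target's segments file;
        // no deduplication happens, so duplicate docs stay duplicated.
        writer.addIndexes(shard1, shard2);
        // Optional and expensive: rewrite everything into one segment.
        writer.forceMerge(1);
    }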

Re: Document serializable representation

2017-03-30 Thread Denis Bazhenov
Yeah, I definitely will look into PreAnalyzedField as you and Mikhail suggest. Thank you. > On Mar 30, 2017, at 19:15, Uwe Schindler wrote: > > But that's hard to implement. I'd go for Solr instead of doing that on your > own! --- Denis Bazhenov

Re: Document serializable representation

2017-03-30 Thread Denis Bazhenov
Interesting. In the case of addIndexes(), does Lucene perform any optimization on the segments before searching over them, or are those indexes searched "as is"? > On Mar 30, 2017, at 19:09, Mikhail Khludnev wrote: > > I believe you can have more shards for indexing and then merge (and no

Re: Adding TokenFilters to a CustomAnalyzer is too inflexible

2017-03-30 Thread Nicolás Lichtmaier
Ok, so a flexible interface would be to be able to pass some TokenFilterFactory that would be called each time a TokenFilter is needed. Would that be ok? On 30/03/17 at 03:47, Uwe Schindler wrote: A TokenFilter object already built won't work, because the Analyzer must create new insta
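As a sketch of that idea: Lucene's existing TokenFilterFactory has exactly this shape, since its create() is invoked for every new TokenStream chain, so each call yields a fresh filter instance. The factory name and filter choice below are hypothetical, not from this thread:

    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    // Hypothetical example: the factory itself is stateless, and create()
    // hands back a new TokenFilter instance per token stream chain.
    public class MyLowerCaseFilterFactory extends TokenFilterFactory {
        public MyLowerCaseFilterFactory(Map<String, String> args) {
            super(args);
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new LowerCaseFilter(input); // fresh instance per chain
        }
    }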

Re: CustomAnalyzer.Builder builder()

2017-03-30 Thread Leonardo Pérez Pulido
Thank you Uwe, that was really helpful. Leonardo. From: Uwe Schindler Sent: 30 March 2017 13:14:17 To: java-user@lucene.apache.org Subject: RE: CustomAnalyzer.Builder builder() Empty is not null: > .addTokenFilter(StopFilterFactory.class, > "ignoreCase",

RE: CustomAnalyzer.Builder builder()

2017-03-30 Thread Uwe Schindler
Empty is not null: > .addTokenFilter(StopFilterFactory.class, > "ignoreCase", "true", > "words", "", > "format", "wordset") This will cause the empty-named file to be loaded, which may not work with all class loaders. Just remove the useless parameters: remove "words" and "format". I
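In other words, if the default English stopword set is acceptable, the builder call could shrink to something like this sketch (not verified against the original poster's code):

    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.class)
        // Without "words"/"format", StopFilterFactory falls back to its
        // built-in default English stopword set.
        .addTokenFilter(StopFilterFactory.class, "ignoreCase", "true")
        .build();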

Re: CustomAnalyzer.Builder builder()

2017-03-30 Thread Leonardo Pérez Pulido
Hi, I do not know which file name you are referring to; can you please be a little more specific? Did you mean a stopwords file name? Currently I am developing an application which does not have a name yet; this is why you see the class MySearchApp, which is a test source file on which I am testing

RE: CustomAnalyzer.Builder builder()

2017-03-30 Thread Uwe Schindler
Hi, I am still a bit confused why you use an empty file name! Is this just copy-pasted here for privacy reasons without the filename, or is there really no file name? This would explain why it may not work with the defaults. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://www.thetaphi.d

Re: CustomAnalyzer.Builder builder()

2017-03-30 Thread Leonardo Pérez Pulido
Hi, Yes, you are right, using the ClasspathResourceLoader class did solve the issue, passing my own class as a parameter: ClasspathResourceLoader resourceLoader = new ClasspathResourceLoader(MySearchApp.class); Analyzer analyzer = CustomAnalyzer.builder(resourceLoader) .withTokenizer(
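Put together, the working version could look roughly like the sketch below; "stopwords.txt" is a placeholder resource name that would have to exist on the classpath next to MySearchApp:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.StopFilterFactory;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
    import org.apache.lucene.analysis.util.ClasspathResourceLoader;

    // Resources are resolved relative to MySearchApp's class loader.
    ClasspathResourceLoader loader = new ClasspathResourceLoader(MySearchApp.class);
    Analyzer analyzer = CustomAnalyzer.builder(loader)
        .withTokenizer(StandardTokenizerFactory.class)
        .addTokenFilter(StopFilterFactory.class,
            "ignoreCase", "true",
            "words", "stopwords.txt",   // placeholder filename
            "format", "wordset")
        .build();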

RE: 5.x to 6.x migration: replacement for Lucene50Codec

2017-03-30 Thread Uwe Schindler
Hi, > >>> I should have mentioned that, for compatibility reasons, I still need > >>> to be able to read/write indexes created with the old version, i.e., with > >>> the 5.0 codec. > > > > The old codecs are read-only! As said before, you can only specify the > codec for IndexWriter. That means new s

Re: 5.x to 6.x migration: replacement for Lucene50Codec

2017-03-30 Thread Andreas Sewe
Hi Uwe, >>> I should have mentioned that, for compatibility reasons, I still need >>> to be able to read/write indexes created with the old version, i.e., with >>> the 5.0 codec. > > The old codecs are read-only! As said before, you can only specify the codec > for IndexWriter. That means new sege

RE: Document serializable representation

2017-03-30 Thread Uwe Schindler
Hi, there is no easy way to do this with Lucene. The analysis part is tightly bound to IndexWriter. There are ways to decouple this, but you have to write your own Analyzer and some network protocol. Solr has something like this, it's called PreAnalyzedField: This is a field type that has some
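For reference, Solr's PreAnalyzedField with its default JSON parser accepts a serialized token stream roughly like the sketch below, where "t" is the term, "s"/"e" the start/end offsets, "i" the position increment, and "str" the optional stored value; double-check the exact keys against the Solr reference guide for your version:

    {
      "v": "1",
      "str": "Hello, World!",
      "tokens": [
        {"t": "hello", "s": 0, "e": 5, "i": 1},
        {"t": "world", "s": 7, "e": 12, "i": 1}
      ]
    }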

Re: Document serializable representation

2017-03-30 Thread Mikhail Khludnev
I believe you can have more shards for indexing and then merge them (not literally, but just via addIndexes() or so) down to a smaller number for search. Transferring indices is more efficient (scp -C) than sending separate tokens and their attributes over the wire. On Thu, Mar 30, 2017 at 12:02 PM, Denis Ba

Re: Document serializable representation

2017-03-30 Thread Denis Bazhenov
We have already done this, many years ago :) At the moment we have 7 shards. The problem with adding more shards is that search becomes less cost-effective (in terms of cluster CPU time per request) as you split the index into more shards. Considering response time is good enough and the fact search

RE: Document serializable representation

2017-03-30 Thread Uwe Schindler
Hi, the document does not contain the analyzed tokens. The Lucene Analyzers are called inside the IndexWriter *during* indexing, so there is no way to do that somewhere else. The IndexableDocument instances by Lucene are just iterables of IndexableField that contain the unparsed fulltext as pas

Re: Index error

2017-03-30 Thread Trejkaz
What if totalHits > 1? TX

Document serializable representation

2017-03-30 Thread Denis Bazhenov
Hi. We have an in-house distributed Lucene setup: 40 dual-socket servers with approximately 700 cores divided into 7 partitions. Those machines do index search only. Indexes are prepared on several isolated machines (so-called Index Masters) and distributed over the cluster with plain rsync.

RE: 5.x to 6.x migration: replacement for Lucene50Codec

2017-03-30 Thread Uwe Schindler
Hi, > > I should have mentioned that, for compatibility reasons, I still need > > to be able to read/write indexes created with the old version, i.e., with > > the 5.0 codec. The old codecs are read-only! As said before, you can only specify the codec for IndexWriter. That means new segments to al

RE: 5.x to 6.x migration: replacement for Lucene50Codec

2017-03-30 Thread Uwe Schindler
Hi, you have to define your own codec only during indexing, so you can just update that for the migration. This then affects all new segments written to your index. To read indexes, Lucene will automatically load the codec based on the names written to index files. If you want to open 5.x inde
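Concretely, for Lucene 6.x that could look something like the sketch below (codec name per the 6.x javadocs; adjust to the exact version in use). Reading needs no configuration, because each segment records the name of the codec it was written with:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Affects only *new* segments; existing segments are loaded with the
    // codec whose name is recorded in their index files.
    config.setCodec(Codec.forName("Lucene60"));
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("index")), config);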

Re: 5.x to 6.x migration: replacement for Lucene50Codec

2017-03-30 Thread Andreas Sewe
Hi Adrien, > If you move to Lucene 6.1, then this should be Lucene60Codec. More > generally that would be the same codec that is returned by Codec.getDefault. I should have mentioned that, for compatibility reasons, I still need to be able to read/write indexes created with the old version, i.e., w