Tokenizer for Brown Corpus?

2015-02-23 Thread Koji Sekiguchi
Hello, Doesn't Lucene have a Tokenizer/Analyzer for the Brown Corpus? There don't seem to be any such tokenizers/analyzers in Lucene. I didn't want to reinvent the wheel, so I googled, but all I got was a list of snippets that include "the quick brown fox..." :) Koji

RE: Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread Clemens Wyss DEV
Thanks! >Solr...falls back to wrapping with UninvertingReader That's why my Solr unit tests stayed green ;) >But in general, you should really enable DocValues for fields you want to sort on I will! -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday

MemoryIndex slow for BooleanQuery with non-required clause

2015-02-23 Thread Ryan, Michael F. (LNG-DAY)
(I'm using Lucene 4.9.0) I've been doing some perf testing of MemoryIndex, and have found that it is much slower when a BooleanQuery contains a non-required clause, compared to when it just contains required clauses. Most of the time is spent in BooleanScorer, which as far as I can tell is an

Re: Customscorequery and payload

2015-02-23 Thread Alexey Morozov
I have solved a similar task of taking payload into account for fuzzy queries. On 12 February 2015, 2:58:10 GMT+06:00, Sheng wrote: >fellas, > >I am wondering if it is possible to wrap payload query with >customscorequery, so that one can tweak the search score with both >payload >similarity and a c

RE: Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread Uwe Schindler
Hi, Solr uses DocValues and falls back to wrapping with UninvertingReader if the user has not indexed them (with negative effects on startup performance and memory). But in general, you should really enable DocValues for fields you want to sort on. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-2

RE: write.lock is not removed

2015-02-23 Thread Uwe Schindler
Hi, The existence of the write.lock file has nothing to do with actual locking. The lock file is just a placeholder file (0 bytes) on which the lock is applied during indexing (the actual locking is done on this file using fcntl). When indexing finishes, the lock is removed from this file, bu
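The distinction between the lock file and the lock itself can be illustrated with plain java.nio file locking. This is only a minimal sketch, not Lucene's own code: the class name WriteLockDemo, the method lockFileSurvivesRelease and the temporary directory are hypothetical names chosen for illustration. It takes an OS-level lock on a placeholder file, releases it, and then checks that the file is still on disk.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration only: a stand-in for an index directory, not Lucene's implementation.
public class WriteLockDemo {

    /**
     * Locks dir/write.lock at the OS level, releases the lock again,
     * and reports whether the placeholder file still exists afterwards.
     */
    public static boolean lockFileSurvivesRelease(Path dir) throws IOException {
        Path lockFile = dir.resolve("write.lock");
        // Create the 0-byte placeholder if it is missing and open it for writing.
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // tryLock() returns null if another process already holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                throw new IOException("Lock is held by another process: " + lockFile);
            }
            try {
                // Work that needs exclusive access would happen here.
            } finally {
                // Releasing the lock does not delete the file.
                lock.release();
            }
        }
        return Files.exists(lockFile);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lockdemo");
        System.out.println("write.lock still present: " + lockFileSurvivesRelease(dir));
    }
}
```

In this sketch the presence of the file says nothing about whether anyone holds the lock; only a failed tryLock() does, which is why a leftover write.lock is harmless.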

RE: Lucene 5 : createComponents without reader

2015-02-23 Thread Uwe Schindler
Hi, Also, Tokenizer no longer has a Reader in its ctor. Tokenizers are constructed without any reader. To consume the TokenStream one has to set the reader using setReader(Reader). Because of that, createComponents does not need to get a Reader either. The tokenStream method takes a String or Read

alternative for TermsQuery

2015-02-23 Thread Sascha Janz
I use TermsQuery to create a join query. The list of terms can be quite large, e.g. a million entries. When this is the case, the IntroSorter sorting the terms becomes a performance bottleneck. Could I use another strategy or algorithm for building those joins on large sets of terms? an

RE: Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread Clemens Wyss DEV
Thanks for the pointer. How does/did this change make its way into Solr? -Original Message- From: András Péteri [mailto:apet...@b2international.com] Sent: Monday, 23 February 2015 14:13 To: java-user@lucene.apache.org Subject: Re: Lucene 4.x -> 5 : IllegalStateException while sortin

Re: write.lock is not removed

2015-02-23 Thread Robert Muir
That's why locking didn't work correctly back then. On Mon, Feb 23, 2015 at 8:18 AM, Just Spam wrote: > Any reason? > I remember in 3.6 the lock was removed/deleted? > > > 2015-02-23 14:13 GMT+01:00 Robert Muir : > >> It should not be deleted. Just don't mess with it. >> >> On Mon, Feb 23, 2015 at

Re: write.lock is not removed

2015-02-23 Thread Just Spam
Any reason? I remember in 3.6 the lock was removed/deleted? 2015-02-23 14:13 GMT+01:00 Robert Muir : > It should not be deleted. Just don't mess with it. > > On Mon, Feb 23, 2015 at 7:57 AM, Just Spam wrote: > > Hello, > > > > i am trying to index a file (Lucene 4.10.3) – in my opinion in the >

Re: Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread András Péteri
Hi Clemens, I think this part of the release notes [1] applies to your case: * FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming

Re: write.lock is not removed

2015-02-23 Thread Robert Muir
It should not be deleted. Just don't mess with it. On Mon, Feb 23, 2015 at 7:57 AM, Just Spam wrote: > Hello, > > i am trying to index a file (Lucene 4.10.3) – in my opinion in the correct > way – will say: > > get the IndexWriter, Index the Doc and add them, prepare commit, commit and > finally{

write.lock is not removed

2015-02-23 Thread Just Spam
Hello, I am trying to index a file (Lucene 4.10.3), in my opinion in the correct way, that is: get the IndexWriter, index the docs and add them, prepare commit, commit and finally { close }. My writer is created like so: private IndexWriter getDataIndexWriter() throws CorruptIndexExcept

Lucene 4.x -> 5 : IllegalStateException while sorting

2015-02-23 Thread Clemens Wyss DEV
After upgrading to Lucene 5, one of my unit tests, which tests sorting, fails with: unexpected docvalues type NONE for field 'providertestfield' (expected=SORTED). Use UninvertingReader or index with docvalues What am I missing?

RE: Lucene 5 : createComponents without reader

2015-02-23 Thread Clemens Wyss DEV
Got this one sorted out. I was still referencing the 4.x lucene-analyzers.jar, which required the reader ;) Sorry for the noise! -Original Message- From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] Sent: Monday, 23 February 2015 12:42 To: java-user@lucene.apache.org Subject:

Lucene 5 : createComponents without reader

2015-02-23 Thread Clemens Wyss DEV
My custom Analyzer had the following (Lucene 4) impl of createComponents: protected TokenStreamComponents createComponents ( final String fieldName, final Reader reader ) { Tokenizer source = new KeywordTokenizer( reader ); TokenStream