Thank you all for your answers. Initially, I also thought that shingles should make a huge difference. I will give CommonGramsFilter a try. In the meantime, this additional information may help you identify a problem in my setup.
Basically, I indexed the whole Wikipedia dump (> 8 million articles; the index size is 20G on disk). I also extracted a set of 1000 random sentences from the dump in order to create phrase queries, and ran the following algorithm file:

# Properties
directory=FSDirectory
work.dir=/media/sdb/wikipedia/index/2-shingle
task.max.depth.log=2
log.queries=true
query.file=/media/sdb/wikipedia/data/queries.txt.gz
query.maker=ch.unil.doplab.text.ShingleQueryMaker
query.shingle=2

# Algorithm
{ "Rounds"
    OpenReader
    { "SearchSameRdr" Search > : 1000
    CloseReader
    NewRound
} : 10

RepSumByName

The ShingleQueryMaker uses the filter I mentioned in my previous mail. I also tried to warm the reader ({ "WarmRdr" Warm > : 1) without noticing any significant difference. Is there another way to warm the index before performing the queries?

The machine on which I run the benchmark has 16GB of RAM and a Xeon CPU. The benchmark uses a lot of memory (~40-50%). According to the javadoc, the benchmark script I run is single-threaded, and the CPU usage reflects that (~100%). Are there other parameters I should check?

Thank you very much.

On 21 January 2016 at 21:14, Michael McCandless <luc...@mikemccandless.com> wrote:

> Shingles should make a huge difference on phrase query performance if
> 1) the phrase queries involve high frequency terms and 2) you have a
> substantial number of documents in the index (so that
> time-to-visit-postings dominates over time-to-lookup-terms).
>
> 118 rec/sec is already very fast for a long phrase on a large index
> ... how many documents are in your index?
>
> You could also try using CommonGramsFilter instead: it's like
> shingles, but only for high frequency terms, so you get less increase
> on your index size but big perf gains for the otherwise slow phrase
> queries.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Jan 21, 2016 at 1:23 PM, Bertil Chapuis <bchap...@gmail.com> wrote:
> > Hello,
> >
> > I'm trying to improve the speed of an index when searching for long
> > phrases. I performed some tests with the benchmark module. With a
> > simple analyser and PhraseQueries, I get a throughput of 118 rec/sec.
> > My test dataset is the latest dump of Wikipedia. Here are the filters
> > I use at indexing and query time:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> >
> > In order to improve performance, I tried adding a ShingleFilter and
> > ran some benchmarks with PhraseQueries and BooleanQueries (Should,
> > Must). In both cases I got a lower throughput (83 rec/sec and
> > 84 rec/sec respectively). Here is the filter:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> > val shingleFilter = new ShingleFilter(filter, 2, 2)
> > shingleFilter.setOutputUnigrams(false)
> > filter = shingleFilter
> >
> > From what I read, the performance should be better, but I'm unable to
> > get the desired results. Does anyone have advice on the best way to
> > use shingles in order to improve performance? Should I use some other
> > form of Query?
> >
> > Thank you in advance for your help.
> >
> > Regards,
> >
> > Bertil
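
To make the difference between the two approaches discussed in this thread concrete, here is a minimal sketch in plain Scala. It does not use Lucene at all: `shingles` and `commonGrams` are hypothetical helper names, the common-word set is an assumption, and token positions (which the real filters track) are ignored. The first function mimics bigram shingles with unigram output disabled; the second keeps every unigram and adds a bigram only when one of the two tokens is a high frequency term.

```scala
object GramsSketch {
  // Bigram shingles without unigrams: every adjacent pair becomes one term.
  def shingles(tokens: Seq[String]): List[String] =
    tokens.sliding(2).collect { case Seq(a, b) => s"$a $b" }.toList

  // Common-grams-style output (simplified): all unigrams are kept, and a
  // bigram is added only if at least one of its two tokens is in `common`.
  // The "_" separator and the output order are illustrative choices.
  def commonGrams(tokens: Seq[String], common: Set[String]): List[String] = {
    val bigrams = tokens.sliding(2).collect {
      case Seq(a, b) if common(a) || common(b) => s"${a}_$b"
    }.toList
    tokens.toList ++ bigrams
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("the", "quick", "fox", "of", "rome")
    println(shingles(tokens))
    println(commonGrams(tokens, Set("the", "of")))
  }
}
```

With shingles, every pair of adjacent tokens adds a term to the index, whereas the common-grams variant only adds pairs that involve a frequent term, which is why the index grows less.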