Thank you all for your answers. Initially, I also thought that shingles
should make a huge difference. I will give the CommonGramsFilter a try.
In the meantime, this additional information may help you identify
a problem in my setup.
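
To check my understanding of the suggestion, here is a toy sketch in plain
Scala. It is not Lucene code: the object and method names are made up for
illustration, and it only mimics the documented behaviour, namely that
ShingleFilter (2, 2, no unigrams) emits every adjacent pair, whereas
CommonGramsFilter keeps all single terms and overlays a bigram only where
at least one of the two terms is in the common-words set.

```scala
object CommonGramsSketch {
  // Toy model of shingles of size 2 without unigrams:
  // every adjacent pair of tokens becomes one term.
  def shingles(tokens: Seq[String]): Seq[String] =
    tokens.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq

  // Toy model of common grams: every unigram is kept, and a bigram is
  // added for an adjacent pair only if either token is a common word.
  def commonGrams(tokens: Seq[String], common: Set[String]): Seq[String] =
    tokens.indices.flatMap { i =>
      val unigram = Seq(tokens(i))
      val bigram =
        if (i + 1 < tokens.length &&
            (common(tokens(i)) || common(tokens(i + 1))))
          Seq(tokens(i) + "_" + tokens(i + 1))
        else Seq.empty[String]
      unigram ++ bigram
    }
}
```

If this model is right, only pairs that involve a high-frequency term add
extra terms to the index, which would explain the smaller growth in index
size compared with shingles.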

Basically, I indexed the whole Wikipedia dump (> 8 million articles, the
index size is 20 GB on disk). I also extracted a set of 1000 random
sentences from the dump in order to create phrase queries and ran the
following algorithm file:

# Properties
directory=FSDirectory
work.dir=/media/sdb/wikipedia/index/2-shingle
task.max.depth.log=2
log.queries=true
query.file=/media/sdb/wikipedia/data/queries.txt.gz
query.maker=ch.unil.doplab.text.ShingleQueryMaker
query.shingle=2
# Algorithm
{ "Rounds"
    OpenReader
    { "SearchSameRdr" Search > : 1000
    CloseReader
    NewRound
} : 10
RepSumByName

The ShingleQueryMaker uses the filter I mentioned in my previous mail. I
also tried to warm the reader ({ "WarmRdr" Warm > : 1) without noticing
huge differences. Is there another way to warm the index before performing
the queries?

The machine on which I run the benchmark has 16 GB of RAM and a Xeon CPU.
The benchmark uses a lot of memory (~40-50%). According to the javadoc,
the benchmark script I run is single-threaded, and the CPU usage reflects
that (~100%). Are there other parameters I should check?

Thank you very much.


On 21 January 2016 at 21:14, Michael McCandless <luc...@mikemccandless.com>
wrote:

> Shingles should make a huge difference on phrase query performance if
> 1) the phrase queries involve high frequency terms and 2) you have a
> substantial number of documents in the index (so that
> time-to-visit-postings dominates over time-to-lookup-terms).
>
> 118 rec/sec is already very fast for a long phrase on a large index
> ... how many documents are in your index?
>
> You could also try using CommonGramsFilter instead: it's like
> shingles, but only for high frequency terms, so you get less increase
> on your index size but big perf gains for the otherwise slow phrase
> queries.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Jan 21, 2016 at 1:23 PM, Bertil Chapuis <bchap...@gmail.com>
> wrote:
> > Hello,
> >
> > I'm trying to improve the speed of an index when searching for long
> > phrases. I performed some tests with the benchmark module. With a
> > simple analyzer and PhraseQueries I get a throughput of 118 rec/sec.
> > My test dataset is the latest dump of Wikipedia. Here are the filters
> > I use at indexing and query time:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> >
> > In order to improve performance I tried adding a ShingleFilter and
> > ran some benchmarks with PhraseQueries and BooleanQueries (Should,
> > Must); in both cases I got a lower throughput (83 rec/sec and
> > 84 rec/sec, respectively). Here is the filter:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> > val shingleFilter =  new ShingleFilter(filter, 2, 2)
> > shingleFilter.setOutputUnigrams(false)
> > filter = shingleFilter
> >
> > From what I read, performance should be better, but I'm unable to get
> > the desired results. Does anyone have advice on the best way to use
> > shingles in order to improve performance? Should I use some other
> > form of Query?
> >
> > Thank you in advance for your help.
> >
> > Regards,
> >
> > Bertil
>
