We have identified the reason for the slowness... Lucene41PostingsWriter encodes postings as plain VInts when a block holds fewer than 128 entries (the leftover tail of a postings list) and takes a FOR (Frame of Reference) coding approach for full 128-entry blocks...
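The two decode paths can be sketched as follows. This is an illustrative contrast, not Lucene's actual ForUtil/Lucene41PostingsWriter code: VInt spends 1-5 bytes per value with a branch per byte on decode, while FOR picks one bit width for a whole 128-value block and can decode branch-free.

```java
import java.io.ByteArrayOutputStream;

public class PostingsEncodingSketch {
    // Lucene-style VInt: 7 payload bits per byte, high bit marks
    // a continuation byte. Decoding must branch on every byte.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // FOR: the whole block is packed at the bit width of its
    // largest value, so the decoder's inner loop has no
    // data-dependent branches.
    static int forBitsPerValue(int[] block) {
        int bits = 1;
        for (int v : block) {
            bits = Math.max(bits, 32 - Integer.numberOfLeadingZeros(v | 1));
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(5).length);   // small delta: 1 byte
        System.out.println(writeVInt(300).length); // larger delta: 2 bytes

        int[] block = new int[128];
        for (int i = 0; i < 128; i++) block[i] = i; // doc deltas 0..127
        // 7 bits/value -> 112 bytes for the block, comparable space
        // to VInt here, but without per-value branching on decode.
        System.out.println(forBitsPerValue(block));
    }
}
```

The space cost is similar for small deltas; the difference that matters during merge reads is the per-value branching on the VInt path.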
Most of our terms fall under the VInt path, which is why decompression during merge reads was eating up a lot of CPU cycles... We switched it to write using ForUtil even when block-size < 128, and performance was much better and more predictable. Are there any particular reasons for taking the VInt approach?

Any help on this issue is appreciated.

--
Ravi

On Tue, Nov 18, 2014 at 12:49 PM, Ravikumar Govindarajan
<ravikumar.govindara...@gmail.com> wrote:

> Hi,
>
> I am finding that Lucene slows down a lot when bigger and bigger doc/pos
> files are merged... While some slow-down is expected, the worrying part is
> that all my data is in RAM. Version is 4.6.1.
>
> Some sample statistics, taken after instrumenting the SortingAtomicReader
> code (we use a SortingMergePolicy). The times displayed are just for
> reading {ex: in.nextDoc(), in.nextPosition()}; they do not include
> tim-sorting or new-segment writing:
>
> 337 sec to merge postings [281655 docs], with
> SortingDocsAndPositionEnum-nextPosition() at [130 sec],
> SortingDocsAndPositionEnum-nextDoc() at [232 sec], and
> total-num-terms of [2,058,600]
>
> 482 sec to merge postings [475143 docs], with
> SortingDocsAndPositionEnum-nextPosition() at [204 sec],
> SortingDocsAndPositionEnum-nextDoc() at [332 sec], and
> total-num-terms of [3,791,065]
>
> 898 sec to merge postings [890385 docs], with
> SortingDocsAndPositionEnum-nextPosition() at [343 sec],
> SortingDocsAndPositionEnum-nextDoc() at [609 sec], and
> total-num-terms of [5,470,110]
>
> 1000 sec to merge postings [950084 docs], with
> SortingDocsAndPositionEnum-nextPosition() at [361 sec],
> SortingDocsAndPositionEnum-nextDoc() at [686 sec], and
> total-num-terms of [1,108,744]
>
> I went ahead and did an "mlock" on the already-mmapped doc/pos files and
> then proceeded with the merge, to eliminate disk. The numbers shown above
> come from iterating all terms/docs/positions sequentially from RAM!!
> I understand that there is no bulk-merge of postings currently available,
> but given that the data is in RAM, doesn't this indicate a slow-down? Is
> there some configuration I am missing to speed this up?
>
> --
> Ravi
>
> [P.S: I have not verified whether all pages reside in RAM, but "mlock"
> doesn't throw any Exceptions and returns success...]
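On the P.S. about mlock: from inside the JVM, the closest analogue is MappedByteBuffer.load(), which touches every page of the mapping so later sequential reads (e.g. a merge) don't fault to disk. A minimal sketch, using a temporary file as a stand-in for a Lucene .doc/.pos file (the file name and sizes here are illustrative); note that load() is best-effort and, unlike mlock, does not pin pages:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class WarmMmap {
    // Map a file read-only and touch all of its pages. The returned
    // buffer stays valid after the channel is closed.
    static MappedByteBuffer warm(File f) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load(); // best-effort page-in; isLoaded() is only a hint
            return buf;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a segment's doc/pos file.
        File tmp = File.createTempFile("demo", ".doc");
        tmp.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
            raf.setLength(1 << 20); // 1 MB of zeroes
        }
        MappedByteBuffer buf = warm(tmp);
        System.out.println(buf.capacity()); // 1048576
    }
}
```

Even with pages warmed like this, per-value VInt decoding still costs CPU, which matches the observation above that the merge stays slow with everything in RAM.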