Re: custom low-level indexer (to speed things up) when fields, terms and docids are in order

britske Wed, 07 Apr 2010 12:09:52 -0700

Just to update and close this thread (I forgot about it):

After investigation it turns out that 75% of the time of the custom
async indexer (see original email) was spent in FieldInfos.add(...), more
specifically in the part where the field name is interned using
String.intern().
Copy/pasting FieldInfos into a custom class which doesn't use
String.intern() increased throughput a lot.
In my situation (custom low-level indexer) I don't rely on the
String.intern() call in the Lucene code (no merging of segments etc., where
I believe the interned names are used).
Anyway, throughput increased a lot, and I'm satisfied.
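
For illustration only, the idea of looking up field names without
String.intern() can be sketched as below. This is not the patched FieldInfos
class: FieldNameRegistry and its methods are hypothetical names, and the
sketch assumes field names only need to be matched with equals() inside a
single indexer run.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a FieldInfos-like registry (not Lucene code).
 * Field names are looked up by equals()/hashCode() in a plain HashMap,
 * so no call to String.intern() is needed.
 */
final class FieldNameRegistry {
    private final Map<String, Integer> numberByName = new HashMap<>();

    /** Returns the number of a field, assigning the next free one on first sight. */
    int add(String name) {
        Integer existing = numberByName.get(name);
        if (existing != null) {
            return existing;
        }
        int number = numberByName.size();
        numberByName.put(name, number);
        return number;
    }

    /** Number of distinct field names seen so far. */
    int size() {
        return numberByName.size();
    }

    public static void main(String[] args) {
        FieldNameRegistry registry = new FieldNameRegistry();
        // Roughly the schema size mentioned in the thread: ~20.000 fields.
        for (int i = 0; i < 20000; i++) {
            registry.add("field_" + i);
        }
        System.out.println(registry.size());
    }
}
```

The trade-off is that field names handed out this way cannot be compared
with == elsewhere, which is why this only fits code paths that never rely on
interned names.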


Thanks, I've learned a lot from the Lucene internals.
Geert-Jan

2010/3/26 Geert-Jan Brits <gbr...@gmail.com>

> define fun ;-)
>
> indeed I created a custom indexing chain, plugged it in and all works
> well.
> I'm currently trying to isolate the critical parts to double-check that
> most of the time indeed goes into the indexing process.
> It's kind of hard to do accurate measurements because the indexer is designed
> to run asynchronously from the process that calculates the maps etc.,
> so as to make use of concurrent IO and CPU.
>
> However, from a bird's-eye view: disabling the async custom indexer and only
> doing the described calculating/populating of the ordered maps increases
> throughput from 1.8 docs/sec to 5.4 docs/sec, while with or without the async
> indexer enabled the total application is never CPU-bound (not even close).
>
> Investigating more, and reporting back.
>
> Glad you liked the case ;-)
>
> Geert-Jan
>
> 2010/3/26 Michael McCandless-2 [via Lucene] <ml-node+676471-1074052327-38...@n3.nabble.com>
>
>> This sounds like fun :)
>>
>> So you've already created a custom indexing chain, and plugged this
>> into DocumentsWriter?  And this chain directly interacts with the low
>> level classes for writing a segment (FormatPostingsTerms/DocsConsumer,
>> etc.)?
>>
>> I'm not sure you're gonna do much better than that... these classes
>> already expect things "in order" and all they do (pretty much) is
>> write the index files.  I think they should be pretty lean...
>>
>> Also, once flex lands, soon (note that it moves these low level
>> interfaces/classes around)... since you're using these classes for
>> writing, it'll mean you can freely swap in different codecs.
>>
>> The only thing you can do further is to conflate your custom code with
>> the codec, ie, so that you make a single chain that directly writes
>> index files.  But I'm not sure you'll gain much performance by doing
>> so... (and then you can't [as easily] swap codecs).
>>
>> Have you profiled to see where the time is being spent?
>>
>> Mike
>>
>> On Thu, Mar 25, 2010 at 7:40 PM, britske <[hidden email]> wrote:
>>
>> >
>> > Hi,
>> >
>> > perhaps first some background:
>> >
>> > I need to speed up indexing for a particular application which has a
>> > pretty unusual schema: besides the normal stored and indexed fields we
>> > have about 20.000 fields per document which are all indexed/non-stored
>> > sInts.
>> >
>> > Obviously indexing was really slow with such a number of fields. With
>> > indexing through Solr we got about 0.3 docs/sec (on an EC2 m1.large
>> > instance).
>> >
>> > Since these ~20.000 fields are all built/calculated analogously, we
>> > figured it would be possible to build a low-level indexer for these
>> > fields (of which we had domain knowledge which we could use to possibly
>> > speed indexing up) and later merge them with the other fields to
>> > construct the entire index. So we did, and now achieve around
>> > 1.8 docs/sec (6x speedup). Not bad, but still not enough.
>> >
>> > As part of calculating these fields, we keep track of all fields, terms
>> > per field, and docids per term (per field).
>> > All of this is then ordered (the fields, the terms available for each
>> > field, the docids per term) and inserted in that order using low-level
>> > classes like FormatPostingsFieldsWriter, FormatPostingsTermsConsumer and
>> > FormatPostingsDocsConsumer (pseudo-code below).
>> >
>> > This constructs the following files: .tis, .tii, .frq (plus some default
>> > values for the other required files, which don't need actual data because
>> > these fields are not stored).
>> >
>> > I should also mention that each call to the indexer writes all available
>> > fields, terms and docids to a new FSDirectory. So basically each call
>> > results in a complete index (containing about 100 docs each, because
>> > otherwise we run into memory problems keeping the ordered maps).
>> >
>> > Since we already have all fields, terms and docids in order, it seems
>> > like (a lot of) overkill to me to go through the methods that the
>> > above-mentioned classes offer, which were meant for more
>> > 'non-sequential / non-ordered' inserts (AFAIK).
>> >
>> > What would be the best way to write .tis, .tii and .frq in a more
>> > sequential manner?
>> > I'm looking for something that would construct a byte array for each
>> > file that conforms to the index-file definition of that particular file
>> > (or something). I could try to do it myself, bypass all indexing classes
>> > altogether and just write the files to disk. (Possible because, as
>> > mentioned, we have all data needed to construct a complete index.)
>> >
>> > However, perhaps there are classes I'm not aware of that help in getting
>> > the format right (it seems like a lot of trial-and-error coding otherwise).
>> >
>> > Thanks for any help, pointers, etc.
>> >
>> > Geert-Jan
>> >
>> >
>> >
>> > PSEUDO-CODE of the current low-level (not-so) sequential indexer:
>> >
>> >        foreach(String sField: fieldsInOrder){
>> >                --> add field to FieldInfos and grab newly created fieldInfo
>> >                --> add fieldInfo to FormatPostingsFieldsWriter and grab formatPostingsTermsConsumer
>> >                List<String> termsInOrder = termsInOrderForFieldMap(sField);
>> >                foreach(String sTerm: termsInOrder){
>> >                        FormatPostingsDocsConsumer frq = formatPostingsTermsConsumer.add(sTerm);
>> >                        List<Integer> docidsInOrderPerFieldTerm = docidsInOrderPerFieldTermMap(sField+"-"+sTerm);
>> >                        foreach(Integer docid: docidsInOrderPerFieldTerm){
>> >                                frq.addDoc(docid);
>> >                        }
>> >                        //close relevant stuff
>> >                }
>> >                //close relevant stuff
>> >        }
>> >        //close relevant stuff
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p576998.html
>> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
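
For readers who want something runnable next to the pseudo-code quoted
above: the nested fields -> terms -> docids walk can be sketched with sorted
maps as below. This is only a sketch under assumptions. OrderedPostings and
PostingsSink are hypothetical names, PostingsSink merely stands in for the
FormatPostings* consumer chain and is not Lucene's API, and docids are
assumed to be added in increasing order per term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical holder for postings kept in field -> term -> docids order. */
final class OrderedPostings {

    /** Hypothetical sink; stands in for the real segment-writing classes. */
    interface PostingsSink {
        void startField(String field);
        void startTerm(String term);
        void addDoc(int docId);
    }

    // TreeMap keeps fields and terms sorted by String.compareTo().
    private final SortedMap<String, SortedMap<String, List<Integer>>> byField = new TreeMap<>();

    /** Records one posting; docids must arrive in increasing order per term. */
    void add(String field, String term, int docId) {
        byField.computeIfAbsent(field, f -> new TreeMap<>())
               .computeIfAbsent(term, t -> new ArrayList<>())
               .add(docId);
    }

    /** Walks fields, then terms, then docids, exactly once and in order. */
    void writeTo(PostingsSink sink) {
        for (Map.Entry<String, SortedMap<String, List<Integer>>> field : byField.entrySet()) {
            sink.startField(field.getKey());
            for (Map.Entry<String, List<Integer>> term : field.getValue().entrySet()) {
                sink.startTerm(term.getKey());
                for (int docId : term.getValue()) {
                    sink.addDoc(docId);
                }
            }
        }
    }

    /** Renders the walk as text, to make the visiting order visible. */
    String dump() {
        final StringBuilder out = new StringBuilder();
        writeTo(new PostingsSink() {
            public void startField(String field) { out.append("field ").append(field).append('\n'); }
            public void startTerm(String term) { out.append("  term ").append(term).append('\n'); }
            public void addDoc(int docId) { out.append("    doc ").append(docId).append('\n'); }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        OrderedPostings postings = new OrderedPostings();
        postings.add("price", "10", 0);
        postings.add("price", "10", 2);
        postings.add("color", "red", 1);
        System.out.print(postings.dump());
    }
}
```

Keying the inner map by term, instead of by a concatenated
sField+"-"+sTerm string as in the pseudo-code, avoids building one extra
String per field/term pair.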

-- 
View this message in context: 
http://n3.nabble.com/custom-low-level-indexer-to-speed-things-up-when-fields-terms-and-docids-are-in-order-tp576998p703958.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
