Hi Simon,

Thanks for your answer. My comments are below:
so you mean you would want to do that analysis on the client side and
only shoot the already tokenized values to the server?
What exactly is too slow? Can you provide more info what the problem is?

After all I think you should ask on the solr-user list instead.
The point is, I'm using a quite sophisticated NLP pipeline whose output I'd like to index. I have a component which maps this data structure (actually a UIMA CAS object) to Lucene documents. A lot is done with the input data, including some quite custom adaptations of position_increments and aligning several other TokenStreams, again in terms of position_increment and position_offset. I cannot do such things with the native Solr XML format because I need several fields with the same name but different indexing / storing options. This is because I enrich my documents' texts with metadata extracted by my pipeline, so a field gets many more terms than Lucene/Solr analysis could have extracted from the text alone. Solr approximates this capability with multi-valued fields, but that does not work the same way.

I measured the timings for batches of 1000 documents sent to Solr. I am sending the whole UIMA CAS in a serialized form which is a quite verbose XMI format.
Processing 1000 documents in Solr then takes
* approx. 11 sec for deserialization
* approx. 4 sec for mapping to a document
* less than 1 sec for writing the documents to the index.
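Put into proportion, the numbers above look roughly like the following small calculation. It is only a sketch: it takes the "less than 1 sec" figure as a full second (an upper bound), and it ignores whatever it would cost the server to read the condensed form I would send instead, which I have not measured.

```python
# Measured server-side timings per batch of 1000 documents, in seconds.
# The index_write value is an assumed upper bound for "less than 1 sec".
timings = {
    "deserialization": 11.0,
    "mapping": 4.0,
    "index_write": 1.0,
}

total = sum(timings.values())

# Share of the total server time spent in each step.
shares = {step: seconds / total for step, seconds in timings.items()}

for step, share in shares.items():
    print(f"{step}: {timings[step]:.0f} sec ({share:.0%} of {total:.0f} sec)")

# Work that could move to the clients if ready-made documents were sent.
movable = timings["deserialization"] + timings["mapping"]
print(f"movable to clients: {movable:.0f} of {total:.0f} sec per 1000 documents")
```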

So most of the time is lost in deserialization. By sending the Lucene documents directly I hope to reduce this overhead greatly, as I would not be sending the verbose raw data but an already condensed form. Second, the mapping still takes some time, and that work does not necessarily have to be done on the server side. I can scale the clients arbitrarily, so they should do most of the work.

This is why I'd like to build the Lucene documents on the client side and just send them to the server. But now I wonder whether this is possible at all, after the serialization of Lucene documents failed...

Sorry for the long read and thanks for your help :)

Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



