Apologies if you already do something similar, but perhaps of general
interest...

A different approach to your problem is to implement local
fingerprinting. If you want to find documents with overlapping segments, this
algorithm will dramatically reduce the number of segments you create and
search for in every document:

http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
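To give a rough idea of what the paper describes (winnowing), here is a minimal
sketch in Python. The normalisation step, the truncated MD5 hash and the values
k=5 and w=4 are my own assumptions for illustration, not something prescribed
here; a real implementation would more likely use a rolling hash for speed.
The idea is to hash every k-gram, slide a window of w hashes over the result,
and keep only the minimum hash of each window.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-gram of the normalised text (letters and digits only)."""
    norm = "".join(ch.lower() for ch in text if ch.isalnum())
    return [
        int.from_bytes(hashlib.md5(norm[i:i + k].encode("utf-8")).digest()[:4], "big")
        for i in range(len(norm) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Return a list of (position, hash) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return []
    selected = []
    last_pos = -1
    # If there are fewer than w hashes, treat them all as a single window.
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, take the rightmost occurrence of the minimum.
        offset = max(i for i, h in enumerate(window) if h == smallest)
        pos = start + offset
        if pos != last_pos:  # record each selected position only once
            selected.append((pos, smallest))
            last_pos = pos
    return selected


def fingerprint_tokens(text, k=5, w=4):
    """Fingerprints as hex strings, suitable for indexing as single tokens."""
    return {format(h, "08x") for _, h in winnow(text, k, w)}
```

With these settings, any two texts that share a run of at least w + k - 1
normalised characters end up with at least one fingerprint in common, while
only a fraction of the k-gram hashes is kept.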

Then you simply end up indexing each document and, upon submission,
computing fingerprints and querying for them. I don't remember the exact
numbers, but my feeling is that you end up storing ~13% of the document
text. Besides, each fingerprint is a single token, and therefore quite fast
to search for. You could even try one huge boolean query with 1024 clauses,
ouch... :)
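As a sketch of the "one huge boolean query" idea: the snippet below only
builds query strings, it does not talk to Solr. The field name "fingerprint"
is hypothetical, and I am assuming the tokens are plain hex strings so that no
query escaping is needed. The limit of 1024 clauses per query is taken from
the number mentioned above.

```python
def build_fingerprint_queries(tokens, field="fingerprint", max_clauses=1024):
    """Group fingerprint tokens into OR queries of at most max_clauses each."""
    ordered = sorted(set(tokens))  # de-duplicate and keep the output stable
    queries = []
    for i in range(0, len(ordered), max_clauses):
        batch = ordered[i:i + max_clauses]
        queries.append(field + ":(" + " OR ".join(batch) + ")")
    return queries


# Example: 2500 distinct tokens end up in three queries instead of 2500 requests.
example_tokens = [format(n, "08x") for n in range(2500)]
example_queries = build_fingerprint_queries(example_tokens)
```

Each resulting string could then be sent as the q parameter of a single
request, so a whole uploaded document needs only a handful of round trips.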

roman

On Thu, Mar 28, 2013 at 11:43 AM, Mike Haas <mikehaas...@gmail.com> wrote:

> Hello. My company is currently thinking of switching over to Solr 4.2,
> coming off of SQL Server. However, what we need to do is a bit weird.
>
> Right now, we have ~12 million segments and growing. Usually these are
> sentences but can be other things. These segments are what will be stored
> in Solr. I’ve already done that.
>
> Now, what happens is a user will upload say a word document to us. We then
> parse it and process it into segments. It very well could be 5000 segments
> or even more in that word document. Each one of those ~5000 segments needs
> to be searched for similar segments in solr. I’m not quite sure how I will
> do the query (whether proximate or something else). The point though, is to
> get back similar results for each segment.
>
> However, I think I’m seeing a bigger problem first. I have to search
> against ~5000 segments. That would be 5000 http requests. That’s a lot! I’m
> pretty sure that would take a LOT of hardware. Keep in mind this could be
> happening with maybe 4 different users at once right now (and of course
> more in the future). Is there a good way to send a batch query over one (or
> at least a lot fewer) http requests?
>
> If not, what kinds of things could I do to implement such a feature (if
> feasible, of course)?
>
>
> Thanks,
>
> Mike
>
