Re: Distributed Search Components

Chris Hostetter Fri, 18 Jun 2010 12:25:56 -0700

: Chris, can you tell me where exactly I can find those implementations you
: are talking about?
: I can't find them, probably I am searching in the wrong code-files.
: 
: I would really like to compare the sourcecodes of both implementations.


I honestly don't know what you mean by "those implementations" and "both 
implementations" ... impls of what?

: On the other hand, maybe my english-skillz are reasonable for my
: missunderstanding of your post.
: Maybe you mean something like "first, ask each search component for the data
: it needs. For example: get the top facets, get their counts, get the normal
: search-results, get the stats".
: And in the second step "now we know what we need, let's ask each node for
: those data and aggregate it. So we can send *one* request instead of 4 or
: more".

Let's use a concrete example ... imagine you are only dealing with 
QueryComponent and FacetComponent, and imagine we have a single 
"coordinatorX" server that we query, and it distributes to two distinct 
shard servers ("shardA" and "shardB").

The first thing QueryComponent on the coordinatorX server cares about is 
asking shardA and shardB for the docIds of the docs they have that match.  
The first thing FacetComponent on coordinatorX cares about is knowing the 
top facet constraints for the matching docs from shardA and shardB -- both 
of those pieces of information can be computed in a single request to each 
shard, in which the shard computes both pieces of information (its top 
scoring documents and its facet constraints with the highest counts) in a 
single pass.  When coordinatorX gets those responses back, its 
QueryComponent can sort the "score,docId,shard" tuples to decide which 
shards it needs to ask for the stored fields of which docIds in order to 
build the final list of matching docs; and coordinatorX's FacetComponent 
can sort the "constraint,sum(shardCounts)" pairs to decide which constraints 
should be in the final response.  But a constraint that made that list 
because it had a high count from shardB might not have been in the initial 
list from shardA, so the FacetComponent still needs to ask shardA for its 
count for that constraint.
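
To make the coordinator-side merge concrete, here is a rough sketch in 
Python.  It is not Solr's actual code: the function name, the shape of the 
shard responses, and the "ask any shard that didn't report the constraint" 
refinement rule are all simplifying assumptions for illustration.

```python
from collections import defaultdict

def merge_first_phase(shard_responses, rows, facet_limit):
    """Hypothetical coordinator-side merge of first-phase shard responses.

    shard_responses maps a shard name to
      {"docs": [(score, doc_id), ...], "facets": {constraint: count}}
    (an assumed shape, not Solr's wire format).
    """
    # QueryComponent's view: sort (score, docId, shard) tuples, keep the top rows.
    tuples = [(score, doc_id, shard)
              for shard, resp in shard_responses.items()
              for score, doc_id in resp["docs"]]
    tuples.sort(key=lambda t: (-t[0], t[2], t[1]))
    fetch = defaultdict(list)  # shard -> docIds whose stored fields we still need
    for _score, doc_id, shard in tuples[:rows]:
        fetch[shard].append(doc_id)

    # FacetComponent's view: sum the per-shard counts for each constraint.
    totals = defaultdict(int)
    for resp in shard_responses.values():
        for constraint, count in resp["facets"].items():
            totals[constraint] += count
    candidates = sorted(totals, key=lambda c: (-totals[c], c))[:facet_limit]

    # A candidate missing from a shard's initial list needs a follow-up count.
    refine = {}
    for shard, resp in shard_responses.items():
        missing = [c for c in candidates if c not in resp["facets"]]
        if missing:
            refine[shard] = missing
    return dict(fetch), candidates, refine


responses = {
    "shardA": {"docs": [(0.9, "a1"), (0.4, "a2")], "facets": {"red": 5, "blue": 3}},
    "shardB": {"docs": [(0.7, "b1"), (0.6, "b2")], "facets": {"green": 9, "red": 2}},
}
fetch, candidates, refine = merge_first_phase(responses, rows=3, facet_limit=2)
```

In this toy data "green" makes the candidate list purely on the strength of 
shardB's count, so shardA is the only shard that has to be asked again.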

These subsequent pieces of info for both the QueryComponent and the 
FacetComponent can be fetched from each shard in another single request, 
and although they may not be computed in a single pass, we still only have 
the overhead of one network request instead of two or more.
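
A small sketch of that batching idea, again in Python and again only an 
illustration: the "ids" and "refine_facets" keys are made-up names, not 
Solr's real request parameters, and the example inputs are invented.

```python
def build_second_phase_requests(fetch, refine):
    """Combine what each component still needs into ONE request per shard.

    fetch:  shard -> docIds whose stored fields the QueryComponent needs
    refine: shard -> constraints whose counts the FacetComponent needs
    Both inputs and the request keys are hypothetical.
    """
    requests = {}
    for shard in sorted(set(fetch) | set(refine)):
        requests[shard] = {
            "ids": list(fetch.get(shard, [])),
            "refine_facets": list(refine.get(shard, [])),
        }
    return requests


# Invented example inputs: what each component still needs from each shard.
fetch = {"shardA": ["a1"], "shardB": ["b1", "b2"]}
refine = {"shardA": ["green"]}

combined = build_second_phase_requests(fetch, refine)
# If each component sent its own requests instead, shardA would be hit twice.
separate_request_count = len(fetch) + len(refine)
```

Here the combined approach sends two requests (one per shard), while 
letting each component talk to the shards on its own would send three.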

On the other hand, if coordinatorX just dealt with shardA and shardB using 
an abstraction at the Searcher level, using something like MultiSearcher, 
then things like distributed faceting would require a *huge* amount of 
network IO, because using the TermEnums and TermDocs on coordinatorX 
would result in all of that data being streamed from the individual 
(remote) searchers for each shard so the coordinator could execute the 
necessary counting logic.
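
As a back-of-the-envelope illustration of the difference, here is a toy 
model in Python.  It does not use any Lucene or Solr API; the functions and 
the fake "index" (a dict of docId -> field value) are assumptions, and real 
payload sizes depend entirely on the index.

```python
from collections import Counter

def shard_side_payload(field_values, matching_docs, limit):
    """Shard counts locally and returns only its top (constraint, count) pairs."""
    counts = Counter(field_values[doc] for doc in matching_docs)
    return counts.most_common(limit)

def coordinator_side_payload(field_values):
    """Remote-searcher style: every (term, docId) posting for the field is
    shipped to the coordinator so that it can do the counting itself."""
    return sorted((term, doc) for doc, term in field_values.items())


# Toy shard: 10,000 docs, 50 distinct values in the facet field.
field_values = {doc: "color%d" % (doc % 50) for doc in range(10000)}
matching_docs = range(0, 10000, 2)

small = shard_side_payload(field_values, matching_docs, limit=10)
large = coordinator_side_payload(field_values)
```

With these made-up numbers the shard-side approach returns 10 pairs, while 
the coordinator-side approach has to move all 10,000 postings for the field.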


-Hoss

