[
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259438#comment-14259438
]
Per Steffensen commented on SOLR-6810:
--------------------------------------
bq. Since dqa.forceSkipGetIds is always true for this new algorithm then
computing the set X is not necessary and we can just directly fetch all return
fields from individual shards and return the response to the user. Is that
correct?
This is what happens by default with the new algorithm. But dqa.forceSkipGetIds
is not always true. It is true by default, but you can explicitly set it to
false by sending dqa.forceSkipGetIds=false in your request. So basically there
are four options:
* old alg without dqa.forceSkipGetIds or with dqa.forceSkipGetIds=false
(default before SOLR-6810, and currently also after SOLR-6810)
* old alg with dqa.forceSkipGetIds=true (same as with distrib.singlePass=true
before SOLR-6810)
* new alg without dqa.forceSkipGetIds or with dqa.forceSkipGetIds=true (does as
you describe above)
* new alg with dqa.forceSkipGetIds=false (does as described in the JavaDoc you
quoted)
The JavaDoc descriptions describe how the alg works WITHOUT dqa.forceSkipGetIds
switched on. But dqa.forceSkipGetIds is switched on for the new alg by default.
The JavaDoc for ShardParams.DQA.FORCE_SKIP_GET_IDS_PARAM describes how the two
algs are altered when running with dqa.forceSkipGetIds=true. The point is that
you need to know that part as well to understand how the new alg behaves by
default.
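The four combinations can be sketched as a tiny stand-alone model. This is my own illustration, not the actual Solr classes - the enum, method, and description strings are hypothetical, only the parameter semantics (forceSkipGetIds defaults to on for the new alg, off for the old) come from the discussion above:

```java
// Illustrative stand-alone model of the four dqa / dqa.forceSkipGetIds
// combinations. Names and description strings are hypothetical sketches.
public class DqaCombinations {
    enum Alg { OLD, NEW }

    /** Effective flow for a combination; forceSkipGetIds == null means the
     *  parameter was not sent, so the per-algorithm default applies. */
    static String effectiveFlow(Alg alg, Boolean forceSkipGetIds) {
        // The new algorithm switches forceSkipGetIds on by default;
        // the old algorithm leaves it off by default.
        boolean skip = forceSkipGetIds != null ? forceSkipGetIds : alg == Alg.NEW;
        if (alg == Alg.OLD) {
            return skip ? "single pass, like distrib.singlePass=true"
                        : "two passes: ids+score first, then GET_FIELDS by id";
        }
        return skip ? "fetch all return fields directly from the shards"
                    : "multi-phase flow as described in the JavaDoc";
    }

    public static void main(String[] args) {
        // Defaults: old alg = the classic two-pass flow,
        // new alg = the skip-get-ids behaviour.
        System.out.println(effectiveFlow(Alg.OLD, null));
        System.out.println(effectiveFlow(Alg.NEW, null));
    }
}
```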
bq. I think the DefaultProvider and DefaultDefaultProvider aren't necessary? We
can just keep a single static ShardParams.getDQA(SolrParams params) method and
modify it if we ever need to change the default.
Well, I would prefer to keep ShardParams.DQA.get(params) instead of having a
ShardParams.getDQA(params) - I think it keeps things in better context. But I
will survive if you want to change it.
DefaultProvider is supposed to isolate the default decisions.
DefaultDefaultProvider is an implementation that calculates the out-of-the-box
defaults. It could be done directly in ShardParams.DQA.get, but I like to
structure things. I have to admit, though, that the main reason I added the
DefaultProvider was that it makes it easier to change the default decisions
when running the test-suite. I would like to randomly select the DQA to be
used for every single query fired across the entire test-suite. This way we
get very thorough test-coverage of both algs. Having the option of changing
the DefaultProvider made it easy to achieve this in SolrTestCaseJ4:
{code}
private static DQA.DefaultProvider testDQADefaultProvider =
    new DQA.DefaultProvider() {
      @Override
      public DQA getDefault(SolrParams params) {
        // Select the DQA to use at random
        int algNo = Math.abs(random().nextInt() % ShardParams.DQA.values().length);
        return DQA.values()[algNo];
      }
    };
{code}
{code}
DQA.setDefaultProvider(testDQADefaultProvider);
{code}
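The pattern itself can be shown self-contained. This is a sketch in my own words - the names mirror the patch (DQA, DefaultProvider, setDefaultProvider), but the class below is a stand-alone illustration, not the actual Solr code:

```java
import java.util.Random;

// Stand-alone sketch of the DefaultProvider pattern: the default decision
// is isolated behind an interface, so tests can swap in a randomizing
// provider while production keeps the out-of-the-box one.
public class DqaDefaults {
    enum DQA { OLD, NEW }

    interface DefaultProvider {
        DQA getDefault();
    }

    // Out-of-the-box default decision, isolated in one place.
    static final DefaultProvider ORIGINAL_PROVIDER = () -> DQA.OLD;

    // Test-suite provider: pick a random DQA per query, so the whole
    // test-suite exercises both algorithms.
    static DefaultProvider randomizingProvider(Random random) {
        return () -> DQA.values()[random.nextInt(DQA.values().length)];
    }

    private static DefaultProvider provider = ORIGINAL_PROVIDER;

    static void setDefaultProvider(DefaultProvider p) { provider = p; }

    // An explicitly requested DQA always wins; the provider only decides
    // the default when the request does not name one.
    static DQA resolve(DQA explicitlyRequested) {
        return explicitlyRequested != null ? explicitlyRequested
                                           : provider.getDefault();
    }
}
```

Switching providers in a test then is just `DqaDefaults.setDefaultProvider(DqaDefaults.randomizingProvider(random))`, and switching back restores the out-of-the-box behaviour.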
bq. If a user wants to change the default, the dqa can be set in the "defaults"
section of the search handler.
I know it is a matter of opinion, but in my mind the best place to deal with
the default for DQA is in the code that deals with DQA - not somewhere else.
This makes for much better isolation and makes the code easier to understand.
You can essentially navigate to ShardParams.DQA, read the code and JavaDoc,
and understand everything about DQAs. You do not have to know that there is a
decision about the default in the SearchHandler. But if you want to change
that, it is OK with me.
bq. Why do we need the switchToTestDQADefaultProvider() and
switchToOriginalDQADefaultProvider() methods? You are already applying the DQA
for each request so why is the switch necessary?
No, I am not applying the DQA for each request. I trust you understand why I
want to run with a randomized DQA across the entire test-suite - that is why I
introduced the testDQADefaultProvider. In tests that explicitly deal with DQA,
I sometimes want to switch back to the real DefaultProvider, because some of
those tests verify the out-of-the-box default behaviour, e.g. the
verifyForceSkipGetIds tests in DistributedQueryComponentOptimizationTest. It
is also needed in DistributedExpandComponentTest until SOLR-6813 has been
solved.
bq. There's still the ShardParams.purpose field which you added in SOLR-6812
but I removed it. I still think it is unnecessary for purpose to be sent to
shard. Is it necessary for this patch or is it just an artifact from SOLR-6812?
You are right - it is a mistake that I did not remove ShardParams.purpose.
bq. Did you benchmark it against the current algorithm for other kinds of
use-cases as well (3-5 shards, small number of rows)? Not asking for id can
speed up responses there too I think.
I did not do any concrete benchmarking for other kinds of requests. We have
changed our DQA in production for a particular request where it reduces
response-time by a factor of 60 - from minutes/hours to seconds/minutes. We
want to take it in two steps, starting out by switching to the new DQA only in
the case where we have shown that it makes a huge difference. We will soon
look into whether it will help us for all or some of the other queries we do.
The new DQA might well help in the case of "few" shards or small
"rows"-values. What I wanted to say is that the speedup from the new DQA
increases with the following factors:
* The "rows" you ask for (and there are actually many hits on each shard)
* The number of shards searched
I did not mean to say that there is no benefit from the new DQA with small
rows-values or few shards. But I do believe there is a lower limit to when you
should apply this DQA. I do not believe the new DQA is always better than "the
old DQA + dqa.forceSkipGetIds". The main reason is that "the old DQA +
dqa.forceSkipGetIds" does only one round-trip to the shards, while the new DQA
does two. So for very fast/simple queries the extra round-trip might actually
just increase total response-time.
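To make the trade-off concrete, here is a back-of-envelope model. The constants and both cost functions are made up for illustration - this is not a benchmark, just a sketch of why "1 round-trip but rows full docs per shard" can lose to "2 round-trips but only ~rows full docs in total":

```java
// Back-of-envelope latency model (made-up constants, not a benchmark) of
// the round-trip trade-off between "old DQA + dqa.forceSkipGetIds=true"
// and the new DQA.
public class DqaCostModel {
    // One round-trip; every shard materializes up to `rows` full documents.
    static double oldDqaSkipGetIdsMs(int shards, int rows,
                                     double roundTripMs, double perDocMs) {
        return roundTripMs + (double) shards * rows * perDocMs;
    }

    // Two round-trips; only roughly the overall top `rows` docs are fetched.
    static double newDqaMs(int rows, double roundTripMs, double perDocMs) {
        return 2 * roundTripMs + (double) rows * perDocMs;
    }

    public static void main(String[] args) {
        // Many shards, high rows: saved doc fetches dwarf the extra round-trip.
        System.out.println(oldDqaSkipGetIdsMs(1000, 1000, 20, 0.05)); // 50020.0
        System.out.println(newDqaMs(1000, 20, 0.05));                 // 90.0
        // Few shards, few rows: the extra round-trip dominates.
        System.out.println(oldDqaSkipGetIdsMs(3, 10, 20, 0.05));      // 21.5
        System.out.println(newDqaMs(10, 20, 0.05));                   // 40.5
    }
}
```

With these (invented) numbers the crossover is clear: 1000 shards x 1000 rows favours the new DQA massively, while 3 shards x 10 rows favours the single-pass old DQA.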
{quote}
bq. Also, does this patch also improve things if docValues are used for the ID
field?
bq. No.
Which begs the question: what are the downsides of using docValues for the ID
field by default, and are those downsides enough to implement this alternate
merge implementation?
{quote}
I am not sure exactly when doc-values kick in in the search-flow, so I am not
sure "No" is the correct answer. But even if the id field is a doc-value
field, it is still cheaper not to fetch it at all than to fetch it from
doc-values. The other issue (maybe only relevant for us) is that it takes a
significant amount of time to make the id field a doc-value field if it isn't
already - we started using SolrCloud before doc-values were even possible, so
in some of our older systems the id field is not a doc-value field. To turn
the id field into a doc-value field I believe you currently need to re-index
everything - that is not feasible with a thousand billion documents. I know
that [~toke] has recently done some work making it possible to add doc-values
to a field without re-indexing it completely from scratch (basically just
calculating the doc-value data and leaving index/store as it is), but even
with that approach it is not something you just do in a 24x7 system.
I do believe doc-values and the new DQA are orthogonal optimizations.
{quote}
When a different searcher is used (because of a commit) the ordinals could
refer to different docs.
But this seems to lead to acceptable behavior (unlike using internal docids
which leads to catastrophic types of fails)
{quote}
For a multi-phase algorithm, if there are commits (changing the result of the
search) between phases, you will get a response somewhere between "the correct
response before the commit" and "the correct response after the commit". This
goes for both the old and the new DQA, so I assumed it is also acceptable for
the new DQA. Making multi-phase DQAs always respond correctly according to the
state when the first phase was carried out would require ACID-like
transactions with a very strong (serializable'ish) isolation-level - do NOT go
there for a high-performance no-sql database! Remember that there can also be
deletes in between-phase commits.
bq. return less rows than requested
This can also happen with the old DQA - if deletes happen in a between-phases
commit. But the new DQA has a problem with duplicates that the old one does
not. On the other hand, the new DQA is more likely to return the correct
number of documents in case of deletes between phases.
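The "fewer rows than requested" case can be shown with a tiny simulation. This is illustrative only - an in-memory map standing in for the store, not anything resembling Solr's actual phases:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Tiny simulation (illustrative only) of why a two-phase algorithm can
// return fewer rows than requested: ids collected in phase 1 may have been
// deleted by a commit before phase 2 fetches the documents by id.
public class BetweenPhaseDeletes {
    // Phase 2: fetch full documents by the ids collected in phase 1.
    static List<String> phase2Fetch(Map<String, String> store,
                                    List<String> phase1Ids) {
        List<String> docs = new ArrayList<>();
        for (String id : phase1Ids) {
            String doc = store.get(id);   // deleted ids simply yield nothing
            if (doc != null) docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> store = new java.util.HashMap<>();
        store.put("a", "doc-a");
        store.put("b", "doc-b");
        store.put("c", "doc-c");
        List<String> phase1Ids = List.of("a", "b", "c"); // rows=3 collected
        store.remove("b");            // commit with a delete between phases
        System.out.println(phase2Fetch(store, phase1Ids)); // only 2 of 3 rows
    }
}
```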
{quote}
bq. Everything I'm thinking of so far leads me to believe the new strategy
should be the default.
+1
{quote}
Just remember:
* The new DQA has an extra round-trip to the shards compared to "old DQA plus
dqa.forceSkipGetIds=true". It has the same number of round-trips (2) to the
shards as the current default algorithm, though ("old DQA with
dqa.forceSkipGetIds=false")
* Before SOLR-6813 is fixed, the new DQA does not work for some (very limited)
expand-requests (those expand-requests do not work for "old DQA plus
dqa.forceSkipGetIds=true" either). We probably want a default DQA to work for
any request
> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
> Key: SOLR-6810
> URL: https://issues.apache.org/jira/browse/SOLR-6810
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Per Steffensen
> Assignee: Shalin Shekhar Mangar
> Labels: distributed_search, performance
> Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch,
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among
> the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant
> number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 minutes and 1 hour -
> depending on a lot of things. It ought to be much faster, so let's make it so.
> Profiling shows that the problem is that it takes a lot of time to access
> the store to get ids for (up to) 1000 docs (the value of the rows parameter)
> per shard. With 1000 shards that is up to 1 million ids that have to be
> fetched. There is really no good reason to ever read information from the
> store for more than the overall top-1000 documents that have to be returned
> to the client.
> For further detail see mail-thread "Slow searching limited but high rows
> across many shards all with high hits" started 13/11-2014 on
> [email protected]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)