[ 
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259438#comment-14259438
 ] 

Per Steffensen commented on SOLR-6810:
--------------------------------------

bq. Since dqa.forceSkipGetIds is always true for this new algorithm then 
computing the set X is not necessary and we can just directly fetch all return 
fields from individual shards and return the response to the user. Is that 
correct?

This is what happens by default with the new algorithm. But dqa.forceSkipGetIds 
is not always true: it is true by default, but you can explicitly set it to 
false by sending dqa.forceSkipGetIds=false in your request. So basically there 
are four options:
* old alg without dqa.forceSkipGetIds or with dqa.forceSkipGetIds=false 
(default before SOLR-6810, and currently also after SOLR-6810)
* old alg with dqa.forceSkipGetIds=true (same as with distrib.singlePass=true 
before SOLR-6810)
* new alg without dqa.forceSkipGetIds or with dqa.forceSkipGetIds=true (does as 
you describe above)
* new alg with dqa.forceSkipGetIds=false (does as described in the JavaDoc you 
quoted)
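For concreteness, here is a small plain-Java sketch of the four combinations above. This is not patch code - the enum and the resolution logic are just a stand-in modelling the proposed dqa.forceSkipGetIds semantics (off by default for the old alg, on by default for the new one):

```java
import java.util.HashMap;
import java.util.Map;

public class DqaResolution {

    // Simplified stand-in for the proposed ShardParams.DQA enum.
    enum Dqa { OLD, NEW }

    /** Resolve the effective skip-get-ids behaviour from the request params. */
    static boolean forceSkipGetIds(Dqa dqa, Map<String, String> params) {
        String explicit = params.get("dqa.forceSkipGetIds");
        if (explicit != null) {
            // An explicit request param always wins.
            return Boolean.parseBoolean(explicit);
        }
        // Default: off for the old algorithm, on for the new one.
        return dqa == Dqa.NEW;
    }

    public static void main(String[] args) {
        Map<String, String> empty = new HashMap<>();
        Map<String, String> off = new HashMap<>();
        off.put("dqa.forceSkipGetIds", "false");

        System.out.println(forceSkipGetIds(Dqa.OLD, empty)); // old alg, default
        System.out.println(forceSkipGetIds(Dqa.NEW, empty)); // new alg, default
        System.out.println(forceSkipGetIds(Dqa.NEW, off));   // new alg, explicitly off
    }
}
```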

The JavaDoc descriptions describe how the alg works WITHOUT dqa.forceSkipGetIds 
switched on. But dqa.forceSkipGetIds is switched on for the new alg by default. 
The JavaDoc for ShardParams.DQA.FORCE_SKIP_GET_IDS_PARAM describes how the two 
algs are altered when running with dqa.forceSkipGetIds=true. The thing is that 
you need to know this part as well to understand how the new alg works by 
default.

bq. I think the DefaultProvider and DefaultDefaultProvider aren't necessary? We 
can just keep a single static ShardParams.getDQA(SolrParams params) method and 
modify it if we ever need to change the default.

Well, I would prefer to keep ShardParams.DQA.get(params) instead of having a 
ShardParams.getDQA(params) - I think it keeps the logic in better context. But 
I will survive if you want to change it.
DefaultProvider is supposed to isolate the default decisions. 
DefaultDefaultProvider is an implementation that calculates the out-of-the-box 
defaults. It could be done directly in ShardParams.DQA.get, but I like to 
structure things. I have to admit, though, that the main reason I added the 
DefaultProvider was that it makes it easier to change the default decisions 
made when running the test-suite. I would like to randomly select the DQA used 
for every single query fired across the entire test-suite. This way we get very 
thorough test-coverage of both algs. Having the option of changing the 
DefaultProvider made it very easy to achieve this in SolrTestCaseJ4:
{code}
private static DQA.DefaultProvider testDQADefaultProvider =
    new DQA.DefaultProvider() {
      @Override
      public DQA getDefault(SolrParams params) {
        // Select randomly the DQA to use
        int algNo = Math.abs(random().nextInt() % ShardParams.DQA.values().length);
        return DQA.values()[algNo];
      }
    };
{code}
{code}
DQA.setDefaultProvider(testDQADefaultProvider);
{code}

bq. If a user wants to change the default, the dqa can be set in the "defaults" 
section of the search handler.

I know it is a matter of opinion, but in my mind the best place to handle the 
default DQA is in the code that deals with DQA - not somewhere else. This gives 
much better isolation and makes the code easier to understand: you can 
essentially navigate to ShardParams.DQA, read the code and JavaDoc, and 
understand everything about DQAs. You do not have to know that a decision about 
the default is made in SearchHandler. But if you want to change that, it is OK 
with me.
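To illustrate the isolation I mean, here is a minimal plain-Java sketch of the pattern (simplified names, not the actual patch code; the choice of NEW as the out-of-the-box default is purely for illustration): an explicit request param wins, otherwise a swappable provider decides, and everything about the decision lives in one place.

```java
import java.util.Map;

public class DqaDefaults {

    // Simplified stand-in for the proposed ShardParams.DQA enum.
    enum DQA { OLD, NEW }

    /** Pluggable hook that isolates the default-DQA decision. */
    interface DefaultProvider {
        DQA getDefault(Map<String, String> params);
    }

    /** Out-of-the-box behaviour (illustrative: always pick NEW). */
    static class DefaultDefaultProvider implements DefaultProvider {
        public DQA getDefault(Map<String, String> params) {
            return DQA.NEW;
        }
    }

    private static DefaultProvider provider = new DefaultDefaultProvider();

    /** Tests can swap the provider, e.g. for randomized selection. */
    static void setDefaultProvider(DefaultProvider p) {
        provider = p;
    }

    /** An explicit "dqa" request param wins; otherwise ask the provider. */
    static DQA get(Map<String, String> params) {
        String explicit = params.get("dqa");
        if (explicit != null) {
            return DQA.valueOf(explicit);
        }
        return provider.getDefault(params);
    }
}
```

With this shape, the randomized test provider shown above is just another DefaultProvider implementation handed to setDefaultProvider.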

bq. Why do we need the switchToTestDQADefaultProvider() and 
switchToOriginalDQADefaultProvider() methods? You are already applying the DQA 
for each request so why is the switch necessary?

No, I am not applying the DQA for each request. I trust you understand why I 
want to run with a randomized DQA across the entire test-suite - this is why I 
introduced the testDQADefaultProvider. In tests that explicitly deal with DQA 
behaviour, I sometimes want to switch the real DefaultProvider back on, because 
some of those tests are actually testing out-of-the-box default behaviour - 
e.g. the verifyForceSkipGetIds tests in 
DistributedQueryComponentOptimizationTest. It is also needed in 
DistributedExpandComponentTest until SOLR-6813 has been solved.

bq. There's still the ShardParams.purpose field which you added in SOLR-6812 
but I removed it. I still think it is unnecessary for purpose to be sent to 
shard. Is it necessary for this patch or is it just an artifact from SOLR-6812?

You are right. It is a mistake that I did not remove ShardParams.purpose.

bq. Did you benchmark it against the current algorithm for other kinds of 
use-cases as well (3-5 shards, small number of rows)? Not asking for id can 
speed up responses there too I think.

I did not do any concrete benchmarking for other requests. We have changed the 
DQA in production for a particular request where it reduces response-time by a 
factor of 60 - from minutes/hours to seconds/minutes. We want to take it in two 
steps, starting out by switching to the new DQA only in the case where we have 
shown that it makes a huge difference. We will soon look into whether or not it 
will help us for all or some of the other queries we run.

The new DQA might well help in the case of "few" shards or small "rows" values. 
What I wanted to say is that the speedup from the new DQA increases with the 
factors mentioned:
* the "rows" you ask for (when there are actually many hits on each shard)
* the number of shards searched
I did not mean to say that there is no benefit from the new DQA with small 
rows-values or few shards. But I do believe there is a lower limit below which 
you should not apply this DQA. I do not believe it is always better to use the 
new DQA than "the old DQA + dqa.forceSkipGetIds". The main reason is that "the 
old DQA + dqa.forceSkipGetIds" does only one round-trip to all the shards, 
while the new DQA does two. So for very fast/simple queries the extra 
round-trip might actually just increase total response-time.
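As a back-of-the-envelope illustration of this trade-off (all numbers made up, and totalMillis is just a toy cost model, nothing from the patch): when per-phase work is tiny, the extra round-trip dominates; when store access dominates, the extra round-trip is noise.

```java
public class RoundTripCost {

    /** Toy model: total response time = phases x (round-trip latency + per-phase work). */
    static double totalMillis(int roundTrips, double rttMillis, double workMillis) {
        return roundTrips * (rttMillis + workMillis);
    }

    public static void main(String[] args) {
        double rtt = 5.0; // assumed network round-trip to the slowest shard, in ms

        // Cheap query: per-phase work is tiny, so the second round-trip
        // of the new DQA roughly doubles the total response time.
        System.out.println(totalMillis(1, rtt, 1.0)); // old DQA + forceSkipGetIds=true
        System.out.println(totalMillis(2, rtt, 1.0)); // new DQA

        // Expensive query: avoiding the per-shard store access for ids
        // dwarfs the cost of one extra round-trip.
        System.out.println(totalMillis(1, rtt, 2000.0)); // old, fetching ids from store
        System.out.println(totalMillis(2, rtt, 50.0));   // new, ids never fetched
    }
}
```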

{quote}
bq. Also, does this patch also improve things if docValues are used for the ID 
field?

bq. No.

Which begs the question: what are the downsides of using docValues for the ID 
field by default, and are those downsides enough to implement this alternate 
merge implementation?
{quote}

I am not sure exactly when doc-values kick in in the search-flow, so I am not 
sure "No" is the correct answer. But even if the id field has doc-values, it is 
still cheaper not to fetch it at all than to fetch it from doc-values. The 
other issue (maybe only relevant for us) is that it takes a significant amount 
of time to make the id field doc-value if it isn't already - we started using 
SolrCloud before doc-values were even possible, so in some of our older systems 
the id field is not doc-value. To turn the id field into a doc-value field I 
believe you currently need to re-index it all - that is not feasible with a 
thousand billion documents. I know that [~toke] has recently done some work 
making it possible to add doc-values to a field without re-indexing it 
completely from scratch (basically just calculating the doc-value data and 
leaving index/store as it is), but even with that approach it is not something 
you just do in a 24x7 system.

I do believe doc-value and new DQA are orthogonal optimizations.

{quote}
When a different searcher is used (because of a commit) the ordinals could 
refer to different docs.
But this seems to lead to acceptable behavior (unlike using internal docids 
which leads to catastrophic types of fails)
{quote}

For a multi-phase algorithm, if there are commits (changing the result of the 
search) between phases, you will get a response somewhere between "the correct 
response before the commit" and "the correct response after the commit". This 
goes for both the old and the new DQA, so I assumed it is also acceptable for 
the new DQA. Making multi-phase DQAs always respond correctly according to the 
state when the first phase was carried out would require ACID-like transactions 
with a very strong (serializable'ish) isolation-level - do NOT go there for a 
high-performance NoSQL database! Remember that there can also be deletes in 
between-phase commits.

bq. return less rows than requested

This can also happen with the old DQA - if deletes happen in a between-phases 
commit. But the new DQA has a problem with duplicates that the old one does 
not. On the other hand, the new DQA is more likely to return the correct number 
of documents in case of deletes between phases.
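To make the duplicates issue concrete, here is a plain-Java sketch (illustrative only, not patch code) of merging per-shard results into a global top-N while keeping only the first (best-scoring) occurrence of each id - the kind of dedup a merge has to do if the same id can come back from more than one phase-2 response:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MergeShardResults {

    /** Minimal stand-in for a per-shard result entry. */
    record Doc(String id, float score) {}

    /** Merge per-shard result lists into the global top-N, dropping duplicate ids. */
    static List<Doc> mergeTopN(List<List<Doc>> perShard, int rows) {
        List<Doc> all = new ArrayList<>();
        perShard.forEach(all::addAll);
        // Sort globally by score descending, then keep the first
        // occurrence of each id until we have the requested rows.
        all.sort(Comparator.comparingDouble(Doc::score).reversed());
        Set<String> seen = new HashSet<>();
        List<Doc> merged = new ArrayList<>();
        for (Doc d : all) {
            if (seen.add(d.id())) {
                merged.add(d);
                if (merged.size() == rows) break;
            }
        }
        return merged;
    }
}
```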

{quote}
bq. Everything I'm thinking of so far leads me to believe the new strategy 
should be the default.

+1
{quote}

Just remember:
* The new DQA has an extra round-trip to the shards, compared to "old DQA plus 
dqa.forceSkipGetIds=true". It has the same number of round-trips (2) to the 
shards as the current default algorithm, though ("old DQA with 
dqa.forceSkipGetIds=false")
* Before SOLR-6813 is fixed, the new DQA does not work for some (very limited) 
expand-requests (those expand-requests do not work for "old DQA plus 
dqa.forceSkipGetIds=true" either). We probably want a default DQA to work for 
any request

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, 
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is 
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard something a-la this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among 
> the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant 
> number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 min to 1 hour - depending on 
> a lot of things. It ought to be much faster, so lets make it.
> Profiling show that the problem is that it takes lots of time to access the 
> store to get id’s for (up to) 1000 docs (value of rows parameter) per shard. 
> Having 1000 shards its up to 1 mio ids that has to be fetched. There is 
> really no good reason to ever read information from store for more than the 
> overall top-1000 documents, that has to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows 
> across many shards all with high hits" started 13/11-2014 on 
> [email protected]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
