[
https://issues.apache.org/jira/browse/SOLR-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586536#comment-16586536
]
Varun Thacker commented on SOLR-12635:
--------------------------------------
There was a concern that if you do rollups on 10 fields in any combination, the
number of partitionKeys combinations would be too high for us to pre-cache
them all.
Here's an example Joel discussed offline which shows why partitionKeys should
not include every one of your underlying sort and rollup fields. In fact, if
you provide more than 4 partitionKeys today, we silently skip the remaining
keys for partitioning.
The hash partitioner just needs to send documents with the same key values to
the same worker node. You could do that with just one partitioning key.
For example, if you sort on year, month and day, you could partition on year
only and still be fine as long as there were enough distinct years to spread
the records across the worker nodes.
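The year-only idea above can be sketched roughly as follows. This is only an
illustration: workerFor is a hypothetical helper, and Objects.hash is a
stand-in, not the hash function Solr's hash partitioner actually uses. The
point is just that every record sharing a year lands on the same worker,
whatever its month and day are.

```java
import java.util.Objects;

// Illustrative sketch only; not Solr code.
final class PartitionSketch {

    // Hypothetical helper: maps the partition key values of one record to a
    // worker index in [0, workers). Objects.hash is an assumed stand-in hash.
    static int workerFor(int workers, Object... partitionKeyValues) {
        return Math.floorMod(Objects.hash(partitionKeyValues), workers);
    }

    public static void main(String[] args) {
        int workers = 6;
        // Records sorted on {year, month, day}.
        int[][] records = {
            {2016, 1, 15}, {2016, 7, 4}, {2017, 1, 15}, {2018, 12, 31}
        };
        for (int[] r : records) {
            // Partition on year only: month and day do not affect the worker.
            int worker = workerFor(workers, r[0]);
            System.out.println(r[0] + "-" + r[1] + "-" + r[2] + " -> worker " + worker);
        }
    }
}
```

With only a handful of distinct years, some workers would sit idle, which is
why the approach depends on having enough distinct values in the chosen key.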
I'll write up some best practices for using parallel streams in the ref-guide,
covering warming and how many partitionKeys to use. Closing out this
Jira as Won't Fix though.
> HashQParserPlugin should be run as a post filter cost is not explicitly
> defined
> -------------------------------------------------------------------------------
>
> Key: SOLR-12635
> URL: https://issues.apache.org/jira/browse/SOLR-12635
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Major
> Attachments: SOLR-12635.patch
>
>
> I was doing some performance benchmarking for a user on slow streaming
> queries. The weird thing was that the same streaming expression was fast
> when we fired it again.
> We were able to isolate the slowness to the hash query parser.
> Here are the first and second times we fired the query. To simplify things,
> this is for one shard and for the same worker:
> {code:java}
> path=/export
> params={q=*:*&distrib=false&indent=off&fl=fields&fq=user:1&fq={!hash
> workers=6 worker=3}&partitionKeys=partitionKey&sort=partitionKey
> asc&wt=javabin&version=2.2} hits=0 status=0 QTime=6821
> path=/export
> params={q=*:*&distrib=false&indent=off&fl=fields&fq=user:1&fq={!hash
> workers=6 worker=3}&partitionKeys=partitionKey&sort=partitionKey
> asc&wt=javabin&version=2.2} hits=0 status=0 QTime=0{code}
> Even with hits=0 the first query took 6.8 seconds. The shard has 17 million
> documents.
> The second query utilizes the queryResultCache and hence is lightning fast
> the second time around.
> When we execute the same query and add a cost, i.e. {{&fq={!hash workers=6
> worker=3 cost=101}}}, the query gets executed as a post filter and even
> uncached is super fast.
> I created this Jira so that we can always set cost > 100 from the parallel
> stream.
> However, I am happy to change the default behaviour for HashQParserPlugin and
> make it run as a post filter by default unless a cost is explicitly specified.
> CollapsingQParserPlugin does this currently to make sure it's run as a post
> filter by default:
> {code:java}
> public int getCost() {
>   // Never report a cost below 100, so the filter runs as a post filter.
>   return Math.max(super.getCost(), 100);
> }{code}
> Thoughts anyone?
>