Here is the jira that describes it: https://issues.apache.org/jira/browse/SOLR-15049
The description and comments don't describe the full performance impact of this optimization. When you have a large number of join keys the impact of this optimization is massive.

Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, May 3, 2021 at 8:38 AM Joel Bernstein <joels...@gmail.com> wrote:

> If you are using the latest version of Solr there is a new optimized self join which is very fast. To use it you must join in the same core and use the same join field (the to and from field must be the same). The optimization should kick in on its own with the join qparser plugin and will be orders of magnitude faster for large joins.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Apr 30, 2021 at 11:37 AM Jens Viebig <jens.vie...@vitec.com> wrote:
>
>> Tried something today which seems promising.
>>
>> I have put all documents from both cores into the same collection, sharded the collection into 8 shards, routing the documents so that all documents with the same contentId_s end up on the same shard.
>> To distinguish between document types we used a string field with an identifier: doctype_s:col1, doctype_s:col2.
>> (Btw, what would be the best data type for a doc identifier that is fast to filter on?)
>> It seems the join inside the same core is a) much more efficient and b) works with a sharded index.
>> We are currently still running this on a single instance and get reasonable response times of ~1 sec, which would be OK for us and a big improvement over the old state.
>>
>> - Why is the join that much faster? Is it because of the sharding, or also because of the same core?
>> - How can we expect this to scale when adding more documents (probably by adding Solr instances/shards/replicas on additional servers)? Doubling/tripling/... the amount of docs.
>> - Would you expect query times to improve with additional servers and Solr instances?
>> - What would be the best data type for a doc identifier that is fast to filter on, to distinguish between different document types in the same collection?
>>
>> What I don't like about this solution is that we lose the ability to completely reindex a "document type". For example, collection1 was pretty fast to completely reindex and its schema possibly changes more often, while collection2 is "index once / delete after x days" and is heavy to reindex.
>>
>> Best Regards
>> Jens
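For illustration, a minimal sketch of that same-collection setup, using the field names already in this thread (contentId_s, doctype_s, timecode_f, coll1field, coll2field); the document ids and values are hypothetical, and using the compositeId router ("contentId!docId" ids) is just one assumed way to co-route all documents sharing a contentId_s onto the same shard:

  { "id": "ABC123!meta-1", "doctype_s": "col1", "contentId_s": "ABC123", "coll1field": "someval", "timecode_f": 12.0 }
  { "id": "ABC123!gps-42", "doctype_s": "col2", "contentId_s": "ABC123", "coll2field": "otherval", "timecode_f": 3.5 }

Sample 2 from the original mail, rewritten as a same-collection join (no fromIndex); per Joel's note, the SOLR-15049 optimization should apply here since from and to are the same field in the same core:

  query: coll1field:someval
  filter: doctype_s:col1
  filter: {!join from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:otherval'}
  filter: {!collapse field=contentId_s min=timecode_f}

Both the join and the collapse operate per shard, which is why the routing that keeps each contentId_s on a single shard matters for correct results on a sharded index.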
>> -----Original Message-----
>> From: Jens Viebig <jens.vie...@vitec.com>
>> Sent: Wednesday, 28 April 2021 19:12
>> To: users@solr.apache.org
>> Subject: join with big 2nd collection
>>
>> Hi List,
>>
>> We have a join performance issue and are not sure in which direction we should look to solve it. We currently only have a single-node setup.
>>
>> We have 2 collections on which we run join queries, joined by a "primary key" string field contentId_s. Each dataset for a single contentId_s consists of multiple timecode-based documents in both indexes, which makes this a many-to-many join.
>>
>> collection1 - contains generic metadata and timecode-based content (think timecode-based comments):
>> Documents: 382,872
>> Unique contentId_s: 16,715
>> ~160 MB size
>> single shard
>>
>> collection2 - contains timecode-based GPS data (GPS position, field of view, ...; the timecodes are not related to the timecodes in collection1, so flattening the structure would blow up the number of documents to incredible numbers):
>> Documents: 695,887,875
>> Unique contentId_s: 10,199
>> ~300 GB size
>> single shard
>>
>> Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with 64 GB without much improvement) and a 1 TB SSD for the index.
>>
>> In our use case there is lots of indexing/deletion traffic on both indexes and only a few queries fired against the server.
>>
>> We are constantly indexing new content and deleting old documents. This was already getting problematic with HDDs, so we switched to SSDs; indexing speed is fine for now (we might also need to scale this up in the future to allow more throughput).
>>
>> But search speed suffers when we need to join with the big collection2 (taking up to 30 sec for the query to succeed). We had some success experimenting with score join queries when the collection2 result only returns a few unique IDs, but we can't predict that this is always the case, and if a lot of documents are hit in collection2, performance is 10x worse than with the original normal join.
>>
>> Sample queries look like this (simplified, but more complex queries are not much slower):
>>
>> Sample 1:
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> Sample 2:
>> query: coll1field:someval
>> filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:otherval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> I experimented with first running the query on collection2 alone just to get the number of matching docs (collapsing on contentId_s), so we could choose the right join query based on how many results we get, but with many hits in collection2 this takes almost as long as doing the join, so slow queries would get even slower.
>>
>> Caches also don't seem to help much, since almost every query fired is different and the index usually changes between requests anyway.
>>
>> We are open to anything: adding nodes/hardware/shards, changing the index structure...
>> Currently we don't know how to get around the big join.
>>
>> Any advice in which direction we should look?
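The "score join" experiments mentioned above aren't shown in the samples; for reference, that variant is selected on the same join qparser via the score local parameter. A rough sketch of Sample 2 in that form (the values are the placeholders from the thread):

  query: coll1field:someval
  filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 score=none v='coll2field:otherval'}
  filter: {!collapse field=contentId_s min=timecode_f}

Roughly speaking, this implementation gathers the join keys from the from-side matches, which fits the observation above that it was fast when collection2 matched only a few unique contentId_s values and much slower when it matched many.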