Hi Joel,

Thanks for the hint. I can confirm that joins are much faster with this optimization. I also tried the same collection with 8 shards vs. a single shard, all on a single Solr node. Sharding seems to improve search speed in our scenario even on a single node.
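In case it is useful to others, this is roughly how we create the routed collection (collection name and host are placeholders, so treat this as a sketch rather than our exact command). router.field makes Solr hash on contentId_s instead of the uniqueKey, so both document types for the same content land on the same shard and each per-shard join sees complete data:

  # Collections API: create an 8-shard collection routed by the join key
  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=content&numShards=8&replicationFactor=1&router.name=compositeId&router.field=contentId_s"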
I think we will experiment with adding more Solr nodes with replicas next. Does anyone have insights into how the join is done internally? My understanding is that every shard does the join on its own and returns joined results that are then combined into the overall result; is this correct? Under this assumption we would expect that adding more nodes/servers could improve query speed. We also found that we needed to add more heap memory, otherwise we got an OOM in some cases. Will the memory requirements of a single node change when adding more nodes with replicas?

Best Regards
Jens

-----Original Message-----
From: Joel Bernstein <joels...@gmail.com>
Sent: Monday, May 3, 2021 14:42
To: users@solr.apache.org
Subject: Re: join with big 2nd collection

Here is the jira that describes it:

https://issues.apache.org/jira/browse/SOLR-15049

The description and comments don't convey the full performance impact of this optimization. When you have a large number of join keys, the impact is massive.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, May 3, 2021 at 8:38 AM Joel Bernstein <joels...@gmail.com> wrote:
> If you are using the latest version of Solr, there is a new optimized
> self join which is very fast. To use it you must join within the same core
> and use the same join field (the to and from fields must be the same). The
> optimization should kick in on its own with the join qparser plugin
> and will be orders of magnitude faster for large joins.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 30, 2021 at 11:37 AM Jens Viebig <jens.vie...@vitec.com> wrote:
>
>> Tried something today which seems promising.
>>
>> I have put all documents from both cores into the same collection,
>> sharded the collection into 8 shards, and routed the documents so that
>> all documents with the same contentId_s end up on the same shard.
>> To distinguish between document types we use a string field as an
>> identifier: doctype_s:col1, doctype_s:col2. (Btw, what would be the
>> best data type for a doc identifier that is fast to filter on? See the
>> sketch at the end of this mail.)
>> It seems a join inside the same core is a) much more efficient and b)
>> works with a sharded index. We are currently still running this
>> on a single instance and get reasonable response times of ~1 sec, which
>> would be OK for us and is a big improvement over the old state.
>>
>> - Why is the join that much faster? Is it because of the sharding, or
>> also because of the same core?
>> - How can we expect this to scale when adding more documents
>> (probably by adding Solr instances/shards/replicas on additional
>> servers)? Doubling/tripling/... the amount of docs
>> - Would you expect query times to improve with additional servers and
>> Solr instances?
>> - What would be the best data type for a doc identifier that is fast
>> to filter on, to distinguish between different document types in the
>> same collection?
>>
>> What I don't like about this solution is that we lose the ability to
>> completely reindex a single "document type". For example,
>> collection1 was pretty fast to completely reindex and its schema
>> possibly changes more often, while collection2 is "index once / delete
>> after x days" and is heavy to reindex.
>>
>> Best Regards
>> Jens
>>
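>> For the doc-type identifier we currently use a plain single-valued
>> string field, applied as a filter query. A sketch of the field
>> definition and of the combined query we run now (docValues="true" is
>> an assumption on our side, not something we have benchmarked):
>>
>> <!-- managed-schema: single-valued string, indexed for fast fq filtering -->
>> <field name="doctype_s" type="string" indexed="true" stored="true" docValues="true"/>
>>
>> query: doctype_s:col1 AND (coll1field:someval OR {!join
>> from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:someval'})
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> With from and to being the same field and no fromIndex, the join stays
>> inside each core, which seems to be what makes it fast.
>>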
>> -----Original Message-----
>> From: Jens Viebig <jens.vie...@vitec.com>
>> Sent: Wednesday, April 28, 2021 19:12
>> To: users@solr.apache.org
>> Subject: join with big 2nd collection
>>
>> Hi List,
>>
>> We have a join performance issue and are not sure in which direction
>> we should look to solve it.
>> We currently only have a single-node setup.
>>
>> We have 2 collections on which we run join queries, joined by a "primary key"
>> string field contentId_s. Each dataset for a single contentId_s
>> consists of multiple timecode-based documents in both indexes, which
>> makes this a many-to-many query.
>>
>> collection1 - contains generic metadata and timecode-based content
>> (think timecode-based comments):
>> Documents: 382,872
>> Unique contentId_s: 16,715
>> ~160 MB size
>> single shard
>>
>> collection2 - contains timecode-based GPS data (GPS position, field
>> of view, ...; the timecodes are not related to the timecodes in collection1, so
>> flattening the structure would blow up the number of documents to incredible
>> numbers):
>> Documents: 695,887,875
>> Unique contentId_s: 10,199
>> ~300 GB size
>> single shard
>>
>> Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with
>> 64 GB without much improvement) and a 1 TB SSD for the index.
>>
>> In our use case there is lots of indexing/deletion traffic on both
>> indexes and only a few queries fired against the server.
>>
>> We are constantly indexing new content and deleting old documents.
>> This was already getting problematic with HDDs, so we switched to
>> SSDs; indexing speed is fine for now (we might also need to scale
>> this up in the future to allow more throughput).
>>
>> But search speed suffers when we need to join with the big
>> collection2 (taking up to 30 sec for the query to succeed). We had
>> some success experimenting with score join queries when the collection2
>> side returns only a few unique ids, but we can't predict that this
>> is always the case, and if a lot of documents are hit in collection2,
>> performance is 10x worse than with the normal join (a sketch of the
>> score join variant is at the end of this mail).
>>
>> Sample queries look like this (simplified, but more complex queries
>> are not much slower):
>>
>> Sample 1:
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s
>> fromIndex=collection2 v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> Sample 2:
>> query: coll1field:someval
>> filter: {!join from=contentId_s to=contentId_s fromIndex=collection2
>> v='coll2field:otherval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> I experimented with first running the query on collection2 alone, only
>> to get the number of hits (collapsing on contentId_s) and see how many
>> results we get, so we could choose the right join query; but with
>> many hits in collection2 this takes almost as long as doing the join,
>> so slow queries would get even slower.
>>
>> Caches also don't seem to help much, since almost every query fired is
>> different and the index mostly changes between requests anyway.
>>
>> We are open to anything: adding nodes/hardware/shards/changing the
>> index structure...
>> Currently we don't know how to get around the big join.
>>
>> Any advice on which direction we should look?
>>
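>> For reference, the score join variant we experimented with looks
>> roughly like this; the score parameter switches to the score-join
>> implementation, and score=none skips scoring and just filters by the
>> joined keys (a sketch from memory, so the exact parameters may need
>> checking; the field names match the samples above):
>>
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s
>> fromIndex=collection2 score=none v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>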