Here is the jira that describes it: https://issues.apache.org/jira/browse/SOLR-15049
The description and comments don't describe the full performance impact of this optimization. When you have a large number of join keys the impact of this optimization is massive.

Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, May 3, 2021 at 8:38 AM Joel Bernstein <joels...@gmail.com> wrote:

> If you are using the latest version of Solr there is a new optimized self join which is very fast. To use it you must join in the same core and use the same join field (the to and from field must be the same). The optimization should kick in on its own with the join qparser plugin and will be orders of magnitude faster for large joins.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Apr 30, 2021 at 11:37 AM Jens Viebig <jens.vie...@vitec.com> wrote:
>
>> Tried something today which seems promising.
>>
>> I have put all documents from both cores into the same collection, sharded the collection into 8 shards, routing the documents so that all documents with the same contentId_s end up on the same shard.
>> To distinguish between document types we used a string field with an identifier: doctype_s:col1, doctype_s:col2.
>> (Btw, what would be the best data type for a doc identifier that is fast to filter on?)
>> It seems the join inside the same core is a) much more efficient and b) works with a sharded index.
>> We are currently still running this on a single instance and get reasonable response times of ~1 sec, which would be OK for us and a big improvement over the old state.
>>
>> - Why is the join that much faster? Is it because of the sharding, or also because of the same core?
>> - How can we expect this to scale when adding more documents (probably by adding Solr instances/shards/replicas on additional servers)? Doubling/tripling/... the amount of docs.
>> - Would you expect query times to improve with additional servers and Solr instances?
>> - What would be the best data type for a doc identifier that is fast to filter on, to distinguish between different document types in the same collection?
>>
>> What I don't like about this solution is that we lose the ability to completely reindex a "document type". For example, collection1 was pretty fast to completely reindex and its schema possibly changes more often, while collection2 is "index once / delete after x days" and is heavy to reindex.
>>
>> Best Regards
>> Jens
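For illustration, a minimal sketch of that same-collection setup, using the field names already in this thread (contentId_s, doctype_s, timecode_f, coll1field, coll2field); the document ids and values are hypothetical, and using the compositeId router ("contentId!docId" ids) is just one assumed way to co-route all documents sharing a contentId_s onto the same shard:

  { "id": "ABC123!meta-1", "doctype_s": "col1", "contentId_s": "ABC123", "coll1field": "someval", "timecode_f": 12.0 }
  { "id": "ABC123!gps-42", "doctype_s": "col2", "contentId_s": "ABC123", "coll2field": "otherval", "timecode_f": 3.5 }

Sample 2 from the original mail, rewritten as a same-collection join (no fromIndex); per Joel's note, the SOLR-15049 optimization should apply here since from and to are the same field in the same core:

  query: coll1field:someval
  filter: doctype_s:col1
  filter: {!join from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:otherval'}
  filter: {!collapse field=contentId_s min=timecode_f}

Both the join and the collapse operate per shard, which is why the routing that keeps each contentId_s on a single shard matters for correct results on a sharded index.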
>> -----Original Message-----
>> From: Jens Viebig <jens.vie...@vitec.com>
>> Sent: Wednesday, 28 April 2021 19:12
>> To: users@solr.apache.org
>> Subject: join with big 2nd collection
>>
>> Hi List,
>>
>> We have a join performance issue and are not sure in which direction we should look to solve it. We currently only have a single-node setup.
>>
>> We have 2 collections on which we run join queries, joined by a "primary key" string field contentId_s. Each dataset for a single contentId_s consists of multiple timecode-based documents in both indexes, which makes this a many-to-many join.
>>
>> collection1 - contains generic metadata and timecode-based content (think timecode-based comments):
>> Documents: 382,872
>> Unique contentId_s: 16,715
>> ~160 MB size
>> single shard
>>
>> collection2 - contains timecode-based GPS data (GPS position, field of view, ...; the timecodes are not related to the timecodes in collection1, so flattening the structure would blow up the number of documents to incredible numbers):
>> Documents: 695,887,875
>> Unique contentId_s: 10,199
>> ~300 GB size
>> single shard
>>
>> Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with 64 GB without much improvement) and a 1 TB SSD for the index.
>>
>> In our use case there is lots of indexing/deletion traffic on both indexes and only a few queries fired against the server.
>>
>> We are constantly indexing new content and deleting old documents. This was already getting problematic with HDDs, so we switched to SSDs; indexing speed is fine for now (we might also need to scale this up in the future to allow more throughput).
>>
>> But search speed suffers when we need to join with the big collection2 (taking up to 30 sec for the query to succeed). We had some success experimenting with score join queries when the collection2 result only returns a few unique IDs, but we can't predict that this is always the case, and if a lot of documents are hit in collection2, performance is 10x worse than with the original normal join.
>>
>> Sample queries look like this (simplified, but more complex queries are not much slower):
>>
>> Sample 1:
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> Sample 2:
>> query: coll1field:someval
>> filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 v='coll2field:otherval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> I experimented with first running the query on collection2 alone just to get the number of matching docs (collapsing on contentId_s), so we could choose the right join query based on how many results we get, but with many hits in collection2 this takes almost as long as doing the join, so slow queries would get even slower.
>>
>> Caches also don't seem to help much, since almost every query fired is different and the index usually changes between requests anyway.
>>
>> We are open to anything: adding nodes/hardware/shards, changing the index structure...
>> Currently we don't know how to get around the big join.
>>
>> Any advice in which direction we should look?
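The "score join" experiments mentioned above aren't shown in the samples; for reference, that variant is selected on the same join qparser via the score local parameter. A rough sketch of Sample 2 in that form (the values are the placeholders from the thread):

  query: coll1field:someval
  filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 score=none v='coll2field:otherval'}
  filter: {!collapse field=contentId_s min=timecode_f}

Roughly speaking, this implementation gathers the join keys from the from-side matches, which fits the observation above that it was fast when collection2 matched only a few unique contentId_s values and much slower when it matched many.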