Hi Joel,

Thanks for the hint. I can confirm that joins are much faster with this optimization. I also tried the same collection with 8 shards vs. a single shard, all on a single Solr node. Sharding seems to improve search speed in our scenario even on a single node.
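In case it is useful to others, this is roughly how we create the routed collection (collection name and host are placeholders, so treat this as a sketch rather than our exact command). router.field makes Solr hash on contentId_s instead of the uniqueKey, so both document types for the same content land on the same shard and each per-shard join sees complete data:

  # Collections API: create an 8-shard collection routed by the join key
  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=content&numShards=8&replicationFactor=1&router.name=compositeId&router.field=contentId_s"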
I think we will experiment with adding more Solr nodes with replicas next. Does anyone have insights into how the join is done internally? My understanding is that every shard does the join on its own and returns joined results that are then combined into the overall result; is this correct? Under this assumption we would expect that adding more nodes/servers could improve query speed. We also found that we needed to add more heap memory, otherwise we got an OOM in some cases. Will the memory requirements of a single node change when adding more nodes with replicas?

Best Regards
Jens

-----Original Message-----
From: Joel Bernstein <joels...@gmail.com>
Sent: Monday, May 3, 2021 14:42
To: users@solr.apache.org
Subject: Re: join with big 2nd collection

Here is the jira that describes it:

https://issues.apache.org/jira/browse/SOLR-15049

The description and comments don't convey the full performance impact of this optimization. When you have a large number of join keys, the impact is massive.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, May 3, 2021 at 8:38 AM Joel Bernstein <joels...@gmail.com> wrote:
> If you are using the latest version of Solr, there is a new optimized
> self join which is very fast. To use it you must join within the same core
> and use the same join field (the to and from fields must be the same). The
> optimization should kick in on its own with the join qparser plugin
> and will be orders of magnitude faster for large joins.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 30, 2021 at 11:37 AM Jens Viebig <jens.vie...@vitec.com> wrote:
>
>> Tried something today which seems promising.
>>
>> I have put all documents from both cores into the same collection,
>> sharded the collection into 8 shards, and routed the documents so that
>> all documents with the same contentId_s end up on the same shard.
>> To distinguish between document types we use a string field as an
>> identifier: doctype_s:col1, doctype_s:col2. (Btw, what would be the
>> best data type for a doc identifier that is fast to filter on? See the
>> sketch at the end of this mail.)
>> It seems a join inside the same core is a) much more efficient and b)
>> works with a sharded index. We are currently still running this
>> on a single instance and get reasonable response times of ~1 sec, which
>> would be OK for us and is a big improvement over the old state.
>>
>> - Why is the join that much faster? Is it because of the sharding, or
>> also because of the same core?
>> - How can we expect this to scale when adding more documents
>> (probably by adding Solr instances/shards/replicas on additional
>> servers)? Doubling/tripling/... the amount of docs
>> - Would you expect query times to improve with additional servers and
>> Solr instances?
>> - What would be the best data type for a doc identifier that is fast
>> to filter on, to distinguish between different document types in the
>> same collection?
>>
>> What I don't like about this solution is that we lose the ability to
>> completely reindex a single "document type". For example,
>> collection1 was pretty fast to completely reindex and its schema
>> possibly changes more often, while collection2 is "index once / delete
>> after x days" and is heavy to reindex.
>>
>> Best Regards
>> Jens
>>
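>> For the doc-type identifier we currently use a plain single-valued
>> string field, applied as a filter query. A sketch of the field
>> definition and of the combined query we run now (docValues="true" is
>> an assumption on our side, not something we have benchmarked):
>>
>> <!-- managed-schema: single-valued string, indexed for fast fq filtering -->
>> <field name="doctype_s" type="string" indexed="true" stored="true" docValues="true"/>
>>
>> query: doctype_s:col1 AND (coll1field:someval OR {!join
>> from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:someval'})
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> With from and to being the same field and no fromIndex, the join stays
>> inside each core, which seems to be what makes it fast.
>>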
>> -----Original Message-----
>> From: Jens Viebig <jens.vie...@vitec.com>
>> Sent: Wednesday, April 28, 2021 19:12
>> To: users@solr.apache.org
>> Subject: join with big 2nd collection
>>
>> Hi List,
>>
>> We have a join performance issue and are not sure in which direction
>> we should look to solve it.
>> We currently only have a single-node setup.
>>
>> We have 2 collections on which we run join queries, joined by a "primary key"
>> string field contentId_s. Each dataset for a single contentId_s
>> consists of multiple timecode-based documents in both indexes, which
>> makes this a many-to-many query.
>>
>> collection1 - contains generic metadata and timecode-based content
>> (think timecode-based comments):
>> Documents: 382,872
>> Unique contentId_s: 16,715
>> ~160 MB size
>> single shard
>>
>> collection2 - contains timecode-based GPS data (GPS position, field
>> of view, ...; the timecodes are not related to the timecodes in collection1, so
>> flattening the structure would blow up the number of documents to incredible
>> numbers):
>> Documents: 695,887,875
>> Unique contentId_s: 10,199
>> ~300 GB size
>> single shard
>>
>> Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with
>> 64 GB without much improvement) and a 1 TB SSD for the index.
>>
>> In our use case there is lots of indexing/deletion traffic on both
>> indexes and only a few queries fired against the server.
>>
>> We are constantly indexing new content and deleting old documents.
>> This was already getting problematic with HDDs, so we switched to
>> SSDs; indexing speed is fine for now (we might also need to scale
>> this up in the future to allow more throughput).
>>
>> But search speed suffers when we need to join with the big
>> collection2 (taking up to 30 sec for the query to succeed). We had
>> some success experimenting with score join queries when the collection2
>> side returns only a few unique ids, but we can't predict that this
>> is always the case, and if a lot of documents are hit in collection2,
>> performance is 10x worse than with the normal join (a sketch of the
>> score join variant is at the end of this mail).
>>
>> Sample queries look like this (simplified, but more complex queries
>> are not much slower):
>>
>> Sample 1:
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s
>> fromIndex=collection2 v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> Sample 2:
>> query: coll1field:someval
>> filter: {!join from=contentId_s to=contentId_s fromIndex=collection2
>> v='coll2field:otherval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>> I experimented with first running the query on collection2 alone, only
>> to get the number of hits (collapsing on contentId_s) and see how many
>> results we get, so we could choose the right join query; but with
>> many hits in collection2 this takes almost as long as doing the join,
>> so slow queries would get even slower.
>>
>> Caches also don't seem to help much, since almost every query fired is
>> different and the index mostly changes between requests anyway.
>>
>> We are open to anything: adding nodes/hardware/shards/changing the
>> index structure...
>> Currently we don't know how to get around the big join.
>>
>> Any advice on which direction we should look?
>>
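>> For reference, the score join variant we experimented with looks
>> roughly like this; the score parameter switches to the score-join
>> implementation, and score=none skips scoring and just filters by the
>> joined keys (a sketch from memory, so the exact parameters may need
>> checking; the field names match the samples above):
>>
>> query: coll1field:someval OR {!join from=contentId_s to=contentId_s
>> fromIndex=collection2 score=none v='coll2field:someval'}
>> filter: {!collapse field=contentId_s min=timecode_f}
>>
>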