Tried something today which seems promising.

I have put all documents from both cores into one collection, sharded into 8 shards, 
and routed the documents so that all documents with the same contentId_s end up on 
the same shard.
To distinguish between document types we added a string field with an identifier: 
doctype_s:col1 / doctype_s:col2
(Btw., what would be the best field type for such a doc type identifier that is fast 
to filter on?)
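
Roughly, the setup looks like this (the collection name and the exact CREATE parameters here are only an illustration/sketch, not literally what we ran):

/admin/collections?action=CREATE&name=combined&numShards=8&router.name=compositeId&router.field=contentId_s

and every document carries the type marker plus the routing key, e.g.:

{ "id": "...", "contentId_s": "abc123", "doctype_s": "col1", ... }
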
It seems the join inside the same core is a) much more efficient and b) works with a 
sharded index (roughly as sketched below).
We are currently still running this on a single instance and see reasonable response 
times of ~1 sec, which would be OK for us and a big improvement over the old state.
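
For illustration, the same-collection join now looks roughly like this (only a sketch, adapted from the samples in my original mail below):

query: doctype_s:col1 AND (coll1field:someval OR {!join from=contentId_s to=contentId_s v='doctype_s:col2 AND coll2field:someval'})
filter: {!collapse field=contentId_s min=timecode_f}

i.e. the same join syntax as before, just without fromIndex and with doctype_s filters on both sides.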

- Why is the join so much faster? Is it because of the sharding, or also because both 
document types are now in the same core?
- How can we expect this to scale when adding more documents (probably by adding Solr 
instances/shards/replicas on additional servers)? Doubling/tripling/... the number of 
docs.
- Would you expect query times to improve with additional servers and Solr instances?
- What would be the best field type for a doc identifier that is fast to filter on, 
to distinguish between different document types in the same collection?

What I don't like about this solution is that we lose the possibility to completely 
reindex a single "document type". For example, collection1 was pretty fast to 
completely reindex and its schema possibly changes more often, while collection2 is 
"index once / delete after x days" and is heavy to reindex.

Best Regards
Jens

-----Original Message-----
From: Jens Viebig <jens.vie...@vitec.com> 
Sent: Wednesday, 28 April 2021 19:12
To: users@solr.apache.org
Subject: join with big 2nd collection

Hi List,

We have a join performance issue and are not sure in which direction we should look 
to solve it.
We currently only have a single-node setup.

We have 2 collections on which we run join queries, joined by a "primary key" string 
field contentId_s. Each dataset for a single contentId_s consists of multiple 
timecode-based documents in both indexes, which makes this a many-to-many join.

collection1 - contains generic metadata and timecode based content (think 
timecode based comments)
Documents: 382.872
Unique contentId_s: 16715
~ 160MB size
single shard

collection2 - contains timecode-based GPS data (GPS position, field of view, ...; the 
timecodes are not related to the timecodes in collection1, so flattening the 
structure would blow up the number of documents to incredible numbers):
Documents: 695.887.875
Unique contentId_s: 10199
~ 300 GB size
single shard

Hardware is an HP DL360 with 32 GB of RAM (also tried on a machine with 64 GB without 
much improvement) and a 1 TB SSD for the index.

In our use case there is a lot of indexing/deletion traffic on both indexes and only 
a few queries are fired against the server.

We are constantly indexing new content and deleting old documents. This was already 
getting problematic with HDDs, so we switched to SSDs; indexing speed is fine for now 
(we might also need to scale this up in the future to allow more throughput).

But search speed suffers when we need to join with the big collection2 (taking up to 
30 sec for the query to succeed). We had some success experimenting with score join 
queries when the collection2 side only returns a few unique IDs, but we can't predict 
that this is always the case, and if a lot of documents are hit in collection2, 
performance is 10x worse than with the original join.
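
For reference, the score join variant we experimented with looks roughly like this (simplified sketch; score=none is just one of the possible settings):

query: coll1field:someval
filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 score=none v='coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}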

Sample queries look like this (simplified, but more complex queries are not 
much slower):

Sample 1:
query: coll1field:someval OR {!join from=contentId_s to=contentId_s 
fromIndex=collection2 v='coll2field:someval'}
filter: {!collapse field=contentId_s min=timecode_f}

Sample 2:
query: coll1field:someval
filter: {!join from=contentId_s to=contentId_s fromIndex=collection2 
v='coll2field:otherval'}
filter: {!collapse field=contentId_s min=timecode_f}


I experimented with running the query on collection2 alone first, just to get the 
numdocs (collapsing on contentId_s) and see how many results we get, so we could 
choose the right join query. But with many hits in collection2 this takes almost the 
same time as doing the join itself, so slow queries would get even slower.
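
The pre-query was roughly something like this (simplified), run against collection2 with rows=0 and then reading numFound to pick the join variant:

query: coll2field:otherval
filter: {!collapse field=contentId_s}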

Caches also don't seem to help much, since almost every query fired is different and 
the index is mostly changing between requests anyway.

We are open to anything: adding nodes/hardware/shards, changing the index 
structure...
Currently we don't know how to get around the big join.

Any advice on which direction we should look in?
