[ https://issues.apache.org/jira/browse/SOLR-16108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509773#comment-17509773 ]
Marc Brette commented on SOLR-16108:
------------------------------------

I spent some time analyzing this but did not have the bandwidth to fix it (and finally decided to use id-based routing instead of router.field-based routing). These notes could be helpful if anyone would like to tackle it.

There are actually 2 issues, and maybe more down the line.
* First, in org.apache.solr.handler.admin.SplitOp#getHashHistogramFromId:
  * this code computes the hash ranges that the shards should have after the split
  * the code does not use router.field. It also assumes uniqueness of the terms in the field used for computing the hash (a simple fix is to use termsEnum.docFreq() there)
* Second, in org.apache.solr.update.SolrIndexSplitter#split:
  * this code actually performs the split based on the hash ranges computed above
  * here again, even though it looks up the router.field, the logic to find the hash of the documents is incorrect.

> Incorrect distribution of records in shards after a split with
> splitByKeyprefix,when using the CompositeId router with a router field defined
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16108
>                 URL: https://issues.apache.org/jira/browse/SOLR-16108
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>   Affects Versions: 8.4
>            Reporter: Marc Brette
>            Priority: Major
>
> When a collection is created using the CompositeId router with a router field
> defined, one of its shards contains records that all have the same routing key,
> and that shard is split with the splitByPrefix parameter, we expect the records
> to be uniformly distributed between the two resulting shards.
> Instead, one shard contains no records and the other contains all of them.
> Steps to reproduce:
> {code:java}
> docker network create solr-network
>
> # run in one terminal
> docker run -it -h solr1 --name solr1 --net solr-network -p 18983:8983 solr:8.4 /opt/solr/bin/solr -c -f
>
> # run in another terminal
> docker run -it -h solr2 --name solr2 --net solr-network -p 28983:8983 solr:8.4 /opt/solr/bin/solr -c -f -z solr1:9983
>
> #-----------------------------------------------------------------------------------------------
> # Works, documents are split between the 2 shards
>
> # Create collection with default compositeId router, routing key in the id, only one shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=CREATE&name=routing_by_id&numShards=1'
>
> # Create enough documents, they all have the same routing key (france!)
> for i in {0..100}
> do
>   curl --request POST \
>     --url 'http://localhost:18983/solr/routing_by_id/update/json/docs?commit=true' \
>     --header 'Content-Type: application/json' \
>     --data "[{
>       \"id\": \"france\!${i}0\",
>       \"title_t\": \"hi\"
>     }]"
> done
>
> # Check it is indexed correctly
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*'
>
> # Split the shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=SPLITSHARD&collection=routing_by_id&shard=shard1&splitByPrefix=true'
>
> # Check records in shard1_0 (~half of the documents there)
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*&shards=shard1_0'
>
> # Check records in shard1_1 (~half of the documents there)
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*&shards=shard1_1'
>
> #-----------------------------------------------------------------------------------------------
> # Fails, documents are not split between the 2 shards
>
> # Create collection with default compositeId router, routing key in the field "route_t", only one shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=CREATE&name=routing_by_field&numShards=1&router.field=route_t'
>
> # Create enough documents, they all have the same routing key (france)
> for i in {0..100}
> do
>   curl --request POST \
>     --url 'http://localhost:18983/solr/routing_by_field/update/json/docs?commit=true' \
>     --header 'Content-Type: application/json' \
>     --data "[{
>       \"id\": \"${i}0\",
>       \"title_t\": \"hi\",
>       \"route_t\": \"france\"
>     }]"
> done
>
> # Check it is indexed correctly
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*'
>
> # Split the shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=SPLITSHARD&collection=routing_by_field&shard=shard1&splitByPrefix=true'
>
> # Check records in shard1_0: no document!
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*&shards=shard1_0'
>
> # Check records in shard1_1: all documents!
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*&shards=shard1_1'
> {code}
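
To compare the two cases without reading through the full query output, the per-shard document counts can be read from the `numFound` value of each response. The following is only a minimal sketch and not part of the original report: `num_found` and `shard_count` are hypothetical helper names, and it assumes the JSON response writer plus the host, port, collection, and shard names used in the steps above. Adding `rows=0` asks Solr for the count without returning documents.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the result of a shard split.
# They are illustrative only and are not part of Solr or of the original report.

# Read a Solr JSON response on stdin and print the value of "numFound".
num_found() {
  grep -oE '"numFound"[[:space:]]*:[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}

# Print the number of documents held by one shard of a collection.
# Arguments: base URL, collection name, shard name.
shard_count() {
  curl --silent --request GET \
    --url "$1/$2/select?q=*%3A*&rows=0&wt=json&shards=$3" | num_found
}

# Example, against the cluster from the steps above:
#   for s in shard1_0 shard1_1; do
#     echo "$s: $(shard_count http://localhost:18983/solr routing_by_field "$s")"
#   done
```

Each loop in the steps above indexes 101 documents (`{0..100}`), so `routing_by_id` should report roughly half of them in each sub-shard, whereas `routing_by_field` is expected to show the faulty split described in the issue, with one sub-shard empty and the other holding everything.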