[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13399:
----------------------------
    Attachment: ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
        Status: Reopened  (was: Reopened)


git bisect has identified commit 19ddcfd282f3b9eccc50da83653674e510229960 as
the cause of the recent (reproducible) Jenkins test failures in ShardSplitTest:

https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-NightlyTests-8.x/174/
https://builds.apache.org/view/L/view/Lucene/job/Lucene-Solr-repro/3507/

(Jenkins found the failures on branch_8x, but I was able to reproduce them with
the exact same seed on master, and used that branch for bisecting.  Attaching
logs from my local master run.)

{noformat}
ant test -Dtestcase=ShardSplitTest -Dtests.method=test -Dtests.seed=AE04B5C9BA6E9A4 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT-11 -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
{noformat}

{noformat}
   [junit4] FAILURE  273s J2 | ShardSplitTest.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Wrong doc count on shard1_0. See SOLR-5309 expected:<257> but was:<316>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([AE04B5C9BA6E9A4:82B47486355A845C]:0)
   [junit4]    >        at org.apache.solr.cloud.api.collections.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:1002)
   [junit4]    >        at org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:794)
   [junit4]    >        at org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:111)
   [junit4]    >        at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1082)
   [junit4]    >        at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1054)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
{noformat}


> compositeId support for shard splitting
> ---------------------------------------
>
>                 Key: SOLR-13399
>                 URL: https://issues.apache.org/jira/browse/SOLR-13399
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>            Priority: Major
>             Fix For: 8.3
>
>         Attachments: SOLR-13399.patch, SOLR-13399.patch, 
> SOLR-13399_testfix.patch, SOLR-13399_useId.patch, 
> ShardSplitTest.master.seed_AE04B5C9BA6E9A4.log.txt
>
>
> Shard splitting does not currently have a way to automatically take into 
> account the actual distribution (number of documents) in each hash bucket 
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* 
> command that would look at the number of docs sharing each compositeId prefix 
> and use that to create roughly equal sized buckets by document count rather 
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash 
> buckets unless necessary (since that leads to larger query fanout).  Perhaps 
> this warrants a parameter that would control how much of a size mismatch is 
> tolerable before resorting to splitting within a bucket. 
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index 
> the prefix in a different field.  Iterating over the terms for this field 
> would quickly give us the number of docs in each (i.e., Lucene keeps track of 
> the doc count for each term already).  Perhaps the implementation could be a 
> flag on the *id* field... something like *indexPrefixes* and poly-fields that 
> would cause the indexing to be automatically done and alleviate having to 
> pass in an additional field during indexing and during the call to 
> *SPLITSHARD*.  This whole part is an optimization though and could be split 
> off into its own issue if desired.
>  
>  
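
The splitByPrefix idea in the description above can be illustrated with a small standalone sketch. This is not Solr's implementation: the class and method names (SplitByPrefixSketch, chooseSplitPoint, sizeDifference) are hypothetical, and the sketch assumes the per-prefix doc counts have already been collected and are listed in hash-range order. It picks the bucket boundary that best balances the two sub-shards, and reports the relative mismatch that a tolerance such as the proposed *allowedSizeDifference* might be compared against.

```java
/**
 * Hypothetical sketch of choosing a split point on compositeId prefix
 * bucket boundaries. Not Solr code; names and behavior are assumptions.
 */
class SplitByPrefixSketch {

    /**
     * Returns how many leading buckets go to the first sub-shard, choosing
     * the boundary that minimizes the doc-count difference between the two
     * sub-shards. docCounts must be in hash-range order (assumption).
     */
    static int chooseSplitPoint(long[] docCounts) {
        if (docCounts.length < 2) {
            throw new IllegalArgumentException("need at least two buckets to split on a boundary");
        }
        long total = 0;
        for (long c : docCounts) {
            total += c;
        }
        long left = 0;
        int best = 1;
        long bestDiff = Long.MAX_VALUE;
        // Only consider boundaries that leave at least one bucket on each side.
        for (int i = 1; i < docCounts.length; i++) {
            left += docCounts[i - 1];
            long diff = Math.abs(total - 2 * left); // |left - right|
            if (diff < bestDiff) {
                bestDiff = diff;
                best = i;
            }
        }
        return best;
    }

    /** Relative mismatch |left - right| / total for the given split point. */
    static double sizeDifference(long[] docCounts, int splitPoint) {
        long left = 0;
        long total = 0;
        for (int i = 0; i < docCounts.length; i++) {
            total += docCounts[i];
            if (i < splitPoint) {
                left += docCounts[i];
            }
        }
        if (total == 0) {
            return 0.0;
        }
        return Math.abs(total - 2 * left) / (double) total;
    }
}
```

For bucket counts {120, 80, 90, 110} the sketch splits after the second bucket, giving 200 docs on each side. For a skewed case such as {10, 500, 20}, the best boundary still leaves a mismatch of roughly 92%, which is the kind of situation where splitting within a bucket might be preferred.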



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
