[
https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111065#comment-15111065
]
Yonik Seeley commented on SOLR-8129:
------------------------------------
Here's a visualization of a recent fail:
Node A starts off as the leader and gets a bunch of updates that it never sends
to node B before it is killed.
Node B becomes the leader.
Node A comes back up and does a PeerSync. Looking only at the low threshold /
high threshold, the two version lists pretty much overlap in time, so node A
just asks node B for the docs it's missing (and ends up with a lot more docs
than node B).
The list below is ordered from oldest to newest; the letters after each version
show which node(s) have it:
{code}
1523440456046739456 B
1523440456047788032 B
1523440456049885184 B
1523440456050933760 B
1523440456051982336 B
1523440456053030912 B
1523440456053030913 B
1523440456053030914 B
1523440456054079488 B
1523440456055128064 B
1523440456059322368 B
1523440456314126336 B
1523440456314126337 B
1523440456315174912 B
1523440456316223488 B
1523440456318320640 B
1523440456342437888 B
1523440456343486464 B
1523440456343486465 B
1523440456344535040 B
1523440456362360832 B
1523440456363409408 A
1523440456372846592 A B
1523440456375992320 A B
1523440456375992321 A B
1523440456379138048 A
1523440456381235200 A
1523440456382283776 A B
1523440456392769536 A
1523440456401158144 A
1523440456403255296 A B
1523440456437858304 A
1523440456463024128 A
1523440456472461312 A
1523440456480849920 A
1523440456531181568 A
1523440456543764480 A
1523440456544813056 A
1523440456544813057 A
1523440456545861632 A
1523440456550055936 A B
1523440456552153088 A B
1523440456552153089 A
1523440456559493120 A B
1523440456561590272 A B
1523440456561590273 A B
1523440456562638848 A B
1523440456563687424 A B
1523440456565784576 A
1523440456609824768 A
1523440456610873344 A
1523440456610873345 A
1523440456611921920 A
1523440456669593600 A
1523440456669593601 A
1523440456669593602 A
1523440456670642176 A
1523440456671690752 A B
1523440456672739328 A
1523440456673787904 A
1523440456674836480 A
1523440456675885056 A
1523440456686370816 A
1523440456690565120 A
1523440456702099456 A B
1523440456726216704 A B
1523440456772354048 A B
1523440456785985536 A B
1523440456826880000 A B
1523440456857288704 A B
1523440456858337280 A B
1523440456921251840 A
1523440456921251841 A
1523440456922300416 A
1523440456926494720 A B
1523440456926494721 A B
1523440456927543296 A
1523440456927543297 A
1523440456929640448 A B
1523440456929640449 A
1523440456934883328 A
1523440456944320512 A
1523440456950611968 A
1523440456975777792 A
1523440456975777793 A
1523440456975777794 A
1523440456976826368 A
1523440456976826369 A
1523440456976826370 A
1523440456999895040 A
1523440457004089344 A
1523440457008283648 A
1523440457009332224 A
1523440457009332225 A
1523440457010380800 A
1523440457056518144 A B
1523440457064906752 A B
1523440457065955328 A B
1523440457067003904 A B
1523440457070149632 A B
1523440457071198208 A B
1523440457071198209 A B
1523440457074343936 A B
1523440457077489664 A B
1523440457078538240 A B
1523440457079586816 A B
1523440457080635392 A B
1523440457116286976 A
1523440457116286977 A
1523440457117335552 A
1523440457138307072 A
1523440457149841408 A
1523440457170812928 A
1523440457172910080 A
1523440457173958656 A
1523440457173958657 A
1523440457175007232 A
1523440457175007233 A
1523440457180250112 A
1523440457181298688 A
1523440457181298689 A
1523440460638453760 B
1523440460641599488 B
1523440460641599489 B
1523440460653133824 B
1523440460708708352 B
1523440460881723392 B
1523440460915277824 B
1523440461056835584 B
1523440461057884160 B
1523440461145964544 B
1523440461206781952 B
1523440461227753472 B
1523440461237190656 B
1523440461259210752 B
1523440461272842240 B
1523440461370359808 B
1523440461379796992 B
1523440461486751744 B
1523440461550714880 B
1523440461615726592 B
1523440461659766784 B
1523440461713244160 B
1523440461754138624 B
1523440461787693056 B
1523440461817053184 B
1523440461862141952 B
1523440461881016320 B
1523440461917716480 B
1523440461939736576 B
1523440461953368064 B
1523440461987971072 B
1523440462001602560 B
1523440462224949248 B
1523440462292058112 B
1523440462313029632 B
1523440462325612544 B
1523440462379089920 B
1523440462421032960 B
1523440462461927424 B
1523440462486044672 B
1523440462501773312 B
1523440462545813504 B
1523440474431422464 B
{code}
These are massive reorders, which PeerSync was not designed to handle.
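To make the failure mode concrete, here is a minimal sketch of a threshold-only overlap check. It is a simplified model for illustration, not the actual PeerSync code: the function names, the percentile cut-offs (`low_frac`, `high_frac`) and the plain-integer version lists are all assumptions.

```python
def percentile(versions_desc, frac):
    """Pick the entry at the given fraction of a newest-first version list."""
    index = min(int(len(versions_desc) * frac), len(versions_desc) - 1)
    return versions_desc[index]


def windows_overlap(ours, theirs, low_frac=0.8, high_frac=0.2):
    """Return True if the two version windows overlap in time.

    Only the low/high threshold of each (non-empty) list is compared,
    so gaps inside the overlapping range go unnoticed.
    The cut-off fractions are hypothetical defaults.
    """
    ours = sorted(ours, reverse=True)
    theirs = sorted(theirs, reverse=True)
    our_low, our_high = percentile(ours, low_frac), percentile(ours, high_frac)
    their_low, their_high = percentile(theirs, low_frac), percentile(theirs, high_frac)
    return our_high >= their_low and their_high >= our_low


def missing_from(ours, theirs):
    """Versions the other node has that we do not, oldest first."""
    return sorted(set(theirs) - set(ours))
```

With lists shaped like the ones above (B holding older and newer versions plus only a few from the middle, A holding a dense middle), `windows_overlap` reports an overlap, A fetches `missing_from(a, b)`, and A ends up with more versions than B without either side noticing the disagreement inside the window.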
Possible remedies:
- greatly lower the probability of these big reorders
- where there is overlap in versions, make PeerSync check that it is "dense"
(both shards have all docs in the overlap)
-- this seems extremely strict and could cause PeerSync to fail due to a
missing doc right at the end of an overlap... *which* end matters a lot.
- expand PeerSync to cover the complete index
-- use hashes over *all* versions in the index
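The last two remedies can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed patch: `overlap_is_dense` and `versions_fingerprint` are hypothetical helpers, versions are treated as signed 64-bit integers, and SHA-256 is just one possible choice of hash.

```python
import hashlib


def overlap_is_dense(ours, theirs):
    """Return True if both nodes hold exactly the same versions
    inside the range where their version lists overlap."""
    ours, theirs = set(ours), set(theirs)
    if not ours or not theirs:
        return True  # nothing to compare
    low = max(min(ours), min(theirs))
    high = min(max(ours), max(theirs))
    ours_in_range = {v for v in ours if low <= v <= high}
    theirs_in_range = {v for v in theirs if low <= v <= high}
    return ours_in_range == theirs_in_range


def versions_fingerprint(versions):
    """Order-independent hash over all versions a node holds.

    Assumes each version fits in a signed 64-bit integer.
    """
    digest = hashlib.sha256()
    for version in sorted(versions):
        digest.update(version.to_bytes(8, "big", signed=True))
    return digest.hexdigest()
```

In this sketch the dense check fails as soon as a single version is missing anywhere in the overlap, including at either end, which is the strictness concern noted above. Comparing fingerprints over the whole index only says whether the two nodes agree, not which docs differ.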
> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
> Key: SOLR-8129
> URL: https://issues.apache.org/jira/browse/SOLR-8129
> Project: Solr
> Issue Type: Bug
> Reporter: Yonik Seeley
> Attachments: fail.151005_064958, fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures,
> including shard inconsistency.