For the other DC it can be acceptable, because each partition resides on one
node, so if you have a large partition it may skew things a bit.
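A quick way to check whether a single large partition is behind the skew, just
a sketch with placeholder keyspace and table names, is to look at the partition
size histogram on each node:

    nodetool cfhistograms my_keyspace my_table

The Max row of the "Partition Size" column shows the largest partition that
node holds; comparing it across the nodes would show whether one oversized
partition accounts for the difference. (On newer nodetool versions the same
command is also available as tablehistograms.)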
On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:

> So I guess the problem may have been with the initial addition of the
> 10.128.0.20 node, because when I added it, it never synced data, I guess?
> It was at around 50 MB when it first came up and transitioned to "UN".
> After it was in, I did the 1 -> 2 replication change and tried repair, but
> it didn't fix it. From what I can tell, all the data on it is stuff that
> has been written since it came up. We never delete data, ever, so we
> should have zero tombstones.
>
> If I am not mistaken, only two of my nodes actually have all the data,
> 10.128.0.3 and 10.142.0.14, since they agree on the data amount.
> 10.142.0.13 is almost a GB lower, and then of course 10.128.0.20, which is
> missing over 5 GB of data. I tried running nodetool repair -local on both
> DCs and it didn't fix either one.
>
> Am I running into a bug of some kind?
>
> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
>> Hi Luke,
>>
>> You mentioned that the replication factor was increased from 1 to 2. In
>> that case, was the node bearing IP 10.128.0.20 carrying around 3 GB of
>> data earlier?
>>
>> You can run nodetool repair with the -local option to initiate a repair
>> local to the gce-us-central1 datacenter.
>>
>> You may also suspect that, if a lot of data was deleted while the node
>> was down, it may be holding a lot of tombstones that do not need to be
>> replicated to the other node. To verify this, you can issue a select
>> count(*) query on the column families (with the amount of data you have
>> it should not be an issue) with tracing on and with consistency
>> local_all, connecting to either 10.128.0.3 or 10.128.0.20, and store the
>> output in a file. It will give you a fair idea of how many deleted cells
>> the nodes have. I tried searching for a reference on whether tombstones
>> are moved around during repair, but I didn't find evidence of it.
>> However, I see no reason why they would be, because if the node didn't
>> have the data then streaming tombstones does not make a lot of sense.
>>
>> Regards,
>> Bhuvan
>>
>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com> wrote:
>>
>>> Here's my setup:
>>>
>>> Datacenter: gce-us-central1
>>> ===========================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
>>> UN  10.128.0.3   6.4 GB     256     100.0%            3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>>> UN  10.128.0.20  943.08 MB  256     100.0%            958348cb-8205-4630-8b96-0951bf33f3d3  default
>>>
>>> Datacenter: gce-us-east1
>>> ========================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
>>> UN  10.142.0.14  6.4 GB     256     100.0%            c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>>> UN  10.142.0.13  5.55 GB    256     100.0%            d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>>
>>> And my replication settings are:
>>>
>>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2',
>>> 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>>
>>> As you can see, 10.128.0.20 in the gce-us-central1 DC only has a load of
>>> 943 MB, even though it's supposed to own 100% and should have 6.4 GB.
>>> 10.142.0.13 also seems not to have everything, as it only has a load of
>>> 5.55 GB.
>>>
>>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com> wrote:
>>>
>>>> Do you have 1 node in each DC or 2? If you're saying you have 1 node in
>>>> each DC, then an RF of 2 doesn't make sense. Can you clarify what your
>>>> setup is?
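The checks Bhuvan suggests above would look roughly like the following. This
is only a sketch: my_keyspace and my_table are placeholders for the real
names, and since cqlsh has no LOCAL_ALL consistency level, ALL is used as the
closest match:

    # DC-local repair, run on one of the gce-us-central1 nodes
    nodetool repair -local

    # count rows with tracing enabled to see how many tombstone cells get read
    cqlsh 10.128.0.3
    cqlsh> CONSISTENCY ALL;
    cqlsh> TRACING ON;
    cqlsh> SELECT count(*) FROM my_keyspace.my_table;

With tracing on, the trace printed after the query should include messages
along the lines of "Read N live rows and M tombstone cells" per replica, which
is the tombstone count being discussed.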
>>>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>>>
>>>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and
>>>>> gce-us-east1. I increased the replication factor of gce-us-central1
>>>>> from 1 to 2. Then I ran 'nodetool repair -dc gce-us-central1'. The
>>>>> "Owns" for the node switched to 100% as it should, but the Load showed
>>>>> that it didn't actually sync the data. I then ran a full 'nodetool
>>>>> repair' and it still didn't fix it. This scares me, as I thought
>>>>> 'nodetool repair' was a way to assure consistency and that all the
>>>>> nodes were synced, but that doesn't seem to be the case. Outside of
>>>>> that command, I have no idea how I would assure all the data was
>>>>> synced, or how to get the data correctly synced, without
>>>>> decommissioning the node and re-adding it.
>>>>
>>>> --
>>>> Kurt Greaves
>>>> k...@instaclustr.com
>>>> www.instaclustr.com
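For completeness, the replication change and repairs described in Luke's first
message would look roughly like this. It is a sketch only: my_keyspace is a
placeholder, and the aws-us-west entry is carried over from the replication
settings quoted earlier in the thread:

    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'aws-us-west': '2',
                          'gce-us-central1': '2',
                          'gce-us-east1': '2'};

    nodetool repair -dc gce-us-central1   # repair restricted to one DC
    nodetool repair                       # subsequent cluster-wide repair

Repair run on a node covers the token ranges that node replicates and
synchronizes all replicas of those ranges; the -dc and -local options only
restrict which datacenters take part.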