Hi All,

Thanks for the help, but after yet another day of investigation I think I might be running into https://issues.apache.org/jira/browse/CASSANDRA-8061, where tmplink files aren't removed until Cassandra is restarted.
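For anyone hitting the same symptom, here is one way to check whether tmplink files are what is pinning the disk space. This is a sketch only: the file-name pattern is the 2.1.x one, and the real data path (an assumption: `/var/lib/cassandra/data`, per `data_file_directories` in cassandra.yaml) is stood in for by a scratch directory so the demo is self-contained.

```shell
# Demo against a scratch directory; in production, point DATA_DIR at the
# data_file_directories path from cassandra.yaml (assumed default:
# /var/lib/cassandra/data -- check your config).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/my_ks/my_table"

# Simulate a leftover 4 KB tmplink file; on a node affected by
# CASSANDRA-8061 these are compaction leftovers that linger until restart.
dd if=/dev/zero \
   of="$DATA_DIR/my_ks/my_table/my_ks-my_table-tmplink-ka-1-Data.db" \
   bs=1024 count=4 2>/dev/null

# Sum the bytes still pinned by tmplink files; a large total that only
# shrinks after a restart points at this bug.
TMPLINK_BYTES=$(find "$DATA_DIR" -name '*tmplink*' -type f -printf '%s\n' \
                | awk '{t += $1} END {print t + 0}')
echo "tmplink bytes: $TMPLINK_BYTES"
```

If the total is large and the files are hours old, that would be consistent with the restart-reclaims-space behaviour described above.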
Thanks again for all the suggestions!
Nate

--
*Nathanael Yoder*
Principal Engineer & Data Scientist, Whistle
415-944-7344 // n...@whistle.com

On Tue, Dec 9, 2014 at 10:18 AM, Nate Yoder <n...@whistle.com> wrote:

> Hi Reynald,
>
> Good idea, but I have incremental backups turned off, and other than *.db
> files nothing else appears to be in the data directory for that table.
>
> Is there any other output that would help you all help me?
>
> Thanks,
> Nate
>
> On Tue, Dec 9, 2014 at 9:27 AM, Reynald Bourtembourg
> <reynald.bourtembo...@esrf.fr> wrote:
>
>> Hi Nate,
>>
>> Are you using incremental backups?
>>
>> Extract from the documentation
>> (http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_backup_incremental_t.html):
>>
>> *When incremental backups are enabled (disabled by default), Cassandra
>> hard-links each flushed SSTable to a backups directory under the keyspace
>> data directory. This allows storing backups offsite without transferring
>> entire snapshots. Also, incremental backups combine with snapshots to
>> provide a dependable, up-to-date backup mechanism.*
>>
>> *As with snapshots, Cassandra does not automatically clear incremental
>> backup files. DataStax recommends setting up a process to clear incremental
>> backup hard-links each time a new snapshot is created.*
>>
>> These backups are stored in directories named "backups" at the same
>> level as the "snapshots" directories.
>>
>> Reynald
>>
>> On 09/12/2014 18:13, Nate Yoder wrote:
>>
>> Thanks for the advice. Totally makes sense.
>> Once I figure out how to make my data stop taking up more than 2x the
>> space without being useful, I'll definitely make the change :)
>>
>> Nate
>>
>> On Tue, Dec 9, 2014 at 9:02 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>>> Well, I personally don't like RF=2. It means if you're using CL=QUORUM
>>> and a node goes down, you're going to have a bad time (downtime). If
>>> you're using CL=ONE then you'd be OK. However, I am not wild about losing
>>> a node and having only 1 copy of my data available in prod.
>>>
>>> On Tue Dec 09 2014 at 8:40:37 AM Nate Yoder <n...@whistle.com> wrote:
>>>
>>>> Thanks Jonathan. So there is nothing too idiotic about my current
>>>> set-up with 6 boxes, each with 256 vnodes, and an RF of 2?
>>>>
>>>> I appreciate the help,
>>>> Nate
>>>>
>>>> On Tue, Dec 9, 2014 at 8:31 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>>> You don't need a prime number of nodes in your ring, but it's not a
>>>>> bad idea for it to be a multiple of your RF when your cluster is small.
>>>>>
>>>>> On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder <n...@whistle.com> wrote:
>>>>>
>>>>>> Hi Ian,
>>>>>>
>>>>>> Thanks for the suggestion, but I had actually already done that
>>>>>> prior to the scenario I described (to get myself some free space), and
>>>>>> when I ran nodetool cfstats it listed 0 snapshots as expected, so
>>>>>> unfortunately I don't think that is where my space went.
>>>>>>
>>>>>> One additional piece of information I forgot to point out is that
>>>>>> when I ran nodetool status on the node it included all 6 nodes.
>>>>>>
>>>>>> I have also heard it mentioned that I may want to have a prime
>>>>>> number of nodes, which may help protect against split-brain. Is this
>>>>>> true? If so, does it still apply when I am using vnodes?
>>>>>>
>>>>>> Thanks again,
>>>>>> Nate
>>>>>>
>>>>>> On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose <ianr...@fullstory.com> wrote:
>>>>>>
>>>>>>> Try `nodetool clearsnapshot`, which will delete any snapshots you
>>>>>>> have. I have never taken a snapshot with nodetool, yet I recently
>>>>>>> found several snapshots on my disk (which can take a lot of space).
>>>>>>> So perhaps they are automatically generated by some operation? No
>>>>>>> idea. Regardless, nuking those freed up a ton of space for me.
>>>>>>>
>>>>>>> - Ian
>>>>>>>
>>>>>>> On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder <n...@whistle.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am new to Cassandra so I apologise in advance if I have missed
>>>>>>>> anything obvious, but this one currently has me stumped.
>>>>>>>>
>>>>>>>> I am currently running a 6-node Cassandra 2.1.1 cluster on EC2
>>>>>>>> using c3.2xlarge nodes, which overall is working very well for us.
>>>>>>>> However, after letting it run for a while I seem to get into a
>>>>>>>> situation where the amount of disk space used far exceeds the total
>>>>>>>> amount of data on each node, and I haven't been able to get the
>>>>>>>> size to go back down except by stopping and restarting the node.
>>>>>>>>
>>>>>>>> For example, almost all of my data is in one table. On one of my
>>>>>>>> nodes right now, the total space used (as reported by nodetool
>>>>>>>> cfstats) is 57.2 GB and there are no snapshots.
>>>>>>>> However, when I look at the size of the data files (using du), the
>>>>>>>> data file for that table is 107 GB. Because a c3.2xlarge has only
>>>>>>>> 160 GB of SSD, you can see why this quickly becomes a problem.
>>>>>>>>
>>>>>>>> Running nodetool compact didn't reduce the size, and neither does
>>>>>>>> running nodetool repair -pr on the node. I also tried nodetool
>>>>>>>> flush and nodetool cleanup (even though I have not added or removed
>>>>>>>> any nodes recently), but neither changed anything. In order to keep
>>>>>>>> my cluster up, I then stopped and started that node, and the size
>>>>>>>> of the data file dropped to 54 GB while the total column family
>>>>>>>> size (as reported by nodetool) stayed about the same.
>>>>>>>>
>>>>>>>> Any suggestions as to what I could be doing wrong?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nate
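On Jonathan's RF=2 point earlier in the thread: QUORUM requires floor(RF/2) + 1 replicas, so with RF=2 a quorum is *both* replicas, and a single down node makes QUORUM operations fail for that node's token ranges, while RF=3 tolerates one node down. A minimal sketch of the arithmetic (not Cassandra code, just the formula):

```shell
# Replicas required for a QUORUM read/write: floor(RF/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

for rf in 2 3 5; do
  q=$(quorum "$rf")
  echo "RF=$rf: QUORUM needs $q replica(s), tolerates $(( rf - q )) down"
done
```

This is why RF=3 is the common recommendation: it is the smallest RF where QUORUM survives the loss of a node.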