CPU hotspot at BloomFilterSerializer#deserialize

2013-01-30 Thread Takenori Sato
Hi all,

We have a situation where the CPU load on some of the nodes in our cluster has
spiked occasionally since last November. The spikes are triggered by requests
for rows that reside on two specific sstables.

We confirmed the following (during the spikes):

version: 1.0.7(current) <- 0.8.6 <- 0.8.5 <- 0.7.8
jdk: Oracle 1.6.0

1. Profiling showed that BloomFilterSerializer#deserialize was the
hotspot (70% of the total load across running threads)

* the stack trace looked like this (simplified):
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
...
89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
66.7% - java.io.DataInputStream.readLong

2. Normally, item 1 should be so fast that a sampling profiler cannot even detect it

3. No pressure on Cassandra's JVM heap or on the machine overall

4. Only light I/O traffic for our 8 disks/node (up to 100 tps/disk according
to "iostat 1 1000")

5. The problematic Data file contains data for only 5 to 10 keys, but is large (2.4 GB)

6. The problematic Filter file is only 256 B (which could be normal)


So now I am trying to read the Filter file in the same way
BloomFilterSerializer#deserialize does, as closely as I can, in order to
see if something is wrong with the file.

Could you give me some advice on:

1. what is happening?
2. the best way to simulate BloomFilterSerializer#deserialize?
3. any more info required to proceed?

Thanks,
Takenori


Re: CPU hotspot at BloomFilterSerializer#deserialize

2013-02-03 Thread Takenori Sato
Hi Aaron,

Thanks for your answers. That helped me get a big picture.

Yes, it contains a big row of up to 2 GB with more than a million columns.

Let me confirm if I correctly understand.

- The stack trace is from a Slice By Names query, and the deserialization
happens at step 3, "Read the row level Bloom Filter", on your blog.

- BloomFilterSerializer#deserialize calls readLong iteratively for each 4K
page of a given row, which could mean 500,000 iterations (readLong calls)
for a 2 GB row, based on the 1.0.7 source. A rough sketch of that read loop
follows below.

Correct?
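
As a rough sketch (illustrative only; the header field names and exact layout
here are assumptions, not the actual 1.0.7 serializer), the shape of the
hotspot is simply a long-by-long read of the row-level bloom filter's bitset:

import java.io.DataInputStream;
import java.io.IOException;

// Illustrative sketch of why readLong dominates: a bloom filter sized for a
// very wide row turns into hundreds of thousands of readLong() calls per read.
public final class BloomFilterReadSketch
{
    public static long[] readBitset(DataInputStream in) throws IOException
    {
        int hashCount = in.readInt();   // assumed header: number of hash functions
        int wordCount = in.readInt();   // assumed header: number of 64-bit words to follow
        long[] words = new long[wordCount];
        for (int i = 0; i < wordCount; i++)
            words[i] = in.readLong();   // this loop is where the profiler samples pile up
        return words;
    }
}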

That explains how Slice By Names queries against such a wide row could become
a CPU bottleneck. In fact, in our test environment, a
BloomFilterSerializer#deserialize in such a case takes more than 10 ms, up
to 100 ms.

> Get a single named column.
> Get the first 10 columns using the natural column order.
> Get the last 10 columns using the reversed order.

Interesting. So the query pattern can make a difference?

We thought the only solution was to change the data structure (avoid such a
wide row if it is retrieved by Slice By Names queries).

Anyway, will give it a try!

Best,
Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton wrote:

> 5. the problematic Data file contains only 5 to 10 keys data but
> large(2.4G)
>
> So very large rows ?
> What does nodetool cfstats or cfhistograms say about the row sizes ?
>
>
> 1. what is happening?
>
> I think this is partially large rows and partially the query pattern; this
> is only roughly correct:
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my talk
> here http://www.datastax.com/events/cassandrasummit2012/presentations
>
> 3. any more info required to proceed?
>
> Do some tests with different query techniques…
>
> Get a single named column.
> Get the first 10 columns using the natural column order.
> Get the last 10 columns using the reversed order.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 31/01/2013, at 7:20 PM, Takenori Sato  wrote:
>
> Hi all,
>
> We have a situation that CPU loads on some of our nodes in a cluster has
> spiked occasionally since the last November, which is triggered by requests
> for rows that reside on two specific sstables.
>
> We confirmed the followings(when spiked):
>
> version: 1.0.7(current) <- 0.8.6 <- 0.8.5 <- 0.7.8
> jdk: Oracle 1.6.0
>
> 1. a profiling showed that BloomFilterSerializer#deserialize was the
> hotspot(70% of the total load by running threads)
>
> * the stack trace looked like this(simplified)
> 90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
> 90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
> ...
> 90.4% - org.apache.cassandra.db.CollationController.collectTimeOrderedData
> ...
> 89.5% - org.apache.cassandra.db.columniterator.SSTableNamesIterator.read
> ...
> 79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
> 68.9% - org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize
> 66.7% - java.io.DataInputStream.readLong
>
> 2. Usually, 1 should be so fast that a profiling by sampling can not detect
>
> 3. no pressure on Cassandra's VM heap nor on machine in overal
>
> 4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by "iostat
> 1 1000")
>
> 5. the problematic Data file contains only 5 to 10 keys data but
> large(2.4G)
>
> 6. the problematic Filter file size is only 256B(could be normal)
>
>
> So now, I am trying to read the Filter file in the same way
> BloomFilterSerializer#deserialize does as possible as I can, in order to
> see if the file is something wrong.
>
> Could you give me some advise on:
>
> 1. what is happening?
> 2. the best way to simulate the BloomFilterSerializer#deserialize
> 3. any more info required to proceed?
>
> Thanks,
> Takenori
>
>
>


A fix for those who suffer from GC storm by tombstones

2014-10-07 Thread Takenori Sato
Hi,

I have filed a fix as CASSANDRA-8038, which should be good news for those
who have suffered from overwhelming GC or OOM caused by tombstones.

I'd appreciate your feedback!

Thanks,
Takenori


Re: A fix for those who suffer from GC storm by tombstones

2014-10-07 Thread Takenori Sato
DuyHai and Rob, thanks for your feedback.

Yeah, that's exactly the point I found. Some may want to run read repair even
on tombstones as before, but others, like Rob and us, may not.

Personally, I consider read repair a nice-to-have feature, especially for
tombstones, where a regular repair is enforced anyway.

So with this fix, I expect that a user can choose a better, more manageable risk as
needed. The good news is that the performance improvement is significant!

- Takenori

Sent from my iPhone

On 2014/10/08 at 3:18, Robert Coli wrote:

> 
>> On Tue, Oct 7, 2014 at 1:57 AM, DuyHai Doan  wrote:
>>  Read Repair belongs to the Anti-Entropy procedures to ensure that 
>> eventually, data from all replicas do converge. Tombstones are data 
>> (deletion marker) so they need to be exchanged between replicas. By skipping 
>> tombstone you prevent the data convergence with regard to deletion. 
> 
> Read repair is an optimization. I would probably just disable it in OP's case 
> and rely entirely on AES repair, because the 8303 approach makes read repair 
> not actually repair in some cases...
> 
> =Rob
>  


Re: Re[2]: how wide can wide rows get?

2014-11-13 Thread Takenori Sato
We have up to a few hundred million columns in a super wide row.

There are two major issues you should care about.

1. The wider the row, the more memory pressure you get for every slice query.
2. Repair is row-based, which means a huge row could be transferred on every
repair.

1 is not a big issue if you don't have many concurrent slice requests.
Having more cores is a good investment to reduce memory pressure.

2 could cause very high memory pressure as well as poorer disk utilization.


On Fri, Nov 14, 2014 at 3:21 PM, Plotnik, Alexey  wrote:

>  We have 380k of them in some of our rows and it's ok.
>
> -- Original Message --
> From: "Hannu Kröger" 
> To: "user@cassandra.apache.org" 
> Sent: 14.11.2014 16:13:49
> Subject: Re: how wide can wide rows get?
>
>
> The theoretical limit is maybe 2 billion but recommended max is around
> 10-20 thousand.
>
> Br,
> Hannu
>
> On 14.11.2014, at 8.10, Adaryl Bob Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>   I’m struggling with this wide row business. Is there an upward limit on
> the number of columns you can have?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>


Re: Cass 1.1.11 out of memory during compaction ?

2013-11-03 Thread Takenori Sato
Try increasing column_index_size_in_kb.

A slice query over ranges (SliceFromReadCommand) requires reading all the
column index entries for the row, so it could hit OOM if you have a very wide
row.
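
As a hedged sketch of what that tuning looks like (the value is a placeholder,
not a recommendation; the default is 64):

# cassandra.yaml
# A larger value means fewer index entries per wide row, so SliceFromReadCommand
# deserializes a smaller column index, at the cost of scanning more data per
# indexed block.
column_index_size_in_kb: 256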



On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin  wrote:

> Cass 1.1.11 ran out of memory on me with this exception (see below).
>
> My parameters are 8gig heap, new gen is 1200M.
>
> ERROR [ReadStage:55887] 2013-11-02 23:35:18,419
> AbstractCassandraDaemon.java (line 132) Exception in thread
> Thread[ReadStage:55887,5,main]
> java.lang.OutOfMemoryError: Java heap space
>at 
> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
>
>at org.apache.cassandra.utils.ByteBufferUtil.read(
> ByteBufferUtil.java:398)
>at 
> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
>
>at 
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
>
>at 
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
>
>at 
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
>
>at 
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
>
>at org.apache.cassandra.db.columniterator.IndexedSliceReader$
> IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
>at org.apache.cassandra.db.columniterator.IndexedSliceReader.
> computeNext(IndexedSliceReader.java:121)
>at org.apache.cassandra.db.columniterator.IndexedSliceReader.
> computeNext(IndexedSliceReader.java:48)
>at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>
>at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>
>at org.apache.cassandra.db.columniterator.
> SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
>at org.apache.cassandra.utils.MergeIterator$Candidate.
> advance(MergeIterator.java:147)
>at org.apache.cassandra.utils.MergeIterator$ManyToOne.
> advance(MergeIterator.java:126)
>at org.apache.cassandra.utils.MergeIterator$ManyToOne.
> computeNext(MergeIterator.java:100)
>at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>
>at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>
>at org.apache.cassandra.db.filter.SliceQueryFilter.
> collectReducedColumns(SliceQueryFilter.java:117)
>at org.apache.cassandra.db.filter.QueryFilter.
> collateColumns(QueryFilter.java:140)
>at 
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
>
>at 
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
>
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
>
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
>
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
>
>at org.apache.cassandra.db.Table.getRow(Table.java:378)
>at 
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
>
>at org.apache.cassandra.db.ReadVerbHandler.doVerb(
> ReadVerbHandler.java:51)
>at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:722)
>
>
> Any thoughts ?
>
> This is a dual data center set up, with 4 nodes in each DC and RF=2 in
> each.
>
>
> --
> Regards,
> Oleg Dulin
> http://www.olegdulin.com
>
>
>


Re: Cass 1.1.11 out of memory during compaction ?

2013-11-04 Thread Takenori Sato
I would go with cleanup.

Be careful of this bug:
https://issues.apache.org/jira/browse/CASSANDRA-5454
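
For reference, a minimal cleanup invocation might look like this (host,
keyspace, and column family names are placeholders):

nodetool -h 127.0.0.1 cleanup MyKeyspace MyColumnFamily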


On Mon, Nov 4, 2013 at 9:05 PM, Oleg Dulin  wrote:

> If i do that, wouldn't I need to scrub my sstables ?
>
>
> Takenori Sato  wrote:
> > Try increasing column_index_size_in_kb.
> >
> > A slice query to get some ranges(SliceFromReadCommand) requires to read
> > all the column indexes for the row, thus could hit OOM if you have a
> very wide row.
> >
> > On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin 
> wrote:
> >
> > Cass 1.1.11 ran out of memory on me with this exception (see below).
> >
> > My parameters are 8gig heap, new gen is 1200M.
> >
> > ERROR [ReadStage:55887] 2013-11-02 23:35:18,419
> > AbstractCassandraDaemon.java (line 132) Exception in thread
> > Thread[ReadStage:55887,5,main] java.lang.OutOfMemoryError: Java heap
> > space
> > at
> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
> >
> >at org.apache.cassandra.utils.ByteBufferUtil.read(
> > ByteBufferUtil.java:398)at
> >
> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
> >
> >at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
> >
> >at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
> >
> >at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
> >
> >at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
> >
> >at org.apache.cassandra.db.columniterator.IndexedSliceReader$
> > IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)at
> > org.apache.cassandra.db.columniterator.IndexedSliceReader.
> > computeNext(IndexedSliceReader.java:121)at
> > org.apache.cassandra.db.columniterator.IndexedSliceReader.
> > computeNext(IndexedSliceReader.java:48)at
> >
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
> >
> >at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
> >
> >at org.apache.cassandra.db.columniterator.
> > SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)at
> > org.apache.cassandra.utils.MergeIterator$Candidate.
> > advance(MergeIterator.java:147)at
> > org.apache.cassandra.utils.MergeIterator$ManyToOne.
> > advance(MergeIterator.java:126)at
> > org.apache.cassandra.utils.MergeIterator$ManyToOne.
> > computeNext(MergeIterator.java:100)at
> >
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
> >
> >at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
> >
> >at org.apache.cassandra.db.filter.SliceQueryFilter.
> > collectReducedColumns(SliceQueryFilter.java:117)at
> > org.apache.cassandra.db.filter.QueryFilter.
> > collateColumns(QueryFilter.java:140)
> > at
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
> >
> >at
> >
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
> >
> >at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
> >
> >at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
> >
> >at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
> >
> >at org.apache.cassandra.db.Table.getRow(Table.java:378)at
> >
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
> >
> >at org.apache.cassandra.db.ReadVerbHandler.doVerb(
> > ReadVerbHandler.java:51)at
> >
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> >
> >at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >at java.lang.Thread.run(Thread.java:722)
> >
> > Any thoughts ?
> >
> > This is a dual data center set up, with 4 nodes in each DC and RF=2 in
> each.
> >
> > --
> > Regards,
> > Oleg Dulin http://www.olegdulin.com";>http://www.olegdulin.com
> 
>
>


Re: Tracing Queries at Cassandra Server

2013-11-10 Thread Takenori Sato
In addition to CassandraServer, enable StorageProxy for more detail, as follows:

log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
log4j.logger.org.apache.cassandra.thrift.CassandraServer=DEBUG

Hope that helps.


On Mon, Nov 11, 2013 at 11:25 AM, Srinath Perera  wrote:

> I am talking to Cassandra using Hector. Is there a way that I can trace
> the executed queries at the server?
>
> I have tired adding Enable DEBUG logging for
> org.apache.cassandra.thrift.CassandraServer as mentioned in Cassandra vs
> logging 
> activity.
> But that does not provide much info (e.g. says slice query executed, but
> does not give more info).
>
> What I look for is something like SQL tracing in MySQL, so all queries
> executed are logged.
>
> --Srinath
>
>
>


Re: Recommended amount of free disk space for compaction

2013-11-29 Thread Takenori Sato
Hi,

> If Cassandra only compacts one table at a time, then I should be safe if
I keep as much free space as there is data in the largest table. If
Cassandra can compact multiple tables simultaneously, then it seems that I
need as much free space as all the tables put together, which means no more
than 50% utilization.

That depends on your configuration: by default, one concurrent compaction per
CPU core. See the concurrent_compactors setting for details.
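
A hedged example of where that lives (the value is a placeholder; by default
it is derived from the hardware rather than set explicitly):

# cassandra.yaml
# Caps how many compactions may run at once. The worst-case temporary disk
# usage scales with the number of concurrent compactions, not just the single
# largest table.
concurrent_compactors: 2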

> Also, what happens if a node gets low on disk space and there isn’t
enough available for compaction?

A compaction checks whether there is enough disk space based on its estimate;
if there isn't, it won't be executed.

> Is there a way to salvage a node that gets into a state where it cannot
compact its tables?

If you carefully run some cleanups, you'll reclaim some room based on the
node's new range.


On Fri, Nov 29, 2013 at 12:21 PM, Robert Wille  wrote:

> I’m trying to estimate our disk space requirements and I’m wondering about
> disk space required for compaction.
>
> My application mostly inserts new data and performs updates to existing
> data very infrequently, so there will be very few bytes removed by
> compaction. It seems that if a major compaction occurs, that performing the
> compaction will require as much disk space as is currently consumed by the
> table.
>
> So here’s my question. If Cassandra only compacts one table at a time,
> then I should be safe if I keep as much free space as there is data in the
> largest table. If Cassandra can compact multiple tables simultaneously,
> then it seems that I need as much free space as all the tables put
> together, which means no more than 50% utilization. So, how much free space
> do I need? Any rules of thumb anyone can offer?
>
> Also, what happens if a node gets low on disk space and there isn’t enough
> available for compaction? If I add new nodes to reduce the amount of data
> on each node, I assume the space won’t be reclaimed until a compaction
> event occurs. Is there a way to salvage a node that gets into a state where
> it cannot compact its tables?
>
> Thanks
>
> Robert
>
>


Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram

2014-03-09 Thread Takenori Sato
You have millions of org.apache.cassandra.db.DeletedColumn instances in the
snapshot.

This means you have lots of column tombstones, which, I guess, are being read
into memory by slice queries.
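
A quick, illustrative way to spot this on a running node (<pid> is a
placeholder for the Cassandra process id):

jmap -histo:live <pid> | grep -E 'DeletedColumn|ExpiringColumn'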


On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin  wrote:

> I am trying to understand why one of my nodes keeps full GC.
>
> I have Xmx set to 8gigs, memtable total size is 2 gigs.
>
> Consider the top entries from jmap -histo:live @
> http://pastebin.com/UaatHfpJ
>
> --
> Regards,
> Oleg Dulin
> http://www.olegdulin.com
>
>
>


Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram

2014-03-11 Thread Takenori Sato
In addition to Jonathan's suggestions, you can run a user-defined compaction
against the particular set of SSTable files from which you want to remove
tombstones.

But to do that, you need to find such an optimal set. Here you can find a
couple of helpful tools:

https://github.com/cloudian/support-tools


On Mon, Mar 10, 2014 at 7:41 PM, Oleg Dulin  wrote:

> I get that :)
>
> What I'd like to know is how to fix that :)
>
>
> On 2014-03-09 20:24:54 +, Takenori Sato said:
>
>  You have millions of org.apache.cassandra.db.DeletedColumn instances on
>> the snapshot.
>>
>> This means you have lots of column tombstones, and I guess, which are
>> read into memory by slice query.
>>
>>
>> On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin  wrote:
>> I am trying to understand why one of my nodes keeps full GC.
>>
>> I have Xmx set to 8gigs, memtable total size is 2 gigs.
>>
>> Consider the top entries from jmap -histo:live @
>> http://pastebin.com/UaatHfpJ
>>
>> --
>> Regards,
>> Oleg Dulin
>> http://www.olegdulin.com
>>
>
>
> --
> Regards,
> Oleg Dulin
> http://www.olegdulin.com
>
>
>


Re: Cleanup understastanding

2013-05-29 Thread Takenori Sato
> But, that is still awkward. Does cleanup take so much disk space to
complete the compaction operation? In other words, twice the size?

Not really, but logically yes.

According to the 1.0.7 source, cleanup checks whether there is enough free
space to cover the worst-case estimate below. If not, the exception you got is
thrown.

    /*
     * Add up all the files sizes this is the worst case file
     * size for compaction of all the list of files given.
     */
    public long getExpectedCompactedFileSize(Iterable<SSTableReader> sstables)
    {
        long expectedFileSize = 0;
        for (SSTableReader sstable : sstables)
        {
            long size = sstable.onDiskLength();
            expectedFileSize = expectedFileSize + size;
        }
        return expectedFileSize;
    }


On Wed, May 29, 2013 at 10:43 PM, Víctor Hugo Oliveira Molinar <
vhmoli...@gmail.com> wrote:

> Thanks for the answers.
>
> I got it. I was using cleanup, because I thought it would delete the
> tombstones.
> But, that is still awkward. Does cleanup take so much disk space to
> complete the compaction operation? In other words, twice the size?
>
>
> *Atenciosamente,*
> *Víctor Hugo Molinar - *@vhmolinar <http://twitter.com/#!/vhmolinar>
>
>
> On Tue, May 28, 2013 at 9:55 PM, Takenori Sato(Cloudian) <
> ts...@cloudian.com> wrote:
>
>>  Hi Victor,
>>
>> As Andrey said, running cleanup doesn't work as you expect.
>>
>>
>> > The reason I need to clean things is that I wont need most of my
>> inserted data on the next day.
>>
>> Deleted objects(columns/records) become deletable from sstable file when
>> they get expired(after gc_grace_seconds).
>>
>> Such deletable objects are actually gotten rid of by compaction.
>>
>> The tricky part is that a deletable object remains unless all of its old
>> objects(the same row key) are contained in the set of sstable files
>> involved in the compaction.
>>
>> - Takenori
>>
>>
>> (2013/05/29 3:01), Andrey Ilinykh wrote:
>>
>> cleanup removes data which doesn't belong to the current node. You have
>> to run it only if you move (or add new) nodes. In your case there is no any
>> reason to do it.
>>
>>
>> On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar <
>> vhmoli...@gmail.com> wrote:
>>
>>> Hello everyone.
>>> I have a daily maintenance task at c* which does:
>>>
>>> -truncate cfs
>>> -clearsnapshots
>>> -repair
>>> -cleanup
>>>
>>> The reason I need to clean things is that I wont need most of my
>>> inserted data on the next day. It's kind a business requirement.
>>>
>>> Well,  the problem I'm running to, is the misunderstanding about cleanup
>>> operation.
>>> I have 2 nodes with lower than half usage of disk, which is moreless
>>> 13GB;
>>>
>>> But, the last few days, arbitrarily each node have reported me a cleanup
>>> error indicating that the disk was full. Which is not true.
>>>
>>> *Error occured during cleanup*
>>> *java.util.concurrent.ExecutionException: java.io.IOException: disk full
>>> *
>>>
>>>
>>>  So I'd like to know more about what does happens in a cleanup
>>> operation.
>>> Appreciate any help.
>>>
>>
>>
>>
>


Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
> INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
1046937600

This says a GC of the New Generation took that long, which is usually unlikely.

The only situation I am aware of is when a fairly large object is created that
cannot be promoted to the Old Generation, because it requires a large
*contiguous* memory space that is unavailable at that point in time. This is
called a promotion failure. The allocation has to wait until the concurrent
collector frees a large enough space, so you experience a stop-the-world
pause. Though strictly speaking, I think it stops only the new world, not the
whole world.

For example, in Cassandra's case, a large in_memory_compaction_limit_in_mb can
cause this. That limit controls how large a row a compaction will merge
entirely in memory into the latest version, so it can create a byte array up
to that size.

You can confirm this by enabling promotion-failure GC logging going forward,
and by checking which compactions were running at that point in time.
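
For reference, the setting mentioned above lives in cassandra.yaml (the value
shown is the usual default; treat it as illustrative):

# cassandra.yaml
# Rows larger than this are compacted incrementally on disk instead of being
# merged entirely in memory, avoiding the single large allocation described
# above.
in_memory_compaction_limit_in_mb: 64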



On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli  wrote:

> On Fri, Jun 7, 2013 at 12:42 PM, Igor  wrote:
> > If you are talking about 1.2.x then I also have memory problems on the
> idle
> > cluster: java memory constantly slow grows up to limit, then spend long
> time
> > for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
> > cluster java memory stay on the same value.
>
> If you are not aware of a pre-existing JIRA, I strongly encourage you to :
>
> 1) Document your experience of this.
> 2) Search issues.apache.org for anything that sounds similar.
> 3) If you are unable to find a JIRA, file one.
>
> Thanks!
>
> =Rob
>


Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
Uncomment the following in "cassandra-env.sh":

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"

JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

> Also can you take a heap dump at 2 diff points so that we can compare it?

I'm afraid not. I ordinarily use profiling tools, but I am not aware of any
that would stay responsive during this event.



On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia wrote:

> Can you paste you gc config? Also can you take a heap dump at 2 diff
> points so that we can compare it?
>
> Quick thing to do would be to do a histo live at 2 points and compare
>
> Sent from my iPhone
>
> On Jun 15, 2013, at 6:57 AM, Takenori Sato  wrote:
>
> > INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
> 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
> 1046937600
>
> This says GC for New Generation took so long. And this is usually
> unlikely.
>
> The only situation I am aware of is when a fairly large object is created,
> and which can not be promoted to Old Generation because it requires such a
> large *contiguous* memory space that is unavailable at the point in time.
> This is called promotion failure. So it has to wait until concurrent
> collector collects a large enough space. Thus you experience stop the
> world. But I think it is not stop the world, but only stop the new world.
>
> For example in case of Cassandra, a large number of
> in_memory_compaction_limit_in_mb can cause this. This is a limit when a
> compaction compacts(merges) rows of a key into the latest in memory. So
> this creates a large byte array up to the number.
>
> You can confirm this by enabling promotion failure GC logging in the
> future, and by checking compactions executed at that point in time.
>
>
>
> On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli wrote:
>
>> On Fri, Jun 7, 2013 at 12:42 PM, Igor  wrote:
>> > If you are talking about 1.2.x then I also have memory problems on the
>> idle
>> > cluster: java memory constantly slow grows up to limit, then spend long
>> time
>> > for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
>> > cluster java memory stay on the same value.
>>
>> If you are not aware of a pre-existing JIRA, I strongly encourage you to :
>>
>> 1) Document your experience of this.
>> 2) Search issues.apache.org for anything that sounds similar.
>> 3) If you are unable to find a JIRA, file one.
>>
>> Thanks!
>>
>> =Rob
>>
>
>


Re: Reduce Cassandra GC

2013-06-15 Thread Takenori Sato
> Also can you take a heap dump at 2 diff points so that we can compare it?

Also note that a promotion failure is caused not by a particular object but by
fragmentation in the Old Generation space. So I am not sure you could tell
from a heap dump comparison.


On Sun, Jun 16, 2013 at 4:44 AM, Mohit Anchlia wrote:

> Can you paste you gc config? Also can you take a heap dump at 2 diff
> points so that we can compare it?
>
> Quick thing to do would be to do a histo live at 2 points and compare
>
> Sent from my iPhone
>
> On Jun 15, 2013, at 6:57 AM, Takenori Sato  wrote:
>
> > INFO [ScheduledTasks:1] 2013-04-15 14:00:02,749 GCInspector.java (line
> 122) GC for ParNew: 338798 ms for 1 collections, 592212416 used; max is
> 1046937600
>
> This says GC for New Generation took so long. And this is usually
> unlikely.
>
> The only situation I am aware of is when a fairly large object is created,
> and which can not be promoted to Old Generation because it requires such a
> large *contiguous* memory space that is unavailable at the point in time.
> This is called promotion failure. So it has to wait until concurrent
> collector collects a large enough space. Thus you experience stop the
> world. But I think it is not stop the world, but only stop the new world.
>
> For example in case of Cassandra, a large number of
> in_memory_compaction_limit_in_mb can cause this. This is a limit when a
> compaction compacts(merges) rows of a key into the latest in memory. So
> this creates a large byte array up to the number.
>
> You can confirm this by enabling promotion failure GC logging in the
> future, and by checking compactions executed at that point in time.
>
>
>
> On Sat, Jun 15, 2013 at 10:01 AM, Robert Coli wrote:
>
>> On Fri, Jun 7, 2013 at 12:42 PM, Igor  wrote:
>> > If you are talking about 1.2.x then I also have memory problems on the
>> idle
>> > cluster: java memory constantly slow grows up to limit, then spend long
>> time
>> > for GC. I never seen such behaviour for 1.0.x and 1.1.x, where on idle
>> > cluster java memory stay on the same value.
>>
>> If you are not aware of a pre-existing JIRA, I strongly encourage you to :
>>
>> 1) Document your experience of this.
>> 2) Search issues.apache.org for anything that sounds similar.
>> 3) If you are unable to find a JIRA, file one.
>>
>> Thanks!
>>
>> =Rob
>>
>
>


Re: Reduce Cassandra GC

2013-06-17 Thread Takenori Sato
Find "promotion failure". Bingo if it happened at the time.

Otherwise, post the relevant portion of the log here. Someone may find a
hint.


On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson
wrote:

> Just got a very long GC again. What am I to look for in the logging I just
> enabled?
>
>
> 2013/6/17 Joel Samuelsson 
>
>> > If you are talking about 1.2.x then I also have memory problems on the
>> idle cluster: java memory constantly slow grows up to limit, then spend
>> long time for GC. I never seen such behaviour for  1.0.x and 1.1.x, where
>> on idle cluster java memory stay on the same value.
>>
>> No I am running Cassandra 1.1.8.
>>
>> > Can you paste you gc config?
>>
>> I believe the relevant configs are these:
>> # GC tuning options
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>
>> I haven't changed anything in the environment config up until now.
>>
>> > Also can you take a heap dump at 2 diff points so that we can compare
>> it?
>>
>> I can't access the machine at all during the stop-the-world freezes. Was
>> that what you wanted me to try?
>>
>> > Uncomment the followings in "cassandra-env.sh".
>> Done. Will post results as soon as I get a new stop-the-world gc.
>>
>> > If you are unable to find a JIRA, file one
>>
>> Unless this turns out to be a problem on my end, I will.
>>
>
>


Re: Reduce Cassandra GC

2013-06-18 Thread Takenori Sato
  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,515 StatusLogger.java (line
> 116) testing_Keyspace.cf14 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,515 StatusLogger.java (line
> 116) testing_Keyspace.cf15 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,515 StatusLogger.java (line
> 116) testing_Keyspace.cf16 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,515 StatusLogger.java (line
> 116) testing_Keyspace.cf17 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
> 116) testing_Keyspace.cf18 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
> 116) testing_Keyspace.cf19 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
> 116) testing_Keyspace.cf20 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,516 StatusLogger.java (line
> 116) testing_Keyspace.cf21 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
> 116) testing_Keyspace.cf22 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
> 116) OpsCenter.rollups7200 0,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
> 116) OpsCenter.rollups864000,0
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
> 116) OpsCenter.rollups60 13745,3109686
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,517 StatusLogger.java (line
> 116) OpsCenter.events   18,826
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,518 StatusLogger.java (line
> 116) OpsCenter.rollups300      2516,570931
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line
> 116) OpsCenter.pdps9072,160850
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,519 StatusLogger.java (line
> 116) OpsCenter.events_timeline3,86
>  INFO [ScheduledTasks:1] 2013-06-17 08:13:47,520 StatusLogger.java (line
> 116) OpsCenter.settings0,0
>
> And from gc-1371454124.log I get:
> 2013-06-17T08:11:22.300+: 2551.288: [GC 870971K->216494K(4018176K),
> 145.1887460 secs]
>
>
> 2013/6/18 Takenori Sato 
>
>> Find "promotion failure". Bingo if it happened at the time.
>>
>> Otherwise, post the relevant portion of the log here. Someone may find a
>> hint.
>>
>>
>> On Mon, Jun 17, 2013 at 5:51 PM, Joel Samuelsson <
>> samuelsson.j...@gmail.com> wrote:
>>
>>> Just got a very long GC again. What am I to look for in the logging I
>>> just enabled?
>>>
>>>
>>> 2013/6/17 Joel Samuelsson 
>>>
>>>> > If you are talking about 1.2.x then I also have memory problems on
>>>> the idle cluster: java memory constantly slow grows up to limit, then spend
>>>> long time for GC. I never seen such behaviour for  1.0.x and 1.1.x, where
>>>> on idle cluster java memory stay on the same value.
>>>>
>>>> No I am running Cassandra 1.1.8.
>>>>
>>>> > Can you paste you gc config?
>>>>
>>>> I believe the relevant configs are these:
>>>> # GC tuning options
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>>>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>>>>
>>>> I haven't changed anything in the environment config up until now.
>>>>
>>>> > Also can you take a heap dump at 2 diff points so that we can
>>>> compare it?
>>>>
>>>> I can't access the machine at all during the stop-the-world freezes.
>>>> Was that what you wanted me to try?
>>>>
>>>> > Uncomment the followings in "cassandra-env.sh".
>>>> Done. Will post results as soon as I get a new stop-the-world gc.
>>>>
>>>> > If you are unable to find a JIRA, file one
>>>>
>>>> Unless this turns out to be a problem on my end, I will.
>>>>
>>>
>>>
>>
>


Re: Reduce Cassandra GC

2013-06-19 Thread Takenori Sato
The GC logging options are not set. You should see the following:

 -XX:+PrintGCDateStamps -XX:+PrintPromotionFailure
-Xloggc:/var/log/cassandra/gc-1371603607.log

> Is it normal to have two processes like this?

No. You are running two processes.


On Wed, Jun 19, 2013 at 4:16 PM, Joel Samuelsson
wrote:

> My Cassandra ps info:
>
> root 26791 1  0 07:14 ?00:00:00 /usr/bin/jsvc -user
> cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile
> /var/run/cassandra.pid -errfile &1 -outfile /var/log/cassandra/output.log
> -cp
> /usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar
> -Dlog4j.configuration=log4j-server.properties
> -Dlog4j.defaultInitOverride=true
> -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof
> -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea
> -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
> -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M
> -XX:+HeapDumpOnOutOfMemoryError -Xss180k -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
> -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> org.apache.cassandra.service.CassandraDaemon
> 103  26792 26791 99 07:14 ?854015-22:02:22 /usr/bin/jsvc -user
> cassandra -home /opt/java/64/jre1.6.0_32/bin/../ -pidfile
> /var/run/cassandra.pid -errfile &1 -outfile /var/log/cassandra/output.log
> -cp
> /usr/share/cassandra/lib/antlr-3.2.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-lang-2.6.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/cassandra/lib/guava-13.0.1.jar:/usr/share/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/cassandra/lib/jamm-0.2.5.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jline-1.0.jar:/usr/share/cassandra/lib/jna.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/libthrift-0.7.0.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/lz4-1.1.0.jar:/usr/share/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/cassandra/lib/netty-3.5.9.Final.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.7.2.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar:/usr/share/cassandra/lib/snappy-java-1.0.4.1.jar:/usr/share/cassandra/lib/snaptree-0.1.jar:/usr/share/cassandra/apache-cassandra-1.2.5.jar:/usr/share/cassandra/apache-cassandra-thrift-1.2.5.jar:/usr/share/cassandra/apache-cassandra.jar:/usr/share/cassandra/stress.jar:/usr/share/java/jna.jar:/etc/cassandra:/usr/share/java/commons-daemon.jar
> -Dlog4j.configuration=log4j-server.properties
> -Dlog4j.defaultInitOverride=true
> -XX:HeapDumpPath=/var/lib/cassandra/java_1371626058.hprof
> -XX:ErrorFile=/var/lib/cassandra/hs_err_1371626058.log -ea
> -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
> -XX:ThreadPriorityPolicy=42 -Xms4004M -Xmx4004M -Xmn800M
> -XX:+HeapDumpO

Re: Alternate "major compaction"

2013-07-11 Thread Takenori Sato
Hi,

I think it is a common headache for users running a large Cassandra cluster
in production.


Running a major compaction is not the only cause; there are others. For
example, I see two typical scenarios:

1. backup use case
2. active wide row

In case 1, say a piece of data is removed a year after it was written. This
means the tombstone for the row is one year away from the original row. To
remove an expired row entirely, a compaction set has to include all of its
fragments. So when are the original, one-year-old row and the tombstoned row
included in the same compaction set? It is likely to take another year.

In case 2, such an active wide row exists in most of the SSTable files, and it
typically contains many expired columns. But none of them get removed
entirely, because in practice a compaction set does not include all of the
row's fragments.


By the way, there is a very convenient MBean API available:
CompactionManager's forceUserDefinedCompaction. It lets you invoke a minor
compaction on a file set you define (see the sketch below). So the question is
how to find an optimal set of SSTable files.
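
As a hedged sketch of how that MBean can be driven from a small standalone JMX
client (the two-string signature of keyspace plus comma-separated Data.db
names matches the 1.0.x line and may differ in newer versions; the host, port,
keyspace, and file names below are placeholders):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceUserDefinedCompaction
{
    public static void main(String[] args) throws Exception
    {
        // connect to the node's JMX port (7199 by default)
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName name = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            // compact exactly this pair of files together
            mbs.invoke(name, "forceUserDefinedCompaction",
                       new Object[]{ "UserData", "Test5_BLOB-hc-3-Data.db,Test5_BLOB-hc-4-Data.db" },
                       new String[]{ "java.lang.String", "java.lang.String" });
        }
        finally
        {
            connector.close();
        }
    }
}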

So I wrote a tool that checks for garbage and prints out some useful
information to help find such an optimal set.

Here's a simple example of its output:

# /opt/cassandra/bin/checksstablegarbage -e
/cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData,
Test5_BLOB, 300(1373504071)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED,
REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
---
TOTAL, 40, 40
===

REMAINNING_SSTABLE_FILES lists any other SSTable files that contain the
respective row. So the following is an optimal set:

# /opt/cassandra/bin/checksstablegarbage -e
/cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
/cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData,
Test5_BLOB, 300(1373504131)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED,
REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 223, 0, YES, YES
---
TOTAL, 223, 0
===

This tool relies on SSTableReader and an aggregation iterator, just as
Cassandra does in compaction. I have been considering sharing it with the
community, so let me know if anyone is interested.

Ah, note that it is based on 1.0.7. So I will need to check and update for
newer versions.

Thanks,
Takenori


On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez wrote:

> Hi
>
> About a year ago, we did a major compaction in our cassandra cluster (a
> n00b mistake, I know), and since then we've had huge sstables that never
> get compacted, and we were condemned to repeat the major compaction process
> every once in a while (we are using SizeTieredCompaction strategy, and
> we've not avaluated yet LeveledCompaction, because it has its downsides,
> and we've had no time to test all of them in our environment).
>
> I was trying to find a way to solve this situation (that is, do something
> like a major compaction that writes small sstables, not huge as major
> compaction does), and I couldn't find it in the documentation. I tried
> cleanup and scrub/upgradesstables, but they don't do that (as documentation
> states). Then I tried deleting all data in a node and then bootstrapping it
> (or "nodetool rebuild"-ing it), hoping that this way the sstables would get
> cleaned from deleted records and updates. But the deleted node just copied
> the sstables from another node as they were, cleaning nothing.
>
> So I tried a new approach: I switched the sstable compaction strategy
> (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch,
> and then switching it back (Leveled to SizeTiered). It took a while (but so
> do the major compaction process) and it worked, I have smaller sstables,
> and I've regained a lot of disk space.
>
> I'm happy with the results, but it doesn't seem a orthodox way of
> "cleaning" the sstables. What do you think, is it something wrong or crazy?
> Is there a different way to achieve the same thing?
>
> Let's put an example:
> Suppose you have a write-only columnfamily (no updates and no deletes, so
> no need for LeveledCompaction, because SizeTiered works perfectly and
> requires less I/O) and you mistakenly run a major compaction on it. After a
> few months you need more space and you delete half the data, and you

Re: Alternate "major compaction"

2013-07-11 Thread Takenori Sato
Hi,

I made the repository public. You can now check it out from here:

https://github.com/cloudian/support-tools

checksstablegarbage is the tool.

Enjoy, and any feedback is welcome.

Thanks,
- Takenori


On Thu, Jul 11, 2013 at 10:12 PM, srmore  wrote:

> Thanks Takenori,
> Looks like the tool provides some good info that people can use. It would
> be great if you can share it with the community.
>
>
>
>
> On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato  wrote:
>
>> Hi,
>>
>> I think it is a common headache for users running a large Cassandra
>> cluster in production.
>>
>>
>> Running a major compaction is not the only cause, but more. For example,
>> I see two typical scenario.
>>
>> 1. backup use case
>> 2. active wide row
>>
>> In the case of 1, say, one data is removed a year later. This means,
>> tombstone on the row is 1 year away from the original row. To remove an
>> expired row entirely, a compaction set has to include all the rows. So,
>> when do the original, 1 year old row, and the tombstoned row are included
>> in a compaction set? It is likely to take one year.
>>
>> In the case of 2, such an active wide row exists in most of sstable
>> files. And it typically contains many expired columns. But none of them
>> wouldn't be removed entirely because a compaction set practically do not
>> include all the row fragments.
>>
>>
>> Btw, there is a very convenient MBean API is available. It is
>> CompactionManager's forceUserDefinedCompaction. You can invoke a minor
>> compaction on a file set you define. So the question is how to find an
>> optimal set of sstable files.
>>
>> Then, I wrote a tool to check garbage, and print outs some useful
>> information to find such an optimal set.
>>
>> Here's a simple log output.
>>
>> # /opt/cassandra/bin/checksstablegarbage -e 
>> /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
>> 300(1373504071)]
>> ===
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
>> REMAINNING_SSTABLE_FILES
>> ===
>> hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
>> ---
>> TOTAL, 40, 40
>> ===
>>
>> REMAINNING_SSTABLE_FILES means any other sstable files that contain the
>> respective row. So, the following is an optimal set.
>>
>> # /opt/cassandra/bin/checksstablegarbage -e 
>> /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db 
>> /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
>> [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 
>> 300(1373504131)]
>> ===
>> ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, 
>> REMAINNING_SSTABLE_FILES
>> ===
>> hello5/100.txt.1373502926003, 223, 0, YES, YES
>> ---
>> TOTAL, 223, 0
>> ===
>>
>> This tool relies on SSTableReader and an aggregation iterator as
>> Cassandra does in compaction. I was considering to share this with the
>> community. So let me know if anyone is interested.
>>
>> Ah, note that it is based on 1.0.7. So I will need to check and update
>> for newer versions.
>>
>> Thanks,
>> Takenori
>>
>>
>> On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez 
>> wrote:
>>
>>> Hi
>>>
>>> About a year ago, we did a major compaction in our cassandra cluster (a
>>> n00b mistake, I know), and since then we've had huge sstables that never
>>> get compacted, and we were condemned to repeat the major compaction process
>>> every once in a while (we are using SizeTieredCompaction strategy, and
>>> we've not avaluated yet LeveledCompaction, because it has its downsides,
>>> and we've had no time to test all of them in our environment).
>>>
>>> I was trying to find a way to solve this situation (that is, do
>>> something like a major compaction that writes small ss

Re: Alternate "major compaction"

2013-07-12 Thread Takenori Sato
It's lightweight. Without the -v option, you can even run it against just an
SSTable file, without needing the whole Cassandra installation.

- Takenori


On Sat, Jul 13, 2013 at 6:18 AM, Robert Coli  wrote:

> On Thu, Jul 11, 2013 at 9:43 PM, Takenori Sato  wrote:
>
>> I made the repository public. Now you can checkout from here.
>>
>> https://github.com/cloudian/support-tools
>>
>> checksstablegarbage is the tool.
>>
>> Enjoy, and any feedback is welcome.
>>
>
> Thanks very much, useful tool!
>
> Out of curiousity, what does "writesstablekeys" do that the upstream tool
> "sstablekeys" does not?
>
> =Rob
>


Fp chance for column level bloom filter

2013-07-17 Thread Takenori Sato
Hi,

I thought the memory consumption of the column-level bloom filter would become
a big concern when a row gets very wide, like more than tens of millions of
columns.

But I read in the source (1.0.7) that the false-positive chance for the
column-level bloom filter is hard-coded at 0.160, which is very high. So it
seems not.

Is this correct?
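
For scale, a back-of-the-envelope check using the standard bloom filter sizing
formula (not taken from the Cassandra source): bits per element =
-ln(p) / (ln 2)^2, so p = 0.160 works out to roughly 3.8 bits (about 0.48
bytes) per column, or around 5 MB of filter for a 10M-column row.

public final class BloomSizing
{
    public static void main(String[] args)
    {
        double p = 0.160;
        // standard optimal sizing: bits per element = -ln(p) / (ln 2)^2
        double bitsPerColumn = -Math.log(p) / (Math.log(2) * Math.log(2));
        double mbFor10M = bitsPerColumn * 10000000 / 8 / 1024 / 1024;
        System.out.printf("bits/column = %.2f, MB for a 10M-column row = %.2f%n",
                          bitsPerColumn, mbFor10M);
    }
}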

Thanks,
Takenori


Random Distribution, yet Order Preserving Partitioner

2013-08-22 Thread Takenori Sato
Hi,

I am trying to implement a custom partitioner that evenly distributes, yet
preserves order.

The partitioner returns a BigInteger token as RandomPartitioner does, while
producing a string decorated key as OrderPreservingPartitioner does (a rough
sketch of the idea follows below).
* for now, since IPartitioner does not support different types for token and
key, the BigInteger is simply converted to a string
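
Here is a minimal conceptual sketch of that idea (not an actual IPartitioner
implementation; the class and method names are hypothetical): the token is the
MD5 digest of the key taken as a BigInteger, just as RandomPartitioner derives
it, while the raw string key is kept alongside for ordering.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class RandomYetOrderedSketch
{
    // placement on the ring: effectively random, like RandomPartitioner
    public static BigInteger token(String key) throws Exception
    {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(digest).abs();
    }

    // ordering of data: the raw key itself, like OrderPreservingPartitioner
    public static String decoratedKey(String key)
    {
        return key;
    }
}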

Then I played around with cassandra-cli. As expected, in my 3-node test
cluster, get/set worked, but list (get_range_slices) didn't.

This came out of a challenge to overcome wide-row scalability limits, so I
want to make it work!

I am aware that some effort is required to make get_range_slices work. But
are there any other critical problems? For example, it seems there is an
assumption that the token and the key are the same. If that assumption runs
throughout the whole C* code base, this partitioner is not practical.

Or have you tried something similar?

I would appreciate your feedback!

Thanks,
Takenori


Re: Random Distribution, yet Order Preserving Partitioner

2013-08-22 Thread Takenori Sato
Hi Nick,

> token and key are not same. it was like this long time ago (single MD5
assumed single key)

True. That reminds me to run a test with the latest 1.2 instead of our
current 1.0!

> if you want ordered, you probably can arrange your data in a way so you
can get it in ordered fashion.

Yeah, we have done that for a long time. That's called a wide row, right? Or a
compound primary key.

It can handle a few million columns, but not much more, like 10M. I mean,
requests for such a row concentrate on a particular node, so the performance
degrades.

> I also had idea for semi-ordered partitioner - instead of single MD5, to
have two MD5's.

Sounds interesting. But we need a fully ordered result.

Anyway, I will try with the latest version.

Thanks,
Takenori


On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov  wrote:

> my five cents -
> token and key are not same. it was like this long time ago (single MD5
> assumed single key)
>
> if you want ordered, you probably can arrange your data in a way so you
> can get it in ordered fashion.
> for example long ago, i had single column family with single key and about
> 2-3 M columns - I do not suggest you to do it this way, because is wrong
> way, but it is easy to understand the idea.
>
> I also had idea for semi-ordered partitioner - instead of single MD5, to
> have two MD5's.
> then you can get semi-ordered ranges, e.g. you get ordered all cities in
> Canada, all cities in US and so on.
> however in this way things may get pretty non-ballanced
>
> Nick
>
>
>
>
>
> On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato wrote:
>
>> Hi,
>>
>> I am trying to implement a custom partitioner that evenly distributes,
>> yet preserves order.
>>
>> The partitioner returns a token by BigInteger as RandomPartitioner does,
>> while does a decorated key by string as OrderPreservingPartitioner does.
>> * for now, since IPartitioner does not support different types for
>> token and key, BigInteger is simply converted to string
>>
>> Then, I played around with cassandra-cli. As expected, in my 3 nodes test
>> cluster, get/set worked, but list(get_range_slices) didn't.
>>
>> This came from a challenge to overcome a wide row scalability. So, I want
>> to make it work!
>>
>> I am aware that some efforts are required to make get_range_slices work.
>> But are there any other critical problems? For example, it seems there is
>> an assumption that token and key are the same. If this is throughout the
>> whole C* code, this partitioner is not practical.
>>
>> Or have your tried something similar?
>>
>> I would appreciate your feedback!
>>
>> Thanks,
>> Takenori
>>
>
>


OrderPreservingPartitioner in 1.2

2013-08-23 Thread Takenori Sato
Hi,

I know it has been deprecated, but does OrderPreservingPartitioner still work
with 1.2?

I just wanted to see how it works, but I got a couple of exceptions, as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java (line
175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
at
org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
at
org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
at org.apache.cassandra.db.Table.apply(Table.java:379)
at org.apache.cassandra.db.Table.apply(Table.java:353)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
at
org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
at
org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
at org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
at
org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
at
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
at org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
at
org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
... 16 more

The key was "0ab68145" in HEX, that contains some control characters.

Another exception is this:

 INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line 891)
JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73) Beginning
bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line 430)
Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
at
org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
at
org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
at
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672
CassandraDaemon.java (line 175) Exception in thread
Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at
org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at
org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
at
org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at
org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:662)

I tried to set up a 3-node cluster with tokens A, H, and P. This error
was raised by the second node, with token H.

Thanks,
Takenori


Re: Random Distribution, yet Order Preserving Partitioner

2013-08-27 Thread Takenori Sato
Hi Manoj,

Thanks for your advice.

More or less, we basically do the same. As you pointed out, we now face
many cases that cannot be solved by data modeling, with rows reaching
100 million columns.

We could split them into multiple metadata rows, but that would add
complexity and thus be error prone. If possible, we want to avoid that.

- Takenori

On 2013/08/27, at 21:37, Manoj Mainali wrote:

Hi Takenori,

I can't tell for sure without knowing what kind of data you have and how
much you have. You can use the random partitioner and the concept of a
metadata row that stores the row keys, for example like below:

{metadata_row}: key1 | key2 | key3
key1:column1 | column2

When you do the read, you can always query directly by the key if you
already know it. In the case of range queries, you first query the
metadata_row and get the keys you want in an ordered fashion. Then you can
do a multi_get to fetch the actual data.

The downside is that you have to do two read queries, and depending on how
much data you have, you may end up with a wide metadata row.
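
To make the two-step read concrete, here is a toy, in-memory sketch of the
pattern (plain Java, no real Cassandra client API; the row keys, column names
and values are made up):

import java.util.*;

public class MetadataRowRead {
    public static void main(String[] args) {
        // Ordered "metadata" row: its column names are the row keys of the data rows.
        NavigableSet<String> metadataRow = new TreeSet<String>(Arrays.asList("key1", "key2", "key3"));

        // The actual rows, stored unordered (as they would be under RandomPartitioner).
        Map<String, Map<String, String>> dataRows = new HashMap<String, Map<String, String>>();
        dataRows.put("key1", Collections.singletonMap("column1", "v1"));
        dataRows.put("key2", Collections.singletonMap("column1", "v2"));
        dataRows.put("key3", Collections.singletonMap("column1", "v3"));

        // Step 1: slice the metadata row to get the wanted range of keys, in order.
        SortedSet<String> keysInRange = metadataRow.subSet("key1", true, "key2", true);

        // Step 2: multiget the actual rows by those keys.
        for (String key : keysInRange) {
            System.out.println(key + " -> " + dataRows.get(key));
        }
    }
}

In a real cluster, step 1 would be a column slice on the metadata row and step
2 a multiget, which is where the second round trip comes from.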

Manoj


On Fri, Aug 23, 2013 at 8:47 AM, Takenori Sato  wrote:

> Hi Nick,
>
> > token and key are not the same. It was like this a long time ago (a
> single MD5 assumed a single key).
>
> True. That reminds me to run a test with the latest 1.2 instead of our
> current 1.0!
>
> > if you want it ordered, you can probably arrange your data in a way
> that lets you get it back in an ordered fashion.
>
> Yeah, we have done that for a long time. That's called a wide row, right?
> Or a compound primary key.
>
> It can handle a few million columns, but not more than that, say 10M. I
> mean, requests for such a row concentrate on a particular node, so the
> performance degrades.
>
> > I also had an idea for a semi-ordered partitioner - instead of a single
> MD5, to have two MD5's.
>
> Sounds interesting. But, we need a fully ordered result.
>
> Anyway, I will try with the latest version.
>
> Thanks,
> Takenori
>
>
> On Thu, Aug 22, 2013 at 6:12 PM, Nikolay Mihaylov  wrote:
>
>> my five cents -
>> token and key are not the same. It was like this a long time ago (a single
>> MD5 assumed a single key).
>>
>> if you want it ordered, you can probably arrange your data in a way that
>> lets you get it back in an ordered fashion.
>> for example, long ago I had a single column family with a single key and
>> about 2-3 M columns - I do not suggest you do it this way, because it is
>> the wrong way, but it is easy to understand the idea.
>>
>> I also had an idea for a semi-ordered partitioner - instead of a single
>> MD5, to have two MD5's.
>> then you can get semi-ordered ranges, e.g. you get all cities in Canada
>> ordered, then all cities in the US, and so on.
>> however, in this way things may get pretty unbalanced
>>
>> Nick
>>
>>
>>
>>
>>
>> On Thu, Aug 22, 2013 at 11:19 AM, Takenori Sato wrote:
>>
>>> Hi,
>>>
>>> I am trying to implement a custom partitioner that evenly distributes,
>>> yet preserves order.
>>>
>>> The partitioner returns a token as a BigInteger, as RandomPartitioner does,
>>> while it returns a decorated key as a string, as OrderPreservingPartitioner does.
>>> * for now, since IPartitioner does not support different types for
>>> token and key, the BigInteger is simply converted to a string
>>>
>>> Then, I played around with cassandra-cli. As expected, in my 3-node
>>> test cluster, get/set worked, but list (get_range_slices) didn't.
>>>
>>> This came from a challenge to overcome the scalability limits of wide
>>> rows. So, I want to make it work!
>>>
>>> I am aware that some effort is required to make get_range_slices work.
>>> But are there any other critical problems? For example, it seems there is
>>> an assumption that token and key are the same. If that assumption runs
>>> throughout the whole C* code, this partitioner is not practical.
>>>
>>> Or have you tried something similar?
>>>
>>> I would appreciate your feedback!
>>>
>>> Thanks,
>>> Takenori
>>>
>>
>>
>


/proc/sys/vm/zone_reclaim_mode

2013-09-09 Thread Takenori Sato
Hi,

I am investigating NUMA issues.

I am aware that bin/cassandra tries to use the interleave-all NUMA policy if
available.

https://issues.apache.org/jira/browse/CASSANDRA-2594
https://issues.apache.org/jira/browse/CASSANDRA-3245

So what about /proc/sys/vm/zone_reclaim_mode? Any recommendations? I didn't
find any with respect to Cassandra.

By default on a Linux NUMA machine, this is set to 1, which tries to reclaim
pages within a zone rather than acquiring pages from other zones.

Explicitly disabling this sounds better.

"It may be beneficial to switch off zone reclaim if the system is used for
a file server and all of memory should be used for caching files from disk.
In that case the caching effect is more important than data locality."
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
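
For reference, a minimal check of the current setting from the JVM side (this
assumes Linux; actually changing the value is done outside the JVM, e.g. with
"sysctl -w vm.zone_reclaim_mode=0" or in /etc/sysctl.conf):

import java.io.*;

public class ZoneReclaimCheck {
    public static void main(String[] args) throws IOException {
        // Read the current value straight from procfs.
        BufferedReader reader = new BufferedReader(new FileReader("/proc/sys/vm/zone_reclaim_mode"));
        try {
            String value = reader.readLine().trim();
            System.out.println("vm.zone_reclaim_mode = " + value);
            if (!"0".equals(value)) {
                System.out.println("zone reclaim is enabled; consider disabling it on Cassandra nodes");
            }
        } finally {
            reader.close();
        }
    }
}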

Thanks!
Takenori


Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato
> So in fact, incremental backup of Cassandra is just hard link all the new
SSTable files being generated during the incremental backup period. It
could contain any data, not just the data being update/insert/delete in
this period, correct?

Correct.

But over time, some old enough SSTable files are usually shared across
multiple snapshots.


On Wed, Sep 18, 2013 at 3:37 AM, java8964 java8964 wrote:

> Another question: the SSTable files generated in the incremental backup are
> not really ONLY the incremental delta, right? They will include more
> than the delta.
>
> I will use the example to show my question:
>
> first, we have this data in the SSTable file 1:
>
> rowkey(1), columns (maker=honda).
>
> later, if we add one column in the same key:
>
> rowkey(1), columns (maker=honda, color=blue)
>
> The data above is flushed to another SSTable file 2. In this case, it
> will be part of the incremental backup at this time. But in fact, it will
> contain both the old data (maker=honda) and the new change (color=blue).
>
> So in fact, incremental backup of Cassandra is just hard link all the new
> SSTable files being generated during the incremental backup period. It
> could contain any data, not just the data being update/insert/delete in
> this period, correct?
>
> Thanks
>
> Yong
>
> > From: dean.hil...@nrel.gov
> > To: user@cassandra.apache.org
> > Date: Tue, 17 Sep 2013 08:11:36 -0600
>
> > Subject: Re: questions related to the SSTable file
> >
> > Netflix created file streaming in astyanax into cassandra specifically
> because writing too big a column cell is a bad thing. The limit is really
> dependent on use case….do you have servers writing 1000's of 200Meg files
> at the same time….if so, astyanax streaming may be a better way to go there
> where it divides up the file amongst cells and rows.
> >
> > I know the limit of a row size is really your hard disk space and the
> column count if I remember goes into billions though realistically, I think
> beyond 10 million might slow down a bit….all I know is we tested up to 10
> million columns with no issues in our use-case.
> >
> > So you mean at this time, I could get 2 SSTable files, both containing
> column "Blue" for the same row key, right?
> >
> > Yes
> >
> > In this case, I should be fine, as the value of the "Blue" column contains
> the timestamp to help me find out which is the last change, right?
> >
> > Yes
> >
> > In the MR world, each file COULD be processed by a different Mapper, but
> the data will be sent to the same reducer, as both share the same key.
> >
> > If that is the way you are writing it, then yes
> >
> > Dean
> >
> > From: Shahab Yunus <shahab.yu...@gmail.com>
> > Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> > Date: Tuesday, September 17, 2013 7:54 AM
> > To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> > Subject: Re: questions related to the SSTable file
> >
> > derstand if following changes apply to the same row key as above
> example, additional SSTable file could be generated. That is
>


Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato
Yong,

It seems there is still a misunderstanding.

> But there is no way we can be sure that these SSTable files will ONLY
contain modified data. So the statement being quoted above is not exactly
right. I agree that all the modified data in that period will be in the
incremental sstable files, but a lot of other unmodified data will be in
them too.

A memtable (a new sstable once flushed) contains only modified data, as I
explained with the example.

> If we have 2 rows data with different row key in the same memtable, and
if only 2nd row being modified. When the memtable is flushed to SSTable
file, it will contain both rows, and both will be in the incremental backup
files. So for first row, nothing change, but it will be in the incremental
backup.

Unless the first row is modified, it does not exist in the memtable at all.

> If I have one row with one column, now a new column is added, and whole
row in one memtable being flushed to SSTable file, as also in this
incremental backup. For first column, nothing change, but it will still be
in incremental backup file.

For example, if it worked as you understand it, then Color-2 should also
contain the row Lavender, and the row Blue with its existing column, hex, like
the following. But it does not.

- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]

--> your understanding
- Color-2-Data.db: [{Lavender: {hex: #E6E6FA}}, {Green: {hex: #008000}},
{Blue: {hex: #FF}, {hex2: #2c86ff}}]
* Row, Lavender, and Column Blue's hex have no changes


> The point I tried to make is this is important if I design an ETL to
consume the incremental backup SSTable files. As above example, I have to
realize that in the incremental backup sstable files, they could or most
likely contain old data which was previously processed already. That
will require additional logic and responsibility in the ETL to handle it,
or any outside SSTable consumer to pay attention to it.

I suggest trying org.apache.cassandra.tools.SSTableExport; then you will
see what's going on under the hood.

- Takenori








On Wed, Sep 18, 2013 at 10:51 AM, java8964 java8964 wrote:

> Quote:
>
> "
> To be clear, "incremental backup" feature backs up the data being modified
> in that period, because it writes only those files to the incremental
> backup dir as hard links, between full snapshots.
> "
>
> I thought I was clear, but your clarification confused me again.
> My understanding so far, from all the answers I have gotten, is that the
> more accurate statement is: the "incremental backup" feature backs up
> the SSTable files being generated in that period.
>
> But there is no way we can be sure that these SSTable files will ONLY
> contain modified data. So the statement being quoted above is not exactly
> right. I agree that all the modified data in that period will be in the
> incremental sstable files, but a lot of other unmodified data will be in
> them too.
>
> If we have 2 rows data with different row key in the same memtable, and if
> only 2nd row being modified. When the memtable is flushed to SSTable file,
> it will contain both rows, and both will be in the incremental backup
> files. So for first row, nothing change, but it will be in the incremental
> backup.
>
> If I have one row with one column, now a new column is added, and whole
> row in one memtable being flushed to SSTable file, as also in this
> incremental backup. For first column, nothing change, but it will still be
> in incremental backup file.
>
> The point I tried to make is this is important if I design an ETL to
> consume the incremental backup SSTable files. As above example, I have to
> realize that in the incremental backup sstable files, they could or most
> likely contain old data which was previously processed already. That
> will require additional logic and responsibility in the ETL to handle it,
> or any outside SSTable consumer to pay attention to it.
>
> Yong
>
> --
> Date: Tue, 17 Sep 2013 18:01:45 -0700
>
> Subject: Re: questions related to the SSTable file
> From: rc...@eventbrite.com
> To: user@cassandra.apache.org
>
>
> On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato  wrote:
>
> > So in fact, incremental backup of Cassandra is just hard link all the
> new SSTable files being generated during the incremental backup period. It
> could contain any data, not just the data being update/insert/delete in
> this period, correct?
>
> Correct.
>
> But over time, some old enough SSTable files are usually shared across
> multiple snapshots.
>
>
> To be clear, "incremental backup" feature backs up the data being modified
> in that p

Re: CPU hotspot at BloomFilterSerializer#deserialize

2013-02-05 Thread Takenori Sato(Cloudian)

Hi,

We found this issue is specific to 1.0.1 through 1.0.8; it was fixed
in 1.0.9.


https://issues.apache.org/jira/browse/CASSANDRA-4023

So by upgrading, we will see reasonable performance no matter how
large a row we have!


Thanks,
Takenori

(2013/02/05 2:29), aaron morton wrote:
Yes, it contains a big row that goes up to 2GB with more than a 
million of columns.
I've run tests with 10 million small columns and seen reasonable 
performance. I've not looked at 1 million large columns.


- BloomFilterSerializer#deserialize does readLong iteratively at 
each page

of size 4K for a given row, which means it could be 500,000 loops(calls
readLong) for a 2G row(from 1.0.7 source).
There is only one Bloom filter per row in an SSTable, not one per 
column index/page.


It could take a while if there are a lot of sstables in the read.

nodetool cfhistograms will let you know; run it once to reset the 
counts, then do your test, then run it again.


Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/02/2013, at 4:13 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:



It is interesting the press c* got about having 2 billion columns in a
row. You *can* do it but it brings to light some realities of what
that means.

On Sun, Feb 3, 2013 at 8:09 AM, Takenori Sato <ts...@cloudian.com> wrote:

Hi Aaron,

Thanks for your answers. That helped me get a big picture.

Yes, it contains a big row that goes up to 2GB with more than a 
million of

columns.

Let me confirm if I correctly understand.

- The stack trace is from Slice By Names query. And the 
deserialization is

at the step 3, "Read the row level Bloom Filter", on your blog.

- BloomFilterSerializer#deserialize does readLong iteratively at 
each page

of size 4K for a given row, which means it could be 500,000 loops(calls
readLong) for a 2G row(from 1.0.7 source).

Correct?

That makes sense Slice By Names queries against such a wide row 
could be CPU

bottleneck. In fact, in our test environment, a
BloomFilterSerializer#deserialize of such a case takes more than 
10ms, up to

100ms.


Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.


Interesting. A query pattern could make a difference?

We thought the only solutions is to change the data structure(don't 
use such

a wide row if it is retrieved by Slice By Names query).

Anyway, will give it a try!

Best,
Takenori

On Sat, Feb 2, 2013 at 2:55 AM, aaron morton <aa...@thelastpickle.com> wrote:


5. the problematic Data file contains only 5 to 10 keys data but
large(2.4G)

So very large rows ?
What does nodetool cfstats or cfhistograms say about the row sizes ?


1. what is happening?

I think this is partially large rows and partially the query 
pattern, this

is only by roughly correct
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ and my 
talk here

http://www.datastax.com/events/cassandrasummit2012/presentations

3. any more info required to proceed?

Do some tests with different query techniques…

Get a single named column.
Get the first 10 columns using the natural column order.
Get the last 10 columns using the reversed order.

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 31/01/2013, at 7:20 PM, Takenori Sato  wrote:

Hi all,

We have a situation that CPU loads on some of our nodes in a 
cluster has
spiked occasionally since the last November, which is triggered by 
requests

for rows that reside on two specific sstables.

We confirmed the followings(when spiked):

version: 1.0.7(current) <- 0.8.6 <- 0.8.5 <- 0.7.8
jdk: Oracle 1.6.0

1. a profiling showed that BloomFilterSerializer#deserialize was the
hotspot(70% of the total load by running threads)

* the stack trace looked like this(simplified)
90.4% - org.apache.cassandra.db.ReadVerbHandler.doVerb
90.4% - org.apache.cassandra.db.SliceByNamesReadCommand.getRow
...
90.4% - 
org.apache.cassandra.db.CollationController.collectTimeOrderedData

...
89.5% - 
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read

...
79.9% - org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter
68.9% - 
org.apache.cassandra.io.sstable.BloomFilterSerializer.deserialize

66.7% - java.io.DataInputStream.readLong

2. Usually, 1 should be so fast that a profiling by sampling can not
detect

3. no pressure on Cassandra's VM heap nor on machine in overal

4. a little I/O traffic for our 8 disks/node(up to 100tps/disk by 
"iostat

1 1000")

5. the problematic Data file contains only 5 to 10 keys data but
large(2.4G)

6. the problematic Filter file size is only 256B(could be normal)


So now, I am trying to read the Filter file in the same way
BloomFilterSerializer#deserialize does as possible as I 

Re: -pr vs. no -pr

2013-02-28 Thread Takenori Sato(Cloudian)

Hi,

Please note that I confirmed this on v1.0.7.

> I mean a repair involves all three nodes and pushes and pulls data, 
right?


Yes, but that's how -pr works. A repair without -pr does more.

For example, suppose you have a ring with RF=3 like this.

A - B - C - D - E - F

Then, a repair on A without -pr runs for 3 ranges, as follows:
[A, B, C]
[E, F, A]
[F, A, B]

Among them, the first one, [A, B, C] is the primary range of A.

So, with -pr, a repair runs only for:
[A, B, C]
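
To visualize the difference, here is a toy sketch in plain Java (not Cassandra
code; it only models SimpleStrategy-style placement for the 6-node ring above
with RF=3):

import java.util.*;

public class RepairRanges {
    public static void main(String[] args) {
        List<String> ring = Arrays.asList("A", "B", "C", "D", "E", "F");
        int rf = 3;
        int i = ring.indexOf("A");
        int n = ring.size();

        // With -pr: only the primary range of A; its replicas are A plus the next rf-1 nodes.
        System.out.println("with -pr   : " + replicas(ring, i, rf));

        // Without -pr: every range A replicates, i.e. the ranges primarily owned by A, F and E.
        StringBuilder all = new StringBuilder("without -pr:");
        for (int k = 0; k < rf; k++) {
            all.append(" ").append(replicas(ring, (i - k + n) % n, rf));
        }
        System.out.println(all);
    }

    // Replica set of the range primarily owned by ring.get(owner): that node plus the next rf-1 clockwise.
    static List<String> replicas(List<String> ring, int owner, int rf) {
        List<String> out = new ArrayList<String>();
        for (int k = 0; k < rf; k++) {
            out.add(ring.get((owner + k) % ring.size()));
        }
        return out;
    }
}

It prints [A, B, C] for the -pr case, and the three replica sets [A, B, C],
[F, A, B], [E, F, A] for the full repair, matching the lists above (order
aside).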

> I could run nodetool repair on just 2 nodes (RF=3) instead of using 
nodetool repair -pr???


Yes.

You need to run two repairs on A and D.

> What is the advantage of -pr then?

Whenever you want to minimize repair impacts.

For example, suppose one node has been down for a while, and you bring it 
back into the cluster.


You need to run repair without affecting the entire cluster. Then, -pr 
is the option.


Thanks,
Takenori

(2013/03/01 7:39), Hiller, Dean wrote:

Isn't it true if I have 6 nodes, I could run nodetool repair on just 2 
nodes (RF=3) instead of using nodetool repair -pr???

What is the advantage of -pr then?

I mean a repair involves all three nodes and pushes and pulls data, right?

Thanks,
Dean




Re: Cleanup understanding

2013-05-28 Thread Takenori Sato(Cloudian)

Hi Victor,

As Andrey said, running cleanup doesn't work as you expect.

> The reason I need to clean things is that I won't need most of my 
inserted data on the next day.


Deleted objects (columns/records) become deletable from sstable files once 
they expire (after gc_grace_seconds).


Such deletable objects are actually gotten rid of by compaction.

The tricky part is that a deletable object remains unless all of the sstable 
files that contain older fragments of it (the same row key) are included in 
the compaction.
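
As a rough illustration of the gc_grace_seconds part (toy code, not
Cassandra's; 864000 seconds, i.e. 10 days, is the default, and the deletion
timestamp is a made-up example):

public class GcGraceCheck {
    public static void main(String[] args) {
        long gcGraceSeconds = 864000L;              // default: 10 days
        long deletionTimeSeconds = 1369699200L;     // when the delete was issued (example value)
        long nowSeconds = System.currentTimeMillis() / 1000L;

        // The tombstone may only be purged after gc_grace_seconds has passed,
        // and only by a compaction that sees every fragment of the row.
        boolean pastGcGrace = nowSeconds > deletionTimeSeconds + gcGraceSeconds;
        System.out.println("past gc_grace_seconds: " + pastGcGrace);
    }
}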


- Takenori

(2013/05/29 3:01), Andrey Ilinykh wrote:
cleanup removes data which doesn't belong to the current node. You 
have to run it only if you move (or add new) nodes. In your case there 
is no reason to do it.



On Tue, May 28, 2013 at 7:39 AM, Víctor Hugo Oliveira Molinar 
<vhmoli...@gmail.com> wrote:


Hello everyone.
I have a daily maintenance task at c* which does:

-truncate cfs
-clearsnapshots
-repair
-cleanup

The reason I need to clean things is that I won't need most of my
inserted data on the next day. It's kind of a business requirement.

Well, the problem I'm running into is a misunderstanding about
the cleanup operation.
I have 2 nodes with less than half of the disk in use, which is
more or less 13GB;

But, over the last few days, each node has arbitrarily reported a
cleanup error indicating that the disk was full. Which is not true.

Error occurred during cleanup
java.util.concurrent.ExecutionException: java.io.IOException:
disk full


So I'd like to know more about what happens in a cleanup
operation.
Appreciate any help.






Re: OrderPreservingPartitioner in 1.2

2013-08-25 Thread Takenori Sato(Cloudian)

From the Jira,

> One possibility is that getToken of OPP can return a hex value if it 
fails to encode the bytes to UTF-8, instead of throwing an error. With this, 
system tables seem to be working fine with OPP.


This looks like an option for me to try.
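
Just to illustrate the idea (plain Java, not the actual Cassandra patch), a
decode-or-fall-back-to-hex helper could look like this, using the "0ab68145"
key from the log above:

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class KeyToToken {
    public static void main(String[] args) {
        byte[] key = new byte[] { 0x0a, (byte) 0xb6, (byte) 0x81, 0x45 }; // "0ab68145"
        System.out.println(tokenString(key));                             // prints 0ab68145
    }

    static String tokenString(byte[] key) {
        try {
            // Strict UTF-8 decoding: report malformed input instead of replacing it.
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            return decoder.decode(ByteBuffer.wrap(key)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to a hex representation of the raw bytes.
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }
}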

Thanks!

(2013/08/23 20:44), Vara Kumar wrote:
For the first exception: OPP was not working in 1.2. It has been fixed, 
but the fix is not yet in the latest 1.2.8 release.


Jira issue about it: https://issues.apache.org/jira/browse/CASSANDRA-5793


On Fri, Aug 23, 2013 at 12:51 PM, Takenori Sato <ts...@cloudian.com> wrote:


Hi,

I know it has been deprecated, but does OrderPreservingPartitioner
still work with 1.2?

Just wanted to know how it works, but I got a couple of exceptions
as below:

ERROR [GossipStage:2] 2013-08-23 07:03:57,171 CassandraDaemon.java
(line 175) Exception in thread Thread[GossipStage:2,5,main]
java.lang.RuntimeException: The provided key was not UTF8 encoded.
at

org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:233)
at

org.apache.cassandra.dht.OrderPreservingPartitioner.decorateKey(OrderPreservingPartitioner.java:53)
at org.apache.cassandra.db.Table.apply(Table.java:379)
at org.apache.cassandra.db.Table.apply(Table.java:353)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:258)
at

org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:117)
at

org.apache.cassandra.cql3.QueryProcessor.processInternal(QueryProcessor.java:172)
at
org.apache.cassandra.db.SystemTable.updatePeerInfo(SystemTable.java:258)
at

org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
at
org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:935)
at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:926)
at
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:884)
at

org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:57)
at

org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:260)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:781)
at
org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:167)
at
org.apache.cassandra.utils.ByteBufferUtil.string(ByteBufferUtil.java:124)
at

org.apache.cassandra.dht.OrderPreservingPartitioner.getToken(OrderPreservingPartitioner.java:229)
... 16 more

The key was "0ab68145" in HEX, that contains some control characters.

Another exception is this:

 INFO [main] 2013-08-23 07:04:27,659 StorageService.java (line
891) JOINING: Starting to bootstrap...
DEBUG [main] 2013-08-23 07:04:27,659 BootStrapper.java (line 73)
Beginning bootstrap process
ERROR [main] 2013-08-23 07:04:27,666 CassandraDaemon.java (line
430) Exception encountered during startup
java.lang.IllegalStateException: No sources found for (H,H]
at

org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:163)
at
org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:121)
at
org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
at

org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
at

org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
at

org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
at

org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
at

org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
ERROR [StorageServiceShutdownHook] 2013-08-23 07:04:27,672
CassandraDaemon.java (line 175) Exception in thread
Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at

org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at

org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
at

org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at

org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)
at
org.apache.cassandra.uti

Re: questions related to the SSTable file

2013-09-16 Thread Takenori Sato(Cloudian)

Hi,

> 1) I will expect same row key could show up in both sstable2json 
output, as this one row exists in both SSTable files, right?


Yes.

> 2) If so, what is the boundary? Will Cassandra guarantee the column 
level as the boundary? What I mean is that for one column's data, it 
will be guaranteed to be either in the first file, or 2nd file, right? 
There is no chance that Cassandra will cut the data of one column into 2 
part, and one part stored in first SSTable file, and the other part 
stored in second SSTable file. Is my understanding correct?


No.

> 3) If what we are talking about are only the SSTable files in 
snapshot, incremental backup SSTable files, exclude the runtime SSTable 
files, will anything change? For snapshot or incremental backup SSTable 
files, first can one row data still may exist in more than one SSTable 
file? And any boundary change in this case?
> 4) If I want to use incremental backup SSTable files as the way to 
catch data being changed, is it a good way to do what I try to achieve? 
In this case, what happens in the following example:


I don't fully understand, but a snapshot will do. It will create hard 
links to all the SSTable files present at the time of the snapshot.



Let me explain how SSTables and compaction work.

Suppose we have 4 files being compacted (the last one has just been 
flushed, which triggered the compaction). Note that file names are 
simplified.


- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
- Color-3-Data.db: [{Aqua: {hex: #00}}, {Green: {hex2: #32CD32}}, 
{Blue: {}}]

- Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]

They are created by the following operations.

- Add a row of (key, column, column_value = Blue, hex, #FF)
- Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
 memtable is flushed => Color-1-Data.db 
- Add a row of (key, column, column_value = Green, hex, #008000)
- Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
 memtable is flushed => Color-2-Data.db 
- Add a column of (key, column, column_value = Green, hex2, #32CD32)
- Add a row of (key, column, column_value = Aqua, hex, #00)
- Delete a row of (key = Blue)
 memtable is flushed => Color-3-Data.db 
- Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
- Add a row of (key, column, column_value = Gold, hex, #FFD700)
 memtable is flushed => Color-4-Data.db 

Then, a compaction will merge all those fragments together into the 
latest ones as follows.


- Color-5-Data.db: [{Lavender: {hex: #E6E6FA}, {Aqua: {hex: #00}, 
{Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, 
{Gold: {hex: #FFD700}}]

* assuming RandomPartitioner is used
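
If it helps to see the merge rule spelled out, here is a toy version in plain
Java (not Cassandra code; it ignores timestamps and token ordering, and an
empty row stands in for the row tombstone of Blue):

import java.util.*;

public class CompactionToy {
    public static void main(String[] args) {
        // The four fragments above, oldest first; "" marks a row deletion.
        List<Map<String, Map<String, String>>> files = new ArrayList<Map<String, Map<String, String>>>();
        files.add(rows("Lavender", "hex=#E6E6FA", "Blue", "hex=#FF"));           // Color-1
        files.add(rows("Green", "hex=#008000", "Blue", "hex2=#2c86ff"));         // Color-2
        files.add(rows("Aqua", "hex=#00", "Green", "hex2=#32CD32", "Blue", "")); // Color-3
        files.add(rows("Magenta", "hex=#FF00FF", "Gold", "hex=#FFD700"));        // Color-4

        Map<String, Map<String, String>> merged = new LinkedHashMap<String, Map<String, String>>();
        for (Map<String, Map<String, String>> file : files) {                    // oldest to newest
            for (Map.Entry<String, Map<String, String>> row : file.entrySet()) {
                if (row.getValue().isEmpty()) {        // row tombstone: drop the older columns
                    merged.remove(row.getKey());
                    continue;
                }
                Map<String, String> cols = merged.get(row.getKey());
                if (cols == null) {
                    merged.put(row.getKey(), cols = new LinkedHashMap<String, String>());
                }
                cols.putAll(row.getValue());           // a newer column wins over an older one
            }
        }
        // Same contents as Color-5 above (row order here is just insertion order).
        System.out.println(merged);
    }

    // Helper: alternating (row key, "column=value") pairs; "" means the row is deleted.
    static Map<String, Map<String, String>> rows(String... kv) {
        Map<String, Map<String, String>> m = new LinkedHashMap<String, Map<String, String>>();
        for (int i = 0; i < kv.length; i += 2) {
            Map<String, String> cols = new LinkedHashMap<String, String>();
            if (kv[i + 1].length() > 0) {
                String[] p = kv[i + 1].split("=");
                cols.put(p[0], p[1]);
            }
            m.put(kv[i], cols);
        }
        return m;
    }
}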

Hope that helps.

- Takenori

(2013/09/17 10:51), java8964 java8964 wrote:
Hi, I have some questions related to SSTables in Cassandra, as 
I am doing a project that uses them and hope someone on this list can share 
some thoughts.


My understanding is that an SSTable is per column family, but each column 
family could have multiple SSTable files. At runtime, one row 
COULD be split across more than one SSTable file; even though this is not 
good for performance, it does happen, and Cassandra will try to merge and 
store one row's data into one SSTable file during compaction.


The question is, when one row is split across multiple SSTable files, what is 
the boundary? Or let me ask this way: if one row exists in 2 SSTable 
files, and I run the sstable2json tool on both SSTable files 
individually:


1) I will expect same row key could show up in both sstable2json 
output, as this one row exists in both SSTable files, right?
2) If so, what is the boundary? Will Cassandra guarantee the column 
level as the boundary? What I mean is that for one column's data, it 
will be guaranteed to be either in the first file, or 2nd file, right? 
There is no chance that Cassandra will cut the data of one column into 
2 part, and one part stored in first SSTable file, and the other part 
stored in second SSTable file. Is my understanding correct?
3) If what we are talking about are only the SSTable files in 
snapshot, incremental backup SSTable files, exclude the runtime 
SSTable files, will anything change? For snapshot or incremental 
backup SSTable files, first can one row data still may exist in more 
than one SSTable file? And any boundary change in this case?
4) If I want to use incremental backup SSTable files as the way to 
catch data being changed, is it a good way to do what I try to 
achieve? In this case, what happens in the following example:


For column family A:
at Time 0, one row key (key1) has some data. It is stored and 
backed up in SSTable file 1.
at Time 1, if any column for key1 has any change (a new column insert, 
a column updated/deleted, or even whole row being deleted), I will 
expect this whole row exists in the any incremental backup SSTable 

Re: questions related to the SSTable file

2013-09-17 Thread Takenori Sato(Cloudian)

Thanks, Rob, for clarifying!

- Takenori

(2013/09/18 10:01), Robert Coli wrote:
On Tue, Sep 17, 2013 at 5:46 PM, Takenori Sato <ts...@cloudian.com> wrote:


> So in fact, incremental backup of Cassandra is just hard link
all the new SSTable files being generated during the incremental
backup period. It could contain any data, not just the data being
update/insert/delete in this period, correct?

Correct.

But over time, some old enough SSTable files are usually shared
across multiple snapshots.


To be clear, "incremental backup" feature backs up the data being 
modified in that period, because it writes only those files to the 
incremental backup dir as hard links, between full snapshots.


http://www.datastax.com/docs/1.0/operations/backup_restore
"
When incremental backups are enabled (disabled by default), Cassandra 
hard-links each flushed SSTable to a backups directory under the 
keyspace data directory. This allows you to store backups offsite 
without transferring entire snapshots. Also, incremental backups 
combine with snapshots to provide a dependable, up-to-date backup 
mechanism.

"

What Takenori is referring to is that a full snapshot is in some ways 
an "incremental backup" because it shares hard linked SSTables with 
other snapshots.
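
Mechanically, the per-SSTable step of the incremental backup is just a hard
link into the backups directory; a toy sketch of that idea (not Cassandra's
internal code, and the paths are made up for the example):

import java.nio.file.*;

public class HardLinkBackup {
    public static void main(String[] args) throws Exception {
        // Hypothetical freshly flushed SSTable and its backups directory.
        Path flushed = Paths.get("/var/lib/cassandra/data/ks/cf/ks-cf-ic-5-Data.db");
        Path backupDir = Paths.get("/var/lib/cassandra/data/ks/cf/backups");

        Files.createDirectories(backupDir);
        // Link and original share the same inode, so the backup costs no extra
        // space until the live copy is compacted away and deleted.
        Files.createLink(backupDir.resolve(flushed.getFileName()), flushed);
    }
}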


=Rob