is "Not a time-based UUID" serious?

2012-09-12 Thread Bryan Talbot
I'm testing upgrading a multi-node cluster from 1.0.9 to 1.1.5 and ran into
the error message described here:
https://issues.apache.org/jira/browse/CASSANDRA-4195

What I can't tell is if this is a serious issue or if it can be safely
ignored.

If it is a serious issue, shouldn't the migration guides for 1.1.x state
that rolling upgrades aren't supported, or that all nodes must be running
1.0.11 or greater first?


2012-09-11 17:12:46,299 [GossipStage:1] ERROR
org.apache.cassandra.service.AbstractCassandraDaemon  - Fatal exception in
thread Thread[GossipStage:1,5,main]
java.lang.UnsupportedOperationException: Not a time-based UUID
at java.util.UUID.timestamp(UUID.java:308)
at
org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
at
org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:99)
at
org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:83)
at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:806)
at
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:849)
at
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:908)
at
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


-Bryan


Re: is "Not a time-based UUID" serious?

2012-09-12 Thread Bryan Talbot
To answer my own question: yes, the error is fatal.  This also means that,
to be successful, upgrades from 1.0.x to 1.1.x apparently MUST start from
1.0.11 or greater.

My test upgrade from 1.0.9 to 1.1.5 left the cluster in a state that wasn't
able to come to a schema agreement and blocked schema changes.
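
For anyone hitting the same thing: a quick way to check whether the ring has
reached schema agreement is "describe cluster;" in cassandra-cli, which lists
the schema versions and which hosts are on each one.  This is just a
suggestion on my part, not something from the upgrade docs:

$> cassandra-cli -h localhost
[default@unknown] describe cluster;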

-Bryan


On Wed, Sep 12, 2012 at 2:42 PM, Bryan Talbot wrote:

> I'm testing upgrading a multi-node cluster from 1.0.9 to 1.1.5 and ran
> into the error message described here:
> https://issues.apache.org/jira/browse/CASSANDRA-4195
>
> What I can't tell is if this is a serious issue or if it can be safely
> ignored.
>
> If it is a serious issue, shouldn't the migration guides for 1.1.x require
> that upgrades cannot be rolling or that all nodes must be running 1.0.11 or
> greater first?
>
>
> 2012-09-11 17:12:46,299 [GossipStage:1] ERROR
> org.apache.cassandra.service.AbstractCassandraDaemon  - Fatal exception in
> thread Thread[GossipStage:1,5,main]
> java.lang.UnsupportedOperationException: Not a time-based UUID
> at java.util.UUID.timestamp(UUID.java:308)
> at
> org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
> at
> org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:99)
> at
> org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:83)
> at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:806)
> at
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:849)
> at
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:908)
> at
> org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
>
> -Bryan
>
>


Re: what's the most 1.1 stable version?

2012-10-05 Thread Bryan Talbot
We've been using 1.1.5 for a few weeks now and it's been stable for our
uses.  Also, make sure you upgrade to a more recent version of the 1.0 branch
before going to 1.1.  Version 1.0.7 was released before 1.1 and there are
upgrade-path fixes applied to 1.0 after that.  Our upgrade path was 1.0.9
-> 1.0.11 -> 1.1.5, which worked well.
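
For each node, one at a time, the upgrade itself looked roughly like this.
The service name and install step are just examples for a typical package
install, a sketch rather than the exact commands we ran:

$> nodetool -h localhost drain          # flush memtables, stop accepting writes
$> sudo /etc/init.d/cassandra stop
$> # install the new Cassandra version and merge any cassandra.yaml changes
$> sudo /etc/init.d/cassandra start
$> nodetool -h localhost upgradesstables   # rewrite SSTables in the new format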

-Bryan


On Fri, Oct 5, 2012 at 8:01 AM, Andrey Ilinykh  wrote:

> In 1.1.5 file descriptor leak was fixed. In my case it was critical.
> Nodes went down every several days. But not everyone had this problem.
>
> Thank you,
>   Andrey
>
> On Fri, Oct 5, 2012 at 7:42 AM, Alexandru Sicoe  wrote:
> > Hello,
> >  We are planning to upgrade from version 1.0.7 to the 1.1 branch. Which
> is
> > the stable version that people are using? I see the latest release is
> 1.1.5
> > but maybe it's not fully wise to use this. Is 1.1.4 the one to use?
> >
> > Cheers,
> > Alex
>



-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards

2012-10-05 Thread Bryan Talbot
I've recently added compaction rate (in bytes / second) to my monitors for
cassandra and am seeing some odd values.  I wasn't expecting the values for
TotalBytesCompacted to sometimes decrease from one reading to the next.  It
seems that the value should be monotonically increasing while a server is
running -- obviously it would start again at 0 when the server is restarted
or if the counter rolls over (unlikely for a 64 bit long).

Below are two samples taken 60 seconds apart: the value decreased by
2,954,369,012 between the two readings.

reported_metric=[timestamp:1349476449, status:200,
request:[mbean:org.apache.cassandra.db:type=CompactionManager,
attribute:TotalBytesCompacted, type:read], value:7548675470069]

previous_metric=[timestamp:1349476389, status:200,
request:[mbean:org.apache.cassandra.db:type=CompactionManager,
attribute:TotalBytesCompacted, type:read], value:7551629839081]


I briefly looked at the code for CompactionManager and a few related
classes and don't see any place that performs subtraction explicitly;
however, there are many additions of signed long values that are not
validated and could conceivably contain a negative value, causing
totalBytesCompacted to decrease.  It's interesting to note that all of
the differences I've seen so far are larger than the overflow value of a
signed 32-bit value.  The OS (CentOS 5.7) and sun java vm (1.6.0_29) are
both 64 bit.  JNA is enabled.

Is this expected and normal?  If so, what is the correct interpretation of
this metric?  I'm seeing the negative values a few times per hour when
reading it once every 60 seconds.
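
For reference, the same attribute can be polled directly over plain JMX.
This is a minimal sketch (the host, the default JMX port 7199, and the
60-second interval are assumptions), with the rate clamped at zero whenever
the counter moves backwards as described above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionRatePoller {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199; host is an assumption.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");

        long prev = (Long) mbs.getAttribute(cm, "TotalBytesCompacted");
        long prevTs = System.currentTimeMillis();
        while (true) {   // sketch only: runs forever, never closes the connector
            Thread.sleep(60000);   // 60 second sample interval
            long cur = (Long) mbs.getAttribute(cm, "TotalBytesCompacted");
            long now = System.currentTimeMillis();
            long delta = cur - prev;
            if (delta < 0) {
                delta = 0;   // clamp the backwards movements to zero
            }
            System.out.printf("compaction rate: %.0f bytes/sec%n",
                    delta / ((now - prevTs) / 1000.0));
            prev = cur;
            prevTs = now;
        }
    }
}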

-Bryan


Re: MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards

2012-10-08 Thread Bryan Talbot
I'm attempting to plot how "busy" the node is doing compactions but there
seems to only be a few metrics reported that might be suitable:
CompletedTasks, PendingTasks, TotalBytesCompacted,
TotalCompactionsCompleted.

It's not clear to me what the difference between CompletedTasks and
TotalCompactionsCompleted is, but I am plotting TotalCompactionsCompleted /
sec as one metric; however, this rate is nearly always less than 1 and
doesn't capture how many resources compaction actually uses.  A compaction
of the 4 smallest SSTables counts the same as a compaction of the 4 largest
SSTables, but the cost is hugely different.  Thus, I'm also plotting
TotalBytesCompacted / sec.
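
(For what it's worth, a point-in-time view is also available from
"nodetool -h <host> compactionstats", which shows pending tasks and the
progress of each running compaction -- useful for spot checks, though not
something easily graphed as a rate.)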

Since the TotalBytesCompacted value sometimes moves backwards, I'm not
confident that it's reporting what it is meant to report.  The code and
comments indicate that it should only be incremented by the final size of
a newly created SSTable or by the bytes-compacted-so-far of a larger
compaction, so I don't see why it should ever decrease.

How should the impact of compaction be measured if not by bytes compacted?

-Bryan


On Sun, Oct 7, 2012 at 7:39 AM, Edward Capriolo wrote:

> I have not looked at this JMX object in a while, however the
> compaction manager can support multiple threads. Also it moves from
> 0-filesize each time it has to compact a set of files.
>
> That is more useful for showing current progress rather then lifetime
> history.
>
>
>
> On Fri, Oct 5, 2012 at 7:27 PM, Bryan Talbot 
> wrote:
> > I've recently added compaction rate (in bytes / second) to my monitors
> for
> > cassandra and am seeing some odd values.  I wasn't expecting the values
> for
> > TotalBytesCompacted to sometimes decrease from one reading to the next.
>  It
> > seems that the value should be monotonically increasing while a server is
> > running -- obviously it would start again at 0 when the server is
> restarted
> > or if the counter rolls over (unlikely for a 64 bit long).
> >
> > Below are two samples taken 60 seconds apart: the value decreased by
> > 2,954,369,012 between the two readings.
> >
> > reported_metric=[timestamp:1349476449, status:200,
> > request:[mbean:org.apache.cassandra.db:type=CompactionManager,
> > attribute:TotalBytesCompacted, type:read], value:7548675470069]
> >
> > previous_metric=[timestamp:1349476389, status:200,
> > request:[mbean:org.apache.cassandra.db:type=CompactionManager,
> > attribute:TotalBytesCompacted, type:read], value:7551629839081]
> >
> >
> > I briefly looked at the code for CompactionManager and a few related
> classes
> > and don't see anyplace that is performing subtraction explicitly;
> however,
> > there are many additions of signed long values that are not validated and
> > could conceivably contain a negative value thus causing the
> > totalBytesCompacted to decrease.  It's interesting to note that the all
> of
> > the differences I've seen so far are more than the overflow value of a
> > signed 32 bit value.  The OS (CentOS 5.7) and sun java vm (1.6.0_29) are
> > both 64 bit.  JNA is enabled.
> >
> > Is this expected and normal?  If so, what is the correct interpretation
> of
> > this metric?  I'm seeing the negatives values a few times per hour when
> > reading it once every 60 seconds.
> >
> > -Bryan
> >
>



-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


constant CMS GC using CPU time

2012-10-18 Thread Bryan Talbot
In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
(64-bit), the nodes are often getting "stuck" in a state where CMS
collections of the old space are constantly running.

The JVM configuration is using the standard settings in cassandra-env --
relevant settings are included below.  The max heap is currently set to 5
GB with 800MB for new size.  I don't believe that the cluster is overly
busy, and it seems to be performing well enough other than this issue.  When
nodes get into this state they never seem to leave it (by freeing up old
space memory) without restarting cassandra.  They typically enter this
state while running "nodetool repair -pr", but once they do, restarting
them only "fixes" it for a couple of hours.

Compactions are completing and are generally not queued up.  All CF are
using STCS.  The busiest CF consumes about 100GB of space on disk, is write
heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
including those used for system keyspace and secondary indexes.  The number
of SSTables per node currently varies from 185-212.

Other than frequent log warnings about "GCInspector  - Heap is 0.xxx full..."
and "StorageService  - Flushing CFS(...) to relieve memory pressure" there
are no other log entries to indicate there is a problem.

Does the memory needed vary depending on the amount of data stored?  If so,
how can I predict how much jvm space is needed?  I don't want to make the
heap too large as that's bad too.  Maybe there's a memory leak related to
compaction that doesn't allow meta-data to be purged?


-Bryan


12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
cache.
$> free -m
             total       used       free     shared    buffers     cached
Mem:         12001      11870        131          0          4       5778
-/+ buffers/cache:        6087       5914
Swap:            0          0          0


jvm settings in cassandra-env
MAX_HEAP_SIZE="5G"
HEAP_NEWSIZE="800M"

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"


jstat shows about 12 full collections per minute with old heap usage
constantly over 75% so CMS is always over the
CMSInitiatingOccupancyFraction threshold.

$> jstat -gcutil -t 22917 5000 4
Timestamp         S0     S1     E      O      P     YGC     YGCT     FGC    FGCT      GCT
       132063.0  34.70   0.00  26.03  82.29  59.88  21580   506.887  17523  3078.941  3585.829
       132068.0  34.70   0.00  50.02  81.23  59.88  21580   506.887  17524  3079.220  3586.107
       132073.1   0.00  24.92  46.87  81.41  59.88  21581   506.932  17525  3079.583  3586.515
       132078.1   0.00  24.92  64.71  81.40  59.88  21581   506.932  17527  3079.853  3586.785


Other hosts not currently experiencing the high CPU load have a heap less
than .75 full.

$> jstat -gcutil -t 6063 5000 4
Timestamp         S0     S1     E      O      P     YGC     YGCT     FGC    FGCT      GCT
       520731.6   0.00  12.70  36.37  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
       520736.5   0.00  12.70  53.25  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
       520741.5   0.00  12.70  68.92  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
       520746.5   0.00  12.70  83.11  71.33  59.26  46453  1688.809  14785  2130.779  3819.588


Re: hadoop consistency level

2012-10-18 Thread Bryan Talbot
I believe that reading with CL.ONE will still cause read repair to be run
(in the background) 'read_repair_chance' of the time.
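
For reference, that chance is a per-CF setting.  A sketch of adjusting it
with cassandra-cli (the CF name is just an example):

[default@MyKeyspace] update column family MyCF with read_repair_chance = 0.1;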

-Bryan


On Thu, Oct 18, 2012 at 1:52 PM, Andrey Ilinykh  wrote:

> On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman
>  wrote:
> > Not sure I understand your question (if there is one..)
> >
> > You are more than welcome to do CL ONE and assuming you have hadoop nodes
> > in the right places on your ring things could work out very nicely. If
> you
> > need to guarantee that you have all the data in your job then you'll need
> > to use QUORUM.
> >
> > If you don't specify a CL in your job config it will default to ONE (at
> > least that's what my read of the ConfigHelper source for 1.1.6 shows)
> >
> I have two questions.
> 1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
> it correct?
> 2. With CL QUORUM cassandra reads data from all replicas. In this case
> Hadoop doesn't give me any  benefits. Application running outside the
> cluster has the same performance. Is it correct?
>
> Thank you,
>   Andrey
>


Re: constant CMS GC using CPU time

2012-10-19 Thread Bryan Talbot
ok, let me try asking the question a different way ...

How does cassandra use memory and how can I plan how much is needed?  I
have a 1 GB memtable and 5 GB total heap and that's still not enough even
though the number of concurrent connections and garbage generation rate is
fairly low.

If I were using mysql or oracle, I could compute how much memory could be
used by N concurrent connections, how much is allocated for caching, temp
spaces, etc.  How can I do this for cassandra?  Currently it seems like the
memory used scales with the amount of bytes stored and not with how busy
the server actually is.  That's not such a good thing.

-Bryan



On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot wrote:

> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
> (64-bit), the nodes are often getting "stuck" in state where CMS
> collections of the old space are constantly running.
>
> The JVM configuration is using the standard settings in cassandra-env --
> relevant settings are included below.  The max heap is currently set to 5
> GB with 800MB for new size.  I don't believe that the cluster is overly
> busy and seems to be performing well enough other than this issue.  When
> nodes get into this state they never seem to leave it (by freeing up old
> space memory) without restarting cassandra.  They typically enter this
> state while running "nodetool repair -pr" but once they start doing this,
> restarting them only "fixes" it for a couple of hours.
>
> Compactions are completing and are generally not queued up.  All CF are
> using STCS.  The busiest CF consumes about 100GB of space on disk, is write
> heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
> including those used for system keyspace and secondary indexes.  The number
> of SSTables per node currently varies from 185-212.
>
> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx
> full..." and "StorageService  - Flushing CFS(...) to relieve memory
> pressure" there are no other log entries to indicate there is a problem.
>
> Does the memory needed vary depending on the amount of data stored?  If
> so, how can I predict how much jvm space is needed?  I don't want to make
> the heap too large as that's bad too.  Maybe there's a memory leak related
> to compaction that doesn't allow meta-data to be purged?
>
>
> -Bryan
>
>
> 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
> cache.
> $> free -m
>  total   used   free sharedbuffers cached
> Mem: 12001  11870131  0  4   5778
> -/+ buffers/cache:   6087   5914
> Swap:0  0  0
>
>
> jvm settings in cassandra-env
> MAX_HEAP_SIZE="5G"
> HEAP_NEWSIZE="800M"
>
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
>
> jstat shows about 12 full collections per minute with old heap usage
> constantly over 75% so CMS is always over the
> CMSInitiatingOccupancyFraction threshold.
>
> $> jstat -gcutil -t 22917 5000 4
> Timestamp S0 S1 E  O  P YGC YGCTFGC
>  FGCT GCT
>132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
> 3078.941 3585.829
>132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
> 3079.220 3586.107
>132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
> 3079.583 3586.515
>132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
> 3079.853 3586.785
>
>
> Other hosts not currently experiencing the high CPU load have a heap less
> than .75 full.
>
> $> jstat -gcutil -t 6063 5000 4
> Timestamp S0 S1 E  O  P YGC YGCTFGC
>  FGCT GCT
>520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>
>
>
>


Re: constant CMS GC using CPU time

2012-10-22 Thread Bryan Talbot
The memory usage was correlated with the size of the data set.  The nodes
were a bit unbalanced, which is normal due to variations in compactions.
The nodes with the most data used the most memory.  All nodes are affected
eventually, not just one.  The GC was on-going even when the nodes were not
compacting or running a heavy application load -- even when the main app
was paused, the constant GC continued.

As a test we dropped the largest CF and the memory usage immediately
dropped to acceptable levels and the constant GC stopped.  So it's
definitely related to data load.  memtable size is 1 GB, row cache is
disabled and key cache is small (default).

I believe one culprit turned out to be the bloom filters.  They were 2+ GB
(as reported by nodetool cfstats anyway).  It looks like
bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as
the minimum value.  Raising it to 0.20 for some write-mostly CF reduced the
memory used by 1GB or so.
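
For anyone else considering this, the change can be made with a per-CF schema
update along these lines (CF name is just an example; cassandra-cli shown, but
CQL works too).  Note that the new bloom filters only take effect as SSTables
are rewritten, e.g. by compaction or "nodetool upgradesstables" / "nodetool
scrub":

[default@MyKeyspace] update column family MyCF with bloom_filter_fp_chance = 0.2;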

Is there any way to predict how much memory the bloom filters will consume
if the size of the row keys, the number of rows, and the fp chance are
known?

-Bryan



On Mon, Oct 22, 2012 at 12:25 AM, aaron morton wrote:

> If you are using the default settings I would try to correlate the GC
> activity with some application activity before tweaking.
>
> If this is happening on one machine out of 4 ensure that client load is
> distributed evenly.
>
> See if the raise in GC activity us related to Compaction, repair or an
> increase in throughput. OpsCentre or some other monitoring can help with
> the last one. Your mention of TTL makes me think compaction may be doing a
> bit of work churning through rows.
>
> Some things I've done in the past before looking at heap settings:
> * reduce compaction_throughput to reduce the memory churn
> * reduce in_memory_compaction_limit
> * if needed reduce concurrent_compactors
>
> Currently it seems like the memory used scales with the amount of bytes
> stored and not with how busy the server actually is.  That's not such a
> good thing.
>
> The memtable_total_space_in_mb in yaml tells C* how much memory to devote
> to the memtables. That with the global row cache setting says how much
> memory will be used with regard to "storing" data and it will not increase
> inline with the static data load.
>
> Now days GC issues are typically due to more dynamic forces, like
> compaction, repair and throughput.
>
> Hope that helps.
>
> -----
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/10/2012, at 6:59 AM, Bryan Talbot  wrote:
>
> ok, let me try asking the question a different way ...
>
> How does cassandra use memory and how can I plan how much is needed?  I
> have a 1 GB memtable and 5 GB total heap and that's still not enough even
> though the number of concurrent connections and garbage generation rate is
> fairly low.
>
> If I were using mysql or oracle, I could compute how much memory could be
> used by N concurrent connections, how much is allocated for caching, temp
> spaces, etc.  How can I do this for cassandra?  Currently it seems like the
> memory used scales with the amount of bytes stored and not with how busy
> the server actually is.  That's not such a good thing.
>
> -Bryan
>
>
>
> On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot wrote:
>
>> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
>> (64-bit), the nodes are often getting "stuck" in state where CMS
>> collections of the old space are constantly running.
>>
>> The JVM configuration is using the standard settings in cassandra-env --
>> relevant settings are included below.  The max heap is currently set to 5
>> GB with 800MB for new size.  I don't believe that the cluster is overly
>> busy and seems to be performing well enough other than this issue.  When
>> nodes get into this state they never seem to leave it (by freeing up old
>> space memory) without restarting cassandra.  They typically enter this
>> state while running "nodetool repair -pr" but once they start doing this,
>> restarting them only "fixes" it for a couple of hours.
>>
>> Compactions are completing and are generally not queued up.  All CF are
>> using STCS.  The busiest CF consumes about 100GB of space on disk, is write
>> heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
>> including those used for system keyspace and secondary indexes.  The number
>> of SSTables per node currently varies from 185-212.
>>
>> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx
>> full..." and "StorageSer

Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
These GC settings are the default (recommended?) settings from
cassandra-env.  I added the UseCompressedOops.

-Bryan


On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO wrote:

>  On 10/22/2012 09:05 PM, aaron morton wrote:
>
>  # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
>  You are too far behind the reference JVM's. Parallel GC is the preferred
> and highest performing form in the current Security Baseline version of the
> JVM's.
>



-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton wrote:

> The GC was on-going even when the nodes were not compacting or running a
> heavy application load -- even when the main app was paused constant the GC
> continued.
>
> If you restart a node is the onset of GC activity correlated to some event?
>

Yes and no.  When the nodes were generally under the .75 occupancy
threshold, a weekly "repair -pr" job would cause them to go over the
threshold and then stay there even after the repair had completed and there
were no ongoing compactions.  It acts as though some substantial amount of
memory used during the repair was never dereferenced once the repair was
complete.

Once one CF in particular grew larger, the constant GC would start up pretty
soon (less than 90 minutes) after a node restart even without a repair.




>
>
> As a test we dropped the largest CF and the memory
> usage immediately dropped to acceptable levels and the constant GC stopped.
>  So it's definitely related to data load.  memtable size is 1 GB, row cache
> is disabled and key cache is small (default).
>
> How many keys did the CF have per node?
> I dismissed the memory used to  hold bloom filters and index sampling.
> That memory is not considered part of the memtable size, and will end up in
> the tenured heap. It is generally only a problem with very large key counts
> per node.
>
>
I've changed the app to retain less data for that CF but I think that it
was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
are write-once, never updated, and rarely read.  There are no secondary
indexes for this particular CF.




>  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
> the default bloom_filter_fp_chance defaults to 0.0
>
> The default should be 0.000744.
>
> If the chance is zero or null this code should run when a new SSTable is
> written
>   // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance
> before, let's not let that break things
> logger.error("Bloom filter FP chance of zero isn't
> supposed to happen");
>
> Were the CF's migrated from an old version ?
>
>
Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
1.1.5 with an "upgradesstables" run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance setting
when it is at the default value.  JMX reports the actual fp rate, and if a
specific rate is set for a CF it shows up in "describe table", but I
couldn't find out how to tell what the default was.  I didn't inspect the
source.
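
(One way to see the measured numbers, if not the configured default, is
"nodetool -h <host> cfstats": its per-CF bloom filter lines show the observed
false-positive figures and the space used for each CF.)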



> Is there any way to predict how much memory the bloom filters will consume
> if the size of the row keys, number or rows is known, and fp chance is
> known?
>
>
> See o.a.c.utils.BloomFilter.getFilter() in the code
> This http://hur.st/bloomfilter appears to give similar results.
>
>
>

Ahh, very helpful.  This indicates that 714MB would be used for the bloom
filter for that one CF.
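
For reference, the standard (textbook, not Cassandra-specific) Bloom filter
sizing formula gives the same ballpark:

  bits = -n * ln(p) / (ln 2)^2
       = 400e6 * 7.21 / 0.4805  ~=  6.0e9 bits  ~=  715 MB

for n = 400M keys and p = 0.00074, which matches the 714MB figure above.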

JMX / cfstats reports "Bloom Filter Space Used", but the MBean method name
(getBloomFilterDiskSpaceUsed) indicates this is the on-disk space.  If
on-disk and in-memory space used are similar, then summing up all the "Bloom
Filter Space Used" values says they're currently consuming 1-2 GB of the
heap, which is substantial.

If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0?  It
just means more trips to SSTable indexes for a read, correct?  Trade RAM for
time (disk I/O).

-Bryan


Re: constant CMS GC using CPU time

2012-10-24 Thread Bryan Talbot
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli  wrote:

> On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot 
> wrote:
> > The nodes with the most data used the most memory.  All nodes are
> affected
> > eventually not just one.  The GC was on-going even when the nodes were
> not
> > compacting or running a heavy application load -- even when the main app
> was
> > paused constant the GC continued.
>
> This sounds very much like "my heap is so consumed by (mostly) bloom
> filters that I am in steady state GC thrash."
>

Yes, I think that was at least part of the issue.



>
> Do you have heap graphs which show a healthy sawtooth GC cycle which
> then more or less flatlines?
>
>

I didn't save any graphs but that is what they would look like.  I was
using jstat to monitor gc activity.

-Bryan


Re: constant CMS GC using CPU time

2012-10-25 Thread Bryan Talbot
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton wrote:

>  This sounds very much like "my heap is so consumed by (mostly) bloom
>> filters that I am in steady state GC thrash."
>>
>
> Yes, I think that was at least part of the issue.
>
>
> The rough numbers I've used to estimate working set are:
>
> * bloom filter size for 400M rows at 0.00074 fp without java fudge (they
> are just a big array) 714 MB
> * memtable size 1024 MB
> * index sampling:
> *  24 bytes + key (16 bytes for UUID) = 32 bytes
>  * 400M / 128 default sampling = 3,125,000
> *  3,125,000 * 32 = 95 MB
>  * java fudge X5 or X10 = 475MB to 950MB
> * ignoring row cache and key cache
>
> So the high side number is 2213 to 2,688. High because the fudge is a
> delicious sticky guess and the memtable space would rarely be full.
>
> On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some
> goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on
> the working set and how quickly stuff get's tenured which depends on the
> workload.
>

These values seem reasonable and in line with what I was seeing.  There are
other CF and apps sharing this cluster but this one was the largest.




>
> You can confirm these guesses somewhat manually by enabling all the GC
> logging in cassandra-env.sh. Restart the node and let it operate normally,
> probably best to keep repair off.
>
>
>
I was using jstat to monitor gc activity and some snippets from that are in
my original email in this thread.  The key behavior was that full gc was
running pretty often and never able to reclaim much (if any) space.




>
> There are a few things you could try:
>
> * increase the JVM heap by say 1Gb and see how it goes
> * increase bloom filter false positive,  try 0.1 first (see
> http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance
> )
> * increase index_interval sampling in yaml.
> * decreasing compaction_throughput and in_memory_compaction_limit can
> lesson the additional memory pressure compaction adds.
> * disable caches or ensure off heap caches are used.
>

I've done several of these already, in addition to changing the app to
reduce the number of rows retained.  How does compaction_throughput relate
to memory usage?  I assumed that was more for IO tuning.  I noticed that
lowering concurrent_compactors to 4 (from the default of 8) lowered the
memory used during compactions.  in_memory_compaction_limit_in_mb seems to
only matter for wide rows, and this CF didn't have any rows wider than that
limit.  My multithreaded_compaction is still false.



>
> Watching the gc logs and the cassandra log is a great way to get a feel
> for what works in your situation. Also take note of any scheduled
> processing your app does which may impact things, and look for poorly
> performing queries.
>
> Finally this book is a good reference on Java GC
> http://amzn.com/0137142528
>
> For my understanding what was the average row size for the 400 million
> keys ?
>
>

The compacted row mean size for the CF is 8815 (as reported by cfstats),
but that comes out to be much larger than the real load per node I was
seeing.  Each node had about 200GB of data for the CF, with 4 nodes in the
cluster and RF=3.  At the time, the TTL for all columns was 3 days and
gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and
set gc_grace_seconds to 0, so the number of rows and the data size have
dropped to a level it can handle.
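
(For completeness, the gc_grace change is just a per-CF schema update, e.g.
via cassandra-cli as sketched below with an example CF name.  A gc_grace of
0 is only reasonable here because these columns are never explicitly deleted,
so we don't rely on repair to propagate deletes:)

[default@MyKeyspace] update column family MyCF with gc_grace = 0;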


-Bryan


repair, compaction, and tombstone rows

2012-10-31 Thread Bryan Talbot
I've been experiencing an undesirable behavior that seems like a bug and
causes a large amount of wasted work.

I have a CF where all columns have a TTL, are generally all inserted in a
very short period of time (less than a second), and are never over-written
or explicitly deleted.  Eventually one node will run a compaction and
remove rows containing only tombstones more than gc_grace_seconds old,
which is expected.

The problem comes up when a repair is run.  During the repair, the other
nodes that haven't run a compaction and still have the tombstoned rows
"fix" the inconsistency and stream those rows (which contain only a
tombstone more than gc_grace_seconds old) back to the node which had
compacted them away.  This ends up occurring over and over and uses a lot
of time, storage, and bandwidth to keep repairing rows that are
intentionally missing.

I think the issue stems from the behavior of compaction of TTL rows and
repair.  The compaction of TTL rows is a node-local event which will
eventually cause tombstoned rows to disappear from the one node doing the
compaction and then get "repaired" from replicas later.  I guess this could
happen for rows which are explicitly deleted as well.

Is this a feature or a bug?  How can I avoid repair of rows that were
correctly removed via compaction from one node but not from replicas just
because compactions run independently on each node?  Every repair ends up
streaming tens of gigabytes of "missing" rows to and from replicas.

Cassandra 1.1.5 with size tiered compaction strategy and RF=3

-Bryan


Re: repair, compaction, and tombstone rows

2012-11-01 Thread Bryan Talbot
It seems like CASSANDRA-3442 might be an effective fix for this issue,
assuming that I'm reading it correctly.  It sounds like the intent is to
automatically compact SSTables when a certain percentage of their columns
are gcable, either because they were deleted or because their tombstones
have expired.  Is my understanding correct?

Would such tables be compacted individually (1-1) or are several eligible
tables selected and compacted using the STCS compaction threshold bounds?

-Bryan


On Thu, Nov 1, 2012 at 9:43 AM, Rob Coli  wrote:

> On Thu, Nov 1, 2012 at 1:43 AM, Sylvain Lebresne 
> wrote:
> > on all your columns), you may want to force a compaction (using the
> > JMX call forceUserDefinedCompaction()) of that sstable. The goal being
> > to get read of a maximum of outdated tombstones before running the
> > repair (you could also alternatively run a major compaction prior to
> > the repair, but major compactions have a lot of nasty effect so I
> > wouldn't recommend that a priori).
>
> If sstablesplit ("reverse compaction") existed, major compaction would
> be a simple solution to this case. You'd major compact and then split
> your One Giant SSTable With No Tombstones into a number of smaller
> ones. :)
>
> https://issues.apache.org/jira/browse/CASSANDRA-4766
>
> =Rob
>
> --
> =Robert Coli
> AIM>ALK - rc...@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb
>


Re: Cassandra upgrade issues...

2012-11-01 Thread Bryan Talbot
Note that 1.0.7 came out before 1.1, and I know there were some
compatibility issues fixed in later 1.0.x releases which could affect your
upgrade.  I think it would be best to first upgrade to the latest 1.0.x
release, and then upgrade to 1.1.x from there.

-Bryan



On Thu, Nov 1, 2012 at 1:27 AM, Brian Fleming wrote:

> Hi Sylvain,
>
> Simple as that!!!  Using the 1.1.5 nodetool version works as expected.  My
> mistake.
>
> Many thanks,
>
> Brian
>
>
>
>
> On Thu, Nov 1, 2012 at 8:24 AM, Sylvain Lebresne wrote:
>
>> The first thing I would check is if nodetool is using the right jar. I
>> sounds a lot like if the server has been correctly updated but
>> nodetool haven't and still use the old classes.
>> Check the nodetool executable, it's a shell script, and try echoing
>> the CLASSPATH in there and check it correctly point to what it should.
>>
>> --
>> Sylvain
>>
>> On Thu, Nov 1, 2012 at 9:10 AM, Brian Fleming 
>> wrote:
>> > Hi,
>> >
>> >
>> >
>> > I was testing upgrading from Cassandra v.1.0.7 to v.1.1.5 yesterday on a
>> > single node dev cluster with ~6.5GB of data & it went smoothly in that
>> no
>> > errors were thrown, the data was migrated to the new directory
>> structure, I
>> > can still read/write data as expected, etc.  However nodetool commands
>> are
>> > behaving strangely – full details below.
>> >
>> >
>> >
>> > I couldn’t find anything relevant online relating to these exceptions –
>> any
>> > help/pointers would be greatly appreciated.
>> >
>> >
>> >
>> > Thanks & Regards,
>> >
>> >
>> >
>> > Brian
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > ‘nodetool cleanup’ runs successfully
>> >
>> >
>> >
>> > ‘nodetool info’ produces :
>> >
>> >
>> >
>> > Token: 82358484304664259547357526550084691083
>> >
>> > Gossip active: true
>> >
>> > Load : 7.69 GB
>> >
>> > Generation No: 1351697611
>> >
>> > Uptime (seconds) : 58387
>> >
>> > Heap Memory (MB) : 936.91 / 1928.00
>> >
>> > Exception in thread "main" java.lang.ClassCastException:
>> java.lang.String
>> > cannot be cast to org.apache.cassandra.dht.Token
>> >
>> > at
>> > org.apache.cassandra.tools.NodeProbe.getEndpoint(NodeProbe.java:546)
>> >
>> > at
>> > org.apache.cassandra.tools.NodeProbe.getDataCenter(NodeProbe.java:559)
>> >
>> > at
>> org.apache.cassandra.tools.NodeCmd.printInfo(NodeCmd.java:313)
>> >
>> > at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:651)
>> >
>> >
>> >
>> > ‘nodetool repair’ produces :
>> >
>> > Exception in thread "main"
>> java.lang.reflect.UndeclaredThrowableException
>> >
>> > at $Proxy0.forceTableRepair(Unknown Source)
>> >
>> > at
>> >
>> org.apache.cassandra.tools.NodeProbe.forceTableRepair(NodeProbe.java:203)
>> >
>> > at
>> > org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:880)
>> >
>> > at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:719)
>> >
>> > Caused by: javax.management.ReflectionException: Signature mismatch for
>> > operation forceTableRepair: (java.lang.String, [Ljava.lang.String;)
>> should
>> > be (java.lang.String, boolean, [Ljava.lang.String;)
>> >
>> > at
>> > com.sun.jmx.mbeanserver.PerInterface.noSuchMethod(PerInterface.java:152)
>> >
>> > at
>> > com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:117)
>> >
>> > at
>> > com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
>> >
>> > at
>> >
>> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
>> >
>> > at
>> > com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
>> >
>> > at
>> >
>> javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
>> >
>> > at
>> >
>> javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
>> >
>> > at
>> >
>> javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
>> >
>> > at
>> >
>> javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
>> >
>> > at
>> >
>> javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
>> >
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >
>> > at java.lang.reflect.Method.invoke(Method.java:597)
>> >
>> > at
>> > sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
>> >
>> > at sun.rmi.transport.Transport$1.run(Transport.java:159)
>> >
>> > at java.security.AccessController.doPrivileged(Native Method)
>> >
>> > at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
>> >
>> > at
>> > sun.r

Re: How to upgrade a ring (0.8.9 nodes) to 1.1.5 with the minimal downtime?

2012-11-05 Thread Bryan Talbot
Do a rolling upgrade of the ring to 1.0.12 first and then upgrade to 1.1.x.
 After each rolling upgrade, you should probably run the recommended "nodetool
upgradesstables", etc.  The datastax documentation about upgrading might be
helpful for you: http://www.datastax.com/docs/1.1/install/upgrading

-Bryan


On Mon, Nov 5, 2012 at 10:55 AM, Yan Wu  wrote:

> Hello,
>
> I have a Cassandra ring with 4 nodes in 0.8.9 and like to upgrade all
> nodes to 1.1.5.
> It would be great that the upgrade has no downtime or minimal downtime of
> the ring.
> After I brought down one of the nodes and upgraded it to 1.1.5, when I
> tried to bring it up,
> the new 1.1.5 node looks good but the rest of three 0.8.9 nodes started
> throwing exceptions:
> ---
> Fatal exception in thread Thread[GossipStage:2,5,main]
> java.lang.UnsupportedOperationException: Not a time-based UUID
> at
> org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:92)
> at
> org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:75)
> at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:707)
> at
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:750)
> at
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:809)
> at
> org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 
> Then later
> 
> ERROR 12:03:20,925 Fatal exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Could not reach
> schema agreement with /xx.xx.xx.xx in 6ms
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.RuntimeException: Could not reach schema agreement
> with /xx.xx.xx.xx in 6ms
> at
> org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293)
> at
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304)
> at
> org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89)
> at
> org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> ... 3 more
> 
>
> Any suggestions?   Thanks in advance.
>
> Yan
>


Re: repair, compaction, and tombstone rows

2012-11-05 Thread Bryan Talbot
Speaking as the OP of this thread: it is a big itch for my use case.  Repair
ends up streaming tens of gigabytes of data whose TTL has expired and which
has been compacted away on some nodes but not yet on others.  The wasted
work is bad enough, plus it drives up the memory usage (for bloom filters,
indexes, etc.) of all nodes since there are many more rows to track than
planned.  Disabling the periodic repair lowered the per-node load by 100GB,
which was all dead data in my case.

-Bryan


On Mon, Nov 5, 2012 at 5:12 PM, horschi  wrote:

>
>
> That's true, we could just create an already gcable tombstone. It's a bit
>> of an abuse of the localDeletionTime but why not. Honestly a good part of
>> the reason we haven't done anything yet is because we never really had
>> anything for which tombstones of expired columns where a big pain point.
>> Again, feel free to open a ticket (but what we should do is retrieve the
>> ttl from the localExpirationTime when creating the tombstone, not using the
>> creation time (partly because that creation time is a user provided
>> timestamp so we can't use it, and because we must still keep tombstones if
>> the ttl < gcGrace)).
>>
>
> Created CASSANDRA-4917. I changed the example implementation to use
> (localExpirationTime-timeToLive) for the tombstone. I agree this is not the
> biggest itch to scratch. But it might save a few seeks here and there :-)
>
>
> Did you also have a look at DeletedColumn? It uses the updateDigest
> implementation from its parent class, which applies also the value to the
> digest. Unfortunetaly the value is the localDeletionTime, which is being
> generated on each node individually, right? (at RowMutation.delete)
> The resolution of the time is low, so there is a good chance the
> timestamps will match on all nodes, but that should be nothing to rely on.
>
>
> cheers,
> Christian
>
>
>
>
>


Re: repair, compaction, and tombstone rows

2012-11-06 Thread Bryan Talbot
On Tue, Nov 6, 2012 at 8:27 AM, horschi  wrote:

>
>
>> it is a big itch for my use case.  Repair ends up streaming tens of
>> gigabytes of data which has expired TTL and has been compacted away on some
>> nodes but not yet on others.  The wasted work is not nice plus it drives up
>> the memory usage (for bloom filters, indexes, etc) of all nodes since there
>> are many more rows to track than planned.  Disabling the periodic repair
>> lowered the per-node load by 100GB which was all dead data in my case.
>
>
> What is the issue with your setup? Do you use TTLs or do you think its due
> to DeletedColumns?  Was your intension to push the idea of removing
> localDeletionTime from DeletedColumn.updateDigest ?
>
>
>
I don't know enough about the code-level implementation to comment on the
validity of the fix.  My main issue is that we use a lot of TTL columns, and
in many cases all columns have a TTL that is less than gc_grace.  The
problem arises when the columns are gc-able and have been compacted away on
one node but not on all replicas: the periodic repair process ends up
copying all the garbage columns & rows back to all the other replicas.  It
consumes a lot of repair resources and makes rows stick around for much
longer than they really should, which consumes even more cluster resources.
-Bryan


Re: Admin for cassandra?

2012-11-16 Thread Bryan Talbot
The https://github.com/sebgiroux/Cassandra-Cluster-Admin app does some
of what you're asking.  It allows basic browsing and some admin
functionality.  If you want to run actual CQL queries though, you
currently need to use another app for that (like cqlsh).

-Bryan


On Thu, Nov 15, 2012 at 11:30 PM, Timmy Turner  wrote:
> I think an eclipse plugin would be the wrong way to go here. Most people
> probably just want to browse through the columnfamilies and see whether
> their queries work out or not. This functionality is imho best implemented
> as some form of a light-weight editor, not a full blown IDE.
>
> I do have something of this kind scheduled as small part of a larger project
> (seeing as how there is currently no properly working tool that provides
> this functionality), but concrete results are probably still a few months
> out..
>
>
> 2012/11/16 Edward Capriolo 
>>
>> We should build an eclipse plugin named Eclipsandra or something.
>>
>> On Thu, Nov 15, 2012 at 9:45 PM, Wz1975  wrote:
>> > Cqlsh is probably the closest you will get. Or pay big bucks to hire
>> > someone
>> > to develop one for you:)
>> >
>> >
>> > Thanks.
>> > -Wei
>> >
>> > Sent from my Samsung smartphone on AT&T
>> >
>> >
>> >  Original message 
>> > Subject: Admin for cassandra?
>> > From: Kevin Burton 
>> > To: user@cassandra.apache.org
>> > CC:
>> >
>> >
>> > Is there an IDE for a Cassandra database? Similar to the SQL Server
>> > Management Studio for SQL server. I mainly want to execute queries and
>> > see
>> > the results. Preferably that runs under a Windows OS.
>> >
>> >
>> >
>> > Thank you.
>> >
>> >
>
>


Re: need some help with row cache

2012-11-27 Thread Bryan Talbot
On Tue, Nov 27, 2012 at 8:16 PM, Yiming Sun  wrote:
> Hello,
>
> but it is not clear to me where this setting belongs to, because even in the
> v1.1.6 conf/cassandra.yaml,  there is no such property, and apparently
> adding this property to the yaml causes a fatal configuration error upon
> server startup,
>

It's a per column family setting that can be applied using the CLI or CQL.

With CQL3 it would be

ALTER TABLE <table_name> WITH caching = 'rows_only';

to enable the row cache but no key cache for that CF.

-Bryan


Re: need some help with row cache

2012-11-28 Thread Bryan Talbot
The row cache itself is global and the size is set with
row_cache_size_in_mb.  It must be enabled per CF using the proper
settings.  CQL3 isn't complete yet in C* 1.1 so if the cache settings
aren't shown there, then you'll probably need to use cassandra-cli.
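
A sketch of the CLI equivalent (keyspace and CF names are just examples):

[default@unknown] use MyKeyspace;
[default@MyKeyspace] update column family MyCF with caching = 'rows_only';
[default@MyKeyspace] describe MyCF;

The "describe" output should show the caching setting for the CF.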

-Bryan


On Tue, Nov 27, 2012 at 10:41 PM, Wz1975  wrote:
> Use cassandracli.
>
>
> Thanks.
> -Wei
>
> Sent from my Samsung smartphone on AT&T
>
>
>  Original message 
> Subject: Re: need some help with row cache
> From: Yiming Sun 
> To: user@cassandra.apache.org
> CC:
>
>
> Also, what command can I used to see the "caching" setting?  "DESC TABLE
> " doesn't list caching at all.  Thanks.
>
> -- Y.
>
>
> On Wed, Nov 28, 2012 at 12:15 AM, Yiming Sun  wrote:
>>
>> Hi Bryan,
>>
>> Thank you very much for this information.  So in other words, the settings
>> such as row_cache_size_in_mb in YAML alone are not enough, and I must also
>> specify the caching attribute on a per column family basis?
>>
>> -- Y.
>>
>>
>> On Tue, Nov 27, 2012 at 11:57 PM, Bryan Talbot 
>> wrote:
>>>
>>> On Tue, Nov 27, 2012 at 8:16 PM, Yiming Sun  wrote:
>>> > Hello,
>>> >
>>> > but it is not clear to me where this setting belongs to, because even
>>> > in the
>>> > v1.1.6 conf/cassandra.yaml,  there is no such property, and apparently
>>> > adding this property to the yaml causes a fatal configuration error
>>> > upon
>>> > server startup,
>>> >
>>>
>>> It's a per column family setting that can be applied using the CLI or
>>> CQL.
>>>
>>> With CQL3 it would be
>>>
>>> ALTER TABLE  WITH caching = 'rows_only';
>>>
>>> to enable the row cache but no key cache for that CF.
>>>
>>> -Bryan
>>
>>
>


Re: outOfMemory error

2012-11-28 Thread Bryan Talbot
Well, asking for 500MB of data at once from a server with such modest
specs is asking for trouble.  Here are my suggestions; a rough config
sketch follows the list.

Disable the 1 GB row cache
Consider allocating that memory for the java heap "Xms2500m Xmx2500m"
Don't fetch all the columns at once -- page through them a slice at a time
Increase the memtable to more than 64 MB if you want to write data to
this cluster
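
A sketch of the first, second, and fourth suggestions.  The exact files
depend on your install (on Windows the heap is usually set in cassandra.bat
rather than cassandra-env.sh), and 256 MB is just an example value:

# conf/cassandra.yaml
row_cache_size_in_mb: 0
memtable_total_space_in_mb: 256

# JVM heap (cassandra-env.sh or cassandra.bat)
-Xms2500M -Xmx2500M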

-Bryan



On Wed, Nov 28, 2012 at 5:06 AM, Damien Lejeune  wrote:
> Hi all,
>
> I'm currently experiencing a outOfMemory problem with Cassandra-1.1.6 on
> Windows XP-Pro (32-bit). The server crashes when I try to query it with a
> relatively small amount of data (around 100 rows with 5 columns each: to
> be precise, on my configuration, querying 75 or more rows makes the server
> to crash).
> I tried with different library (Hector, JDBC, Thrift) and with the Cassandra
> stress tool. All lead to the same outOfMemory problem.
>
> My dataset is composed, for each row, of: 1 column in DateType, 4
> columns in DoubleType. I ran a query to fetch the entire dataset (around
> 330MB for the raw data + around 200MB for the metadata) and got the log at
> the end of this message.
>
> I also checked the heap-dump with Mat which displays these top values:
> Class Name
>  Objects  Shallow Heap
> java.nio.HeapByteBuffer   16,253,559
> 780,170,832
> bytes[]   16,254,013
> 330,207,640 <-- Data ?
> java.util.TreeMap$Entry8,126,711
> 260,054,752
> org.apache.cassandra.db.Column 8,116,589
> 194,798,136 <-- Metadata ?
>
> I tried to change the configuration in Cassandra for the values:
> - row_cache_size_in_mb: tried different value between [0,1000] MB
> - flush_largest_memtables_at: set to 0.1, but tried with 0.75
> - reduce_cache_sizes_at: tried 0.85, 0.6, 0.2 and 0.1
> - reduce_cache_capacity_to: tried 0.6 and 0.15
> - memtable_total_space_in_mb: 64 MB, but also tried to disable it (-> 1/3 of
> the heap)
> - Xms1G
> - Xmx1500M
> with no real observable improvements regarding my problem.
>
> My Cassandra server and client both run on the same machine.
>
> Here are the characteristics of my system configuration:
> - Cassandra-1.1.6
> - java version "1.6.0_20"
>  Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>  Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
> - Windows XP-Pro 32 bits with service pack 3
> - CPU double-core, 32 bits @2.26GHz
> - 3.48 of RAM
>
> I'm aware that my system configuration is not an optimized environment to
> make Cassandra to run efficiently, but I wonder if you guys know a
> workaround (or any idea on how) to fix this problem. Part of the answer is
> probably that I do not have enough RAM to run the process, but I also wonder
> if it is a 'normal' behaviour for Cassandra to handle this particular test
> case that way.
>
> Cheers,
>
> Damien
>
>  Cassandra's LOG ---
>
> Starting Cassandra Server
>  INFO 09:10:27,171 Logging initialized
>  INFO 09:10:27,171 JVM vendor/version: Java HotSpot(TM) Client VM/1.6.0_18
>  INFO 09:10:27,171 Heap size: 1072103424/1569521664
>  INFO 09:10:27,171 Classpath:
> E:\dl_benchmark\db\apache-cassandra-1.1.6\conf;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\antlr-3.2.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\apache-cassandra-1.1.6.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\apache-cassandra-clientutil-1.1.6.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\apache-cassandra-thrift-1.1.6.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\avro-1.4.0-fixes.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\avro-1.4.0-sources-fixes.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\commons-cli-1.1.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\commons-codec-1.2.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\commons-lang-2.4.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\compress-lzf-0.8.4.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\concurrentlinkedhashmap-lru-1.3.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\guava-r08.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\high-scale-lib-1.1.2.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\jackson-core-asl-1.9.2.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\jackson-mapper-asl-1.9.2.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\jamm-0.2.5.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\jline-0.9.94.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\json-simple-1.1.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\libthrift-0.7.0.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\log4j-1.2.16.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\metrics-core-2.0.3.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\servlet-api-2.5-20081211.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\slf4j-api-1.6.1.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\slf4j-log4j12-1.6.1.jar;E:\dl_benchmark\db\apache-cassandra-1.1.6\lib\snakeyaml-1.6.ja

Re: CQL timestamps and timezones

2012-12-07 Thread Bryan Talbot
With 1.1.5, the TS is displayed with the local timezone and seems correct.

cqlsh:bat> create table test (id uuid primary key, ts timestamp );
cqlsh:bat> insert into test (id,ts) values (
'89d09c88-40ac-11e2-a1e2-6067201fae78',  '2012-12-07T10:00:00-0000');
cqlsh:bat> select * from test;
 id   | ts
--+--
 89d09c88-40ac-11e2-a1e2-6067201fae78 | 2012-12-07 02:00:00-0800

cqlsh:bat>


-Bryan


On Fri, Dec 7, 2012 at 1:14 PM, B. Todd Burruss  wrote:

> trying to figure out if i'm doing something wrong or a bug.  i am
> creating a simple schema, inserting a timestamp using ISO8601 format,
> but when retrieving the timestamp, the timezone is displayed
> incorrectly.  i'm inserting using GMT, the result is shown with
> "+", but the time is for my local timezone (-0800)
>
> tried with 1.1.6 (DSE 2.2.1), and 1.2.0-rc1-SNAPSHOT
>
> here's the trace:
>
> bin/cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 2.3.0 | Cassandra 1.2.0-rc1-SNAPSHOT | CQL spec 3.0.0 | Thrift
> protocol 19.35.0]
> Use HELP for help.
> cqlsh> CREATE KEYSPACE btoddb WITH replication =
> {'class':'SimpleStrategy', 'replication_factor':1};
> cqlsh>
> cqlsh> USE btoddb;
> cqlsh:btoddb> CREATE TABLE test (
>   ...   id uuid PRIMARY KEY,
>   ...   ts TIMESTAMP
>   ... );
> cqlsh:btoddb>
> cqlsh:btoddb> INSERT INTO test
>   ...   (id, ts)
>   ...   values (
>   ... '89d09c88-40ac-11e2-a1e2-6067201fae78',
>   ... '2012-12-07T10:00:00-0000'
>   ...   );
> cqlsh:btoddb>
> cqlsh:btoddb> SELECT * FROM test;
>
>  id   | ts
> --+--
>  89d09c88-40ac-11e2-a1e2-6067201fae78 | 2012-12-07 02:00:00+0000
>
> cqlsh:btoddb>
>


Re: High disk read throughput on only one node.

2012-12-19 Thread Bryan Talbot
Or maybe the clients always connect to that server which can satisfy all
reads.  They have 3 nodes with RF=3.

-Bryan


On Wed, Dec 19, 2012 at 12:15 PM, aaron morton wrote:

> Is there a sustained difference or did it settle back ?
> Could this have been compaction or repair or upgrade tables working ?
>
> Do the read / write counts available in nodetool cfstats show anything
> different ?
>
> Cheers
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/12/2012, at 6:26 AM, Alain RODRIGUEZ  wrote:
>
> Hi,
>
> I am experimenting a strange issue in my C* 1.1.6, 3 node, RF=3 cluster.
>
> root@ip-10-64-177-38:~# nodetool ring
> Note: Ownership information does not include topology, please specify a
> keyspace.
> Address DC  RackStatus State   Load
>  OwnsToken
>
>  141784319550391032739561396922763706368
> 10.64.167.32eu-west 1b  Up Normal  178.37 GB
> 33.33%  28356863910078203714492389662765613056
> 10.250.202.154  eu-west 1b  Up Normal  174.93 GB
> 33.33%  85070591730234615865843651857942052863
> 10.64.177.38eu-west 1b  Up Normal  167.13 GB
> 33.33%  141784319550391032739561396922763706368
>
> root@ip-10-64-177-38:~# nodetool ring cassa_teads
> Address DC  RackStatus State   Load
>  Effective-Ownership Token
>
>  141784319550391032739561396922763706368
> 10.64.167.32eu-west 1b  Up Normal  178.37 GB
> 100.00% 28356863910078203714492389662765613056
> 10.250.202.154  eu-west 1b  Up Normal  174.93 GB
> 100.00% 85070591730234615865843651857942052863
> 10.64.177.38eu-west 1b  Up Normal  167.13 GB
> 100.00% 141784319550391032739561396922763706368
>
> My cluster is well balanced, all the nodes have an identical
> configuration, but yet I have a lot of disk reads on one of them as you can
> see in these screenshots:
>
> Datastax OpsCenter :
> http://img4.imageshack.us/img4/2528/datastaxopscenterheighr.png
> or
> AWS console :
> http://img59.imageshack.us/img59/5223/ec2managementconsole.png
>
> I have tried to see what is read from any nodes with "inotifywatch -r
> -t300 /raid0 > inotifywatch5min" and get the following result:
>
> root@ip-10-64-177-38:~# cat inotifywatch5min
> total   access  close_nowrite  open  filename
> 234580   113280  60691 60609
>  /raid0/cassandra/data/cassa_teads/data_viewer/
> 56013  27108   1445414451
> /raid0/cassandra/data/cassa_teads/data_ip_viewer/
> 30748  14998   7884  7866
>  /raid0/cassandra/data/cassa_teads/algo_ad_newcapping/
> 301  147  76  78
> /raid0/cassandra/data/cassa_teads/data_transac/
> 191   95   48  48
>/raid0/cassandra/data/cassa_teads/data_cust_website_viewer/
> 6   033   /raid0/cassandra/
> 2   011
>   /raid0/cassandra/data/
> 2   011
>   /raid0/cassandra/commitlog/
> 2   011
>   /raid0/cassandra/saved_caches/
>
>
> root@ip-10-250-202-154:~# cat inotifywatch5min
> total   access  modify  close_write  close_nowrite  open  moved_from
>  moved_to  create  delete  filename
> 307378  115456  77706   1257119 57035
>  /raid0/cassandra/data/cassa_teads/data_viewer/
> 5539526878   0   0 14259 14258
>   /raid0/cassandra/data/cassa_teads/data_ip_viewer/
> 3615517653   0   0 9256   9246
>  /raid0/cassandra/data/cassa_teads/algo_ad_newcapping/
> 7377  188  7153  6 411
>   /raid0/cassandra/data/cassa_teads/data_action/
> 4010 3646 412
>   /raid0/cassandra/data/cassa_teads/stats_ad_uv/
> 244120  0   0 62   62
>  /raid0/cassandra/data/cassa_teads/data_transac/
> 160760   0 42   42
> /raid0/cassandra/data/cassa_teads/data_cust_website_viewer/
> 26  0 0   0 13
> 13  /raid0/cassandra/data/cassa_teads/
> 12  0 2   2 1
>  3   /raid0/cassandra/commitlog/
> 60 0   0 3
>  3  /raid0/cassandra/
> 20 0   0 1
>  1  /raid0/cassandra/data/
> 20 0   0 1
>  1  /raid0/cassandra/saved_caches/
>
>
> root@ip-10-64-167-32:~# cat inotifywatch5min
> total   access  mo

Re: High disk read throughput on only one node.

2012-12-19 Thread Bryan Talbot
Oh, you're on ec2.  Maybe the dynamic snitch is detecting that one node is
performing better than the others so is routing more traffic to it?

http://www.datastax.com/docs/1.1/configuration/node_configuration#dynamic-snitch-badness-threshold
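
If it is the snitch, the relevant knob lives in cassandra.yaml; a minimal
sketch (the value below is only an example -- check the doc above for what
your version defaults to):

# 0.0 sends each read to whichever replica currently scores best, which can
# concentrate traffic on one node; a value like 0.1 keeps reads pinned to a
# replica until it scores ~10% worse than the alternatives
dynamic_snitch_badness_threshold: 0.1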

-Bryan




On Wed, Dec 19, 2012 at 2:30 PM, Alain RODRIGUEZ  wrote:

> @Aaron
> "Is there a sustained difference or did it settle back ? "
>
> Sustained, clearly. During the day all nodes read at about 6MB/s while
> this one reads at 30-40 MB/s. At night while other reads 2MB/s the "broken"
> nodes reads at 8-10MB/s
>
> "Could this have been compaction or repair or upgrade tables working ? "
>
> Was my first thought but definitely no. this occurs continuously.
>
> "Do the read / write counts available in nodetool cfstats show anything
> different ? "
>
> The cfstats shows different counts (a lot less reads/writes for the "bad"
> node)  but they didn't join the ring at the same time. I join you the
> cfstats just in case it could help somehow.
>
> Node  38: http://pastebin.com/ViS1MR8d (bad one)
> Node  32: http://pastebin.com/MrSTHH9F
> Node 154: http://pastebin.com/7p0Usvwd
>
> @Bryan
>
>  "clients always connect to that server"
>
> I didn't join it in the screenshot from AWS console, but AWS report an
> (almost) equal network within the nodes (same for output and cpu). The cpu
> load is a lot higher in the broken node as shown by the OpsCenter, but
> that's due to the high iowait...)
>



-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: State of Cassandra and Java 7

2012-12-21 Thread Bryan Talbot
Brian, did any of your issues with java 7 result in corrupting data in
cassandra?

We just ran into an issue after upgrading a test cluster from Cassandra
1.1.5 and Oracle JDK 1.6.0_29-b11 to Cassandra 1.1.7 and 7u10.

What we saw is that values in columns with validation class
org.apache.cassandra.db.marshal.LongType that were proper integers became
corrupted so that they ended up stored as strings.  I don't have
a reproducible test case yet but will work on making one over the holiday
if I can.

For example, a column with a long type that was originally written and
stored properly (say with value 1200) was somehow changed during cassandra
operations (compaction seems the only possibility) to be the value '1200'
with quotes.

The data was written using the phpcassa library and that application and
library haven't been changed.  This has only happened on our test cluster
which was upgraded and hasn't happened on our live cluster which was not
upgraded.  Many of our column families were affected and all affected
columns are Long (or bigint for cql3).

Errors when reading using the CQL3 command line client look like this:

Failed to decode value '1356441225' (for column 'expires') as bigint:
unpack requires a string argument of length 8

and when reading with cassandra-cli the error is

[default@cf] get
token['fbc1e9f7cc2c0c2fa186138ed28e5f691613409c0bcff648c651ab1f79f9600b'];
=> (column=client_id, value=8ec4c29de726ad4db3f89a44cb07909c04f90932d,
timestamp=1355836425784329, ttl=648000)
A long is exactly 8 bytes: 10




-Bryan





On Mon, Dec 17, 2012 at 7:33 AM, Brian Tarbox wrote:

> I was using jre-7u9-linux-x64  which was the latest at the time.
>
> I'll confess that I did not file any bugs...at the time the advice from
> both the Cassandra and Zookeeper lists was to stay away from Java 7 (and my
> boss had had enough of my reporting that "*the problem was Java 7"* for
> me to spend a lot more time getting the details).
>
> Brian
>
>
> On Sun, Dec 16, 2012 at 4:54 AM, Sylvain Lebresne wrote:
>
>> On Sat, Dec 15, 2012 at 7:12 PM, Michael Kjellman <
>> mkjell...@barracuda.com> wrote:
>>
>>> What "issues" have you ran into? Actually curious because we push
>>> 1.1.5-7 really hard and have no issues whatsoever.
>>>
>>>
>> A related question is "which which version of java 7 did you try"? The
>> first releases of java 7 were apparently famous for having many issues but
>> it seems the more recent updates are much more stable.
>>
>> --
>> Sylvain
>>
>>
>>> On Dec 15, 2012, at 7:51 AM, "Brian Tarbox" 
>>> wrote:
>>>
>>> We've reverted all machines back to Java 6 after running into numerous
>>> Java 7 issues...some running Cassandra, some running Zookeeper, others just
>>> general problems.  I don't recall any other major language release being
>>> such a mess.
>>>
>>>
>>> On Fri, Dec 14, 2012 at 5:07 PM, Bill de hÓra  wrote:
>>>
 "At least that would be one way of defining "officially supported".

 Not quite, because, Datastax is not Apache Cassandra.

 "the only issue related to Java 7 that I know of is CASSANDRA-4958, but
 that's osx specific (I wouldn't advise using osx in production anyway) and
 it's not directly related to Cassandra anyway so you can easily use the
 beta version of snappy-java as a workaround if you want to. So that non
 blocking issue aside, and as far as we know, Cassandra supports Java 7. Is
 it rock-solid in production? Well, only repeated use in production can
 tell, and that's not really in the hand of the project."

 Exactly right. If enough people use Cassandra on Java7 and enough
 people file bugs about Java 7 and enough people work on bugs for Java 7
 then Cassandra will eventually work well enough on Java7.

 Bill

 On 14 Dec 2012, at 19:43, Drew Kutcharian  wrote:

 > In addition, the DataStax official documentation states: "Versions
 earlier than 1.6.0_19 should not be used. Java 7 is not recommended."
 >
 > http://www.datastax.com/docs/1.1/install/install_rpm
 >
 >
 >
 > On Dec 14, 2012, at 9:42 AM, Aaron Turner 
 wrote:
 >
 >> Does Datastax (or any other company) support Cassandra under Java 7?
 >> Or will they tell you to downgrade when you have some problem,
 because
 >> they don't support C* running on 7?
 >>
 >> At least that would be one way of defining "officially supported".
 >>
 >> On Fri, Dec 14, 2012 at 2:22 AM, Sylvain Lebresne <
 sylv...@datastax.com> wrote:
 >>> What kind of official statement do you want? As far as I can be
 considered
 >>> an official voice of the project, my statement is: "various people
 run in
 >>> production with Java 7 and it seems to work".
 >>>
 >>> Or to answer the initial question, the only issue related to Java 7
 that I
 >>> know of is CASSANDRA-4958, but that's osx specific (I wouldn't
 advise using
 >>> osx in production anyway) and it's not direct

LCS not removing rows with all TTL expired columns

2013-01-16 Thread Bryan Talbot
On cassandra 1.1.5 with a write heavy workload, we're having problems
getting rows to be compacted away (removed) even though all columns have
expired TTL.  We've tried size tiered and now leveled and are seeing the
same symptom: the data stays around essentially forever.

Currently we write all columns with a TTL of 72 hours (259200 seconds) and
expect to add 10 GB of data to this CF per day per node.  Each node
currently has 73 GB for the affected CF and shows no indications that old
rows will be removed on their own.

Why aren't rows being removed?  Below is some data from a sample row which
should have been removed several days ago but is still around even though
it has been involved in numerous compactions since being expired.

$> ./bin/nodetool -h localhost getsstables metrics request_summary
459fb460-5ace-11e2-9b92-11d67b6163b4
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

$> ls -alF
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
-rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db

$> ./bin/sstable2json
/virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
-k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 "%x"')
{
"34353966623436302d356163652d313165322d396239322d313164363762363136336234":
[["app_name","50f21d3d",1357785277207001,"d"],
["client_ip","50f21d3d",1357785277207001,"d"],
["client_req_id","50f21d3d",1357785277207001,"d"],
["mysql_call_cnt","50f21d3d",1357785277207001,"d"],
["mysql_duration_us","50f21d3d",1357785277207001,"d"],
["mysql_failure_call_cnt","50f21d3d",1357785277207001,"d"],
["mysql_success_call_cnt","50f21d3d",1357785277207001,"d"],
["req_duration_us","50f21d3d",1357785277207001,"d"],
["req_finish_time_us","50f21d3d",1357785277207001,"d"],
["req_method","50f21d3d",1357785277207001,"d"],
["req_service","50f21d3d",1357785277207001,"d"],
["req_start_time_us","50f21d3d",1357785277207001,"d"],
["success","50f21d3d",1357785277207001,"d"]]
}


Decoding the column timestamps shows that the columns were written at
"Thu, 10 Jan 2013 02:34:37 GMT" and that their TTL expired at "Sun, 13 Jan
2013 02:34:37 GMT".  The date of the SSTable shows that it was generated on
Jan 16 which is 3 days after all columns have TTL-ed out.
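
For anyone who wants to check that arithmetic: the column timestamps above
are microseconds since the epoch, so dropping the last six digits gives
seconds.  A quick sketch with GNU date (BSD date wants -r instead of -d @):

$> date -u -d @1357785277
Thu Jan 10 02:34:37 UTC 2013
$> date -u -d @$((1357785277 + 259200))
Sun Jan 13 02:34:37 UTC 2013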


The schema shows that gc_grace is set to 0 since this data is write-once,
read-seldom and is never updated or deleted.

create column family request_summary
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 0
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and bloom_filter_fp_chance = 1.0
  and compression_options = {'chunk_length_kb' : '64',
'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};


Thanks in advance for help in understanding why rows such as this are not
removed!

-Bryan


Re: LCS not removing rows with all TTL expired columns

2013-01-16 Thread Bryan Talbot
According to the timestamps (see original post) the SSTable was written
(thus compacted) 3 days after all columns for that row had expired and 6
days after the row was created; yet all columns are still showing up in the
SSTable.  Note that a "get" for that key returns no rows, so that part is
working correctly, but the data is lugged around far longer than it should
be -- maybe forever.


-Bryan


On Wed, Jan 16, 2013 at 5:44 PM, Andrey Ilinykh  wrote:

> To get column removed you have to meet two requirements
> 1. column should be expired
> 2. after that CF gets compacted
>
> I guess your expired columns are propagated to high tier CF, which gets
> compacted rarely.
> So, you have to wait when high tier CF gets compacted.
>
> Andrey
>
>
>
> On Wed, Jan 16, 2013 at 11:39 AM, Bryan Talbot wrote:
>
>> On cassandra 1.1.5 with a write heavy workload, we're having problems
>> getting rows to be compacted away (removed) even though all columns have
>> expired TTL.  We've tried size tiered and now leveled and are seeing the
>> same symptom: the data stays around essentially forever.
>>
>> Currently we write all columns with a TTL of 72 hours (259200 seconds)
>> and expect to add 10 GB of data to this CF per day per node.  Each node
>> currently has 73 GB for the affected CF and shows no indications that old
>> rows will be removed on their own.
>>
>> Why aren't rows being removed?  Below is some data from a sample row
>> which should have been removed several days ago but is still around even
>> though it has been involved in numerous compactions since being expired.
>>
>> $> ./bin/nodetool -h localhost getsstables metrics request_summary
>> 459fb460-5ace-11e2-9b92-11d67b6163b4
>>
>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
>>
>> $> ls -alF
>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
>> -rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
>>
>> $> ./bin/sstable2json
>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
>> -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 "%x"')
>> {
>> "34353966623436302d356163652d313165322d396239322d313164363762363136336234":
>> [["app_name","50f21d3d",1357785277207001,"d"],
>> ["client_ip","50f21d3d",1357785277207001,"d"],
>> ["client_req_id","50f21d3d",1357785277207001,"d"],
>> ["mysql_call_cnt","50f21d3d",1357785277207001,"d"],
>> ["mysql_duration_us","50f21d3d",1357785277207001,"d"],
>> ["mysql_failure_call_cnt","50f21d3d",1357785277207001,"d"],
>> ["mysql_success_call_cnt","50f21d3d",1357785277207001,"d"],
>> ["req_duration_us","50f21d3d",1357785277207001,"d"],
>> ["req_finish_time_us","50f21d3d",1357785277207001,"d"],
>> ["req_method","50f21d3d",1357785277207001,"d"],
>> ["req_service","50f21d3d",1357785277207001,"d"],
>> ["req_start_time_us","50f21d3d",1357785277207001,"d"],
>> ["success","50f21d3d",1357785277207001,"d"]]
>> }
>>
>>
>> Decoding the column timestamps to shows that the columns were written at
>> "Thu, 10 Jan 2013 02:34:37 GMT" and that their TTL expired at "Sun, 13 Jan
>> 2013 02:34:37 GMT".  The date of the SSTable shows that it was generated on
>> Jan 16 which is 3 days after all columns have TTL-ed out.
>>
>>
>> The schema shows that gc_grace is set to 0 since this data is write-once,
>> read-seldom and is never updated or deleted.
>>
>> create column family request_summary
>>   with column_type = 'Standard'
>>   and comparator = 'UTF8Type'
>>   and default_validation_class = 'UTF8Type'
>>   and key_validation_class = 'UTF8Type'
>>   and read_repair_chance = 0.1
>>   and dclocal_read_repair_chance = 0.0
>>   and gc_grace = 0
>>   and min_compaction_threshold = 4
>>   and max_compaction_threshold = 32
>>   and replicate_on_write = true
>>   and compaction_strategy =
>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
>>   and caching = 'NONE'
>>   and bloom_filter_fp_chance = 1.0
>>   and compression_options = {'chunk_length_kb' : '64',
>> 'sstable_compression' :
>> 'org.apache.cassandra.io.compress.SnappyCompressor'};
>>
>>
>> Thanks in advance for help in understanding why rows such as this are not
>> removed!
>>
>> -Bryan
>>
>>
>


Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot
 you set the gc_grace_seconds to 0.
> If not they do not get a chance to delete previous versions of the column
> which already exist on disk. So when the compaction ran your ExpiringColumn
> was turned into a DeletedColumn and placed on disk. 
>
> I would expect the next round of compaction to remove these columns.
>
> There is a new feature in 1.2 that may help you here. It will do a special
> compaction of individual sstables when they have a certain proportion of
> dead columns https://issues.apache.org/jira/browse/CASSANDRA-3442
>
> Also interested to know if LCS helps.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/01/2013, at 2:55 PM, Bryan Talbot  wrote:
>
> According to the timestamps (see original post) the SSTable was written
> (thus compacted compacted) 3 days after all columns for that row had
> expired and 6 days after the row was created; yet all columns are still
> showing up in the SSTable.  Note that the column shows now rows when a
> "get" for that key is run so that's working correctly, but the data is
> lugged around far longer than it should be -- maybe forever.
>
> -Bryan
>
> On Wed, Jan 16, 2013 at 5:44 PM, Andrey Ilinykh 
> wrote:
>
> To get column removed you have to meet two requirements
>
> 1. column should be expired
>
> 2. after that CF gets compacted
>
> I guess your expired columns are propagated to high tier CF, which gets
> compacted rarely.
>
> So, you have to wait when high tier CF gets compacted.
>
> Andrey
>
> On Wed, Jan 16, 2013 at 11:39 AM, Bryan Talbot 
> wrote:
>
> On cassandra 1.1.5 with a write heavy workload, we're having problems
> getting rows to be compacted away (removed) even though all columns have
> expired TTL.  We've tried size tiered and now leveled and are seeing the
> same symptom: the data stays around essentially forever.  
>
>
> Currently we write all columns with a TTL of 72 hours (259200 seconds) and
> expect to add 10 GB of data to this CF per day per node.  Each node
> currently has 73 GB for the affected CF and shows no indications that old
> rows will be removed on their own.
>
>
> Why aren't rows being removed?  Below is some data from a sample row which
> should have been removed several days ago but is still around even though
> it has been involved in numerous compactions since being expired.
>
>
> $> ./bin/nodetool -h localhost getsstables metrics request_summary
> 459fb460-5ace-11e2-9b92-11d67b6163b4
>
>
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
> 
>
>
> $> ls -alF
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
> 
>
> -rw-rw-r-- 1 sandra sandra 5252320 Jan 16 08:42
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
> 
>
>
> $> ./bin/sstable2json
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-386179-Data.db
> -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 "%x"')
> 
>
> {
>
> "34353966623436302d356163652d313165322d396239322d313164363762363136336234":
> [["app_name","50f21d3d",1357785277207001,"d"],
> ["client_ip","50f21d3d",1357785277207001,"d"],
> ["client_req_id","50f21d3d",1357785277207001,"d"],
> ["mysql_call_cnt","50f21d3d",1357785277207001,"d"],
> ["mysql_duration_us","50f21d3d",1357785277207001,"d"],
> ["mysql_failure_call_cnt","50f21d3d",1357785277207001,"d"],
> ["mysql_success_call_cnt","50f21d3d",1357785277207001,"d"],
> ["req_duration_us","50f21d3d",1357785277207001,"d"],
> ["req_finish_time_us","50f21d3d",1357785277207001,"d"],
> ["req_method","50f21d3d",1357785277207001,"d"],
> ["req_service","50f21d3d",1357785277207001,"d"],
> ["req_start_time_us","50f21d3d",1357785277207001,"d"],
> ["success","50f21d3d",1357785277207001,"d"]]
>
> }
>
>
>
> Decoding the column timestamps to shows that the columns were written at
> "Thu, 10 Jan 2013 02:34:37 GMT" and that their TTL expired at "Sun, 13 Jan
> 2013 02:34:37 GMT".  The date of the SSTable shows that it was generated on
> Jan 16 which is 3 days after all columns have TTL-ed out.
>
>
>
> The schema shows that gc_grace is set to 0 since this data is write-once,
> read-seldom and is never updated or deleted.
>
>
> create column family request_summary
>
>   with column_type = 'Standard'
>
>   and comparator = 'UTF8Type'
>
>   and default_validation_class = 'UTF8Type'
>
>   and key_validation_class = 'UTF8Type'
>
>   and read_repair_chance = 0.1
>
>   and dclocal_read_repair_chance = 0.0
>
>   and gc_grace = 0
>
>   and min_compaction_threshold = 4
>
>   and max_compaction_threshold = 32
>
>   and replicate_on_write = true
>
>   and compaction_strategy =
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
>
>   and caching = 'NONE'
>
>   and bloom_filter_fp_chance = 1.0
>
>   and compression_options = {'chunk_length_kb' : '64',
> 'sstable_compression' :
> 'org.apache.cassandra.io.compress.SnappyCompressor'};
>
>
> Thanks in advance for help in understanding why rows such as this are not
> removed!
>
> -Bryan
>

Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot
I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7, 1.1.8,
a trivial schema, and a simple script that just inserts rows.  If the TTL
is small enough that all LCS data fits in generation 0 then the rows
seem to be removed as the TTL expires, as desired.  However, if the insertion
rate is high enough or the TTL long enough then the data keeps accumulating
for far longer than expected.

Using a 120 second TTL and a single threaded php insertion script, my MBP with
SSD retained almost all of the data.  120 seconds of writes should accumulate
5-10 MB of data, and I would expect TTL-expired rows to be removed eventually
so that the cassandra load levels off at some reasonable value near 10 MB.
After running for 2 hours with a cassandra load of ~550 MB I stopped the test.

The schema is

create keyspace test
  with placement_strategy = 'SimpleStrategy'
  and strategy_options = {replication_factor : 1}
  and durable_writes = true;

use test;

create column family test
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'TimeUUIDType'
  and compaction_strategy =
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and bloom_filter_fp_chance = 1.0
  and column_metadata = [
{column_name : 'a',
validation_class : LongType}];


and the insert script is

<?php

require_once('phpcassa/1.0.a.5/autoload.php');

use phpcassa\Connection\ConnectionPool;
use phpcassa\ColumnFamily;
use phpcassa\SystemManager;
use phpcassa\UUID;

// Connect to test keyspace and column family
$sys = new SystemManager('127.0.0.1');

// Start a connection pool, create our ColumnFamily instance
$pool = new ConnectionPool('test', array('127.0.0.1'));
$testCf = new ColumnFamily($pool, 'test');

// Insert records forever with a 120 second TTL
while( 1 ) {
  $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
}

// Close our connections
$pool->close();
$sys->close();

?>


-Bryan




On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot wrote:

> We are using LCS and the particular row I've referenced has been involved
> in several compactions after all columns have TTL expired.  The most recent
> one was again this morning and the row is still there -- TTL expired for
> several days now with gc_grace=0 and several compactions later ...
>
>
> $> ./bin/nodetool -h localhost getsstables metrics request_summary
> 459fb460-5ace-11e2-9b92-11d67b6163b4
>
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>
> $> ls -alF
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
> -rw-rw-r-- 1 sandra sandra 5246509 Jan 17 06:54
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>
>
> $> ./bin/sstable2json
> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
> -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 "%x"')
> {
> "34353966623436302d356163652d313165322d396239322d313164363762363136336234":
> [["app_name","50f21d3d",1357785277207001,"d"],
> ["client_ip","50f21d3d",1357785277207001,"d"],
> ["client_req_id","50f21d3d",1357785277207001,"d"],
> ["mysql_call_cnt","50f21d3d",1357785277207001,"d"],
> ["mysql_duration_us","50f21d3d",1357785277207001,"d"],
> ["mysql_failure_call_cnt","50f21d3d",1357785277207001,"d"],
> ["mysql_success_call_cnt","50f21d3d",1357785277207001,"d"],
> ["req_duration_us","50f21d3d",1357785277207001,"d"],
> ["req_finish_time_us","50f21d3d",1357785277207001,"d"],
> ["req_method","50f21d3d",1357785277207001,"d"],
> ["req_service","50f21d3d",1357785277207001,"d"],
> ["req_start_time_us","50f21d3d",1357785277207001,"d"],
> ["success","50f21d3d",1357785277207001,"d"]]
> }
>
>
> My experience with TTL columns so far has been pretty similar to Viktor's
> in that the only way to keep the row count under control is to force major
> compactions.  In real world use, STCS and LCS both leave TTL expired rows
> around forever as far as I can tell.  When testing with minimal data,
> removal of TTL expired rows seem to work as expected but in this case there
> seems to be some divergence from real life work and test samples.
>
> -Bryan
>
>
>
>
> On Thu, Jan 17, 2013 at 1:47 AM, Viktor Jevdokimov <
> viktor.jevdoki...@adform.com> wrote:
>
>>  @Bryan,
>>
>> To keep data size as low as possible with TTL columns we still use STCS
>> and nightly major compactions.
>>
>> Experience with LCS was not successful in our case, data size keeps too
>> high along with amount of compactions.
>>
>&

Re: LCS not removing rows with all TTL expired columns

2013-01-17 Thread Bryan Talbot
Bleh, I rushed out the email before some meetings and I messed something
up.  Working on reproducing now with better notes this time.

-Bryan



On Thu, Jan 17, 2013 at 4:45 PM, Derek Williams  wrote:

> When you ran this test, is that the exact schema you used? I'm not seeing
> where you are setting gc_grace to 0 (although I could just be blind, it
> happens).
>
>
> On Thu, Jan 17, 2013 at 5:01 PM, Bryan Talbot wrote:
>
>> I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7,
>> 1.1.8, a trivial schema, and a simple script that just inserts rows.  If
>> the TTL is small enough so that all LCS data fits in generation 0 then the
>> rows seem to be removed with TTL expires as desired.  However, if the
>> insertion rate is high enough or the TTL long enough then the data keep
>> accumulating for far longer than expected.
>>
>> Using 120 second TTL and a single threaded php insertion script my MBP
>> with SSD retained almost all of the data.  120 seconds should accumulate
>> 5-10 MB of data.  I would expect that TTL rows to be removed eventually and
>> for the cassandra load to level off at some reasonable value near 10 MB.
>>  After running for 2 hours and with a cassandra load of ~550 MB I stopped
>> the test.
>>
>> The schema is
>>
>> create keyspace test
>>   with placement_strategy = 'SimpleStrategy'
>>   and strategy_options = {replication_factor : 1}
>>   and durable_writes = true;
>>
>> use test;
>>
>> create column family test
>>   with column_type = 'Standard'
>>   and comparator = 'UTF8Type'
>>   and default_validation_class = 'UTF8Type'
>>   and key_validation_class = 'TimeUUIDType'
>>   and compaction_strategy =
>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
>>   and caching = 'NONE'
>>   and bloom_filter_fp_chance = 1.0
>>   and column_metadata = [
>> {column_name : 'a',
>> validation_class : LongType}];
>>
>>
>> and the insert script is
>>
>> >
>> require_once('phpcassa/1.0.a.5/autoload.php');
>>
>> use phpcassa\Connection\ConnectionPool;
>> use phpcassa\ColumnFamily;
>> use phpcassa\SystemManager;
>> use phpcassa\UUID;
>>
>> // Connect to test keyspace and column family
>> $sys = new SystemManager('127.0.0.1');
>>
>> // Start a connection pool, create our ColumnFamily instance
>> $pool = new ConnectionPool('test', array('127.0.0.1'));
>> $testCf = new ColumnFamily($pool, 'test');
>>
>> // Insert records
>> while( 1 ) {
>>   $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
>> }
>>
>> // Close our connections
>> $pool->close();
>> $sys->close();
>>
>> ?>
>>
>>
>> -Bryan
>>
>>
>>
>>
>> On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot wrote:
>>
>>> We are using LCS and the particular row I've referenced has been
>>> involved in several compactions after all columns have TTL expired.  The
>>> most recent one was again this morning and the row is still there -- TTL
>>> expired for several days now with gc_grace=0 and several compactions later
>>> ...
>>>
>>>
>>> $> ./bin/nodetool -h localhost getsstables metrics request_summary
>>> 459fb460-5ace-11e2-9b92-11d67b6163b4
>>>
>>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>>>
>>> $> ls -alF
>>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>>> -rw-rw-r-- 1 sandra sandra 5246509 Jan 17 06:54
>>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>>>
>>>
>>> $> ./bin/sstable2json
>>> /virtual/cassandra/data/data/metrics/request_summary/metrics-request_summary-he-448955-Data.db
>>> -k $(echo -n 459fb460-5ace-11e2-9b92-11d67b6163b4 | hexdump  -e '36/1 "%x"')
>>>  {
>>> "34353966623436302d356163652d313165322d396239322d313164363762363136336234":
>>> [["app_name","50f21d3d",1357785277207001,"d"],
>>> ["client_ip","50f21d3d",1357785277207001,"d"],
>>> ["client_req_id","50f21d3d",1357785277207001,"d"],
>>> ["mysql_call_cnt","50f21d3d",1357785277207001,"d"]

Re: LCS not removing rows with all TTL expired columns

2013-01-22 Thread Bryan Talbot
It turns out that having gc_grace=0 isn't required to produce the problem.
 My colleague did a lot of digging into the compaction code and we think
he's found the issue.  It's detailed in
https://issues.apache.org/jira/browse/CASSANDRA-5182

Basically, tombstones for a row will not be removed from an SSTable during
compaction if the row appears in other SSTables; the catch is that the
compaction code checks the bloom filters to make this determination.  Since
this data is rarely read we had bloom_filter_fp_chance set to 1.0, which
makes every row seem to appear in every SSTable as far as compaction is
concerned.

This caused our data to essentially never be removed when using either STCS
or LCS, and will probably affect anyone else running 1.1 with a high bloom
filter fp chance.

Setting our fp ratio to 0.1, running upgradesstables and running the
application as it was before seems to have stabilized the load as desired
at the expense of additional jvm memory.
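
For anyone hitting the same thing, the change amounted to roughly the
following (names are from our schema above -- adjust for your own keyspace
and CF; note the new fp chance only applies to sstables written after the
change, hence the upgradesstables):

$> ./bin/cassandra-cli -h localhost
[default@unknown] use metrics;
[default@metrics] update column family request_summary with bloom_filter_fp_chance = 0.1;
[default@metrics] exit;

$> ./bin/nodetool -h localhost upgradesstables metrics request_summary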

-Bryan


On Thu, Jan 17, 2013 at 6:50 PM, Bryan Talbot wrote:

> Bleh, I rushed out the email before some meetings and I messed something
> up.  Working on reproducing now with better notes this time.
>
> -Bryan
>
>
>
> On Thu, Jan 17, 2013 at 4:45 PM, Derek Williams  wrote:
>
>> When you ran this test, is that the exact schema you used? I'm not seeing
>> where you are setting gc_grace to 0 (although I could just be blind, it
>> happens).
>>
>>
>> On Thu, Jan 17, 2013 at 5:01 PM, Bryan Talbot wrote:
>>
>>> I'm able to reproduce this behavior on my laptop using 1.1.5, 1.1.7,
>>> 1.1.8, a trivial schema, and a simple script that just inserts rows.  If
>>> the TTL is small enough so that all LCS data fits in generation 0 then the
>>> rows seem to be removed with TTL expires as desired.  However, if the
>>> insertion rate is high enough or the TTL long enough then the data keep
>>> accumulating for far longer than expected.
>>>
>>> Using 120 second TTL and a single threaded php insertion script my MBP
>>> with SSD retained almost all of the data.  120 seconds should accumulate
>>> 5-10 MB of data.  I would expect that TTL rows to be removed eventually and
>>> for the cassandra load to level off at some reasonable value near 10 MB.
>>>  After running for 2 hours and with a cassandra load of ~550 MB I stopped
>>> the test.
>>>
>>> The schema is
>>>
>>> create keyspace test
>>>   with placement_strategy = 'SimpleStrategy'
>>>   and strategy_options = {replication_factor : 1}
>>>   and durable_writes = true;
>>>
>>> use test;
>>>
>>> create column family test
>>>   with column_type = 'Standard'
>>>   and comparator = 'UTF8Type'
>>>   and default_validation_class = 'UTF8Type'
>>>   and key_validation_class = 'TimeUUIDType'
>>>   and compaction_strategy =
>>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
>>>   and caching = 'NONE'
>>>   and bloom_filter_fp_chance = 1.0
>>>   and column_metadata = [
>>> {column_name : 'a',
>>> validation_class : LongType}];
>>>
>>>
>>> and the insert script is
>>>
>>> >>
>>> require_once('phpcassa/1.0.a.5/autoload.php');
>>>
>>> use phpcassa\Connection\ConnectionPool;
>>> use phpcassa\ColumnFamily;
>>> use phpcassa\SystemManager;
>>> use phpcassa\UUID;
>>>
>>> // Connect to test keyspace and column family
>>> $sys = new SystemManager('127.0.0.1');
>>>
>>> // Start a connection pool, create our ColumnFamily instance
>>> $pool = new ConnectionPool('test', array('127.0.0.1'));
>>> $testCf = new ColumnFamily($pool, 'test');
>>>
>>> // Insert records
>>> while( 1 ) {
>>>   $testCf->insert(UUID::uuid1(), array("a" => 1), null, 120);
>>> }
>>>
>>> // Close our connections
>>> $pool->close();
>>> $sys->close();
>>>
>>> ?>
>>>
>>>
>>> -Bryan
>>>
>>>
>>>
>>>
>>> On Thu, Jan 17, 2013 at 10:11 AM, Bryan Talbot 
>>> wrote:
>>>
>>>> We are using LCS and the particular row I've referenced has been
>>>> involved in several compactions after all columns have TTL expired.  The
>>>> most recent one was again this morning and the row is still there -- TTL
>>>> expire

Re: too many warnings of Heap is full

2013-01-30 Thread Bryan Talbot
My guess is that those one or two nodes with the gc pressure also have more
rows in your big CF.  More rows could be due to an imbalanced distribution if
you're not using a random partitioner, or from those nodes not yet having
removed deleted rows which other nodes may have done.

JVM heap space is used for a few things which scale with key count
including:
- bloom filter (for C* < 1.2)
- index samples

Other space is used but can be more easily controlled by tuning for
- memtable
- compaction
- key cache
- row cache


So, if those nodes have more rows (check using "nodetool ring" or "nodetool
cfstats") than the others you can try to:
- reduce the number of rows by adding nodes, run manual / tune compactions
to remove rows with expired tombstones, etc.
- increase bloom filter fp chance
- increase jvm heap size (don't go too big)
- disable key or row cache
- increase index sample interval

Not all of those things are generally good, especially taken to the extreme,
so don't go setting a 20 GB jvm heap without understanding the consequences,
for example.
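
To make a couple of those concrete, the knobs look roughly like this (the
values are illustrative only, not recommendations, and assume 1.1.x):

# cassandra.yaml -- a larger interval means fewer index samples held on the
# heap, at the cost of slightly slower key lookups (128 is the default)
index_interval: 512

# cassandra-cli -- a higher fp chance shrinks the bloom filters; it only
# takes effect for sstables written after the change
update column family users with bloom_filter_fp_chance = 0.1;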

-Bryan


On Wed, Jan 30, 2013 at 3:47 AM, Guillermo Barbero <
guillermo.barb...@spotbros.com> wrote:

> Hi,
>
>   I'm viewing a weird behaviour in my cassandra cluster. Most of the
> warning messages are due to Heap is % full. According to this link
> (
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassndra-1-0-6-GC-query-tt7323457.html
> )
> there are two ways to "reduce pressure":
> 1. Decrease the cache sizes
> 2. Increase the index interval size
>
> Most of the flushes are in two column families (users and messages), I
> guess that's because the most mutations are there.
>
> I still have not applied those changes to the production environment.
> Do you recommend any other meassure? Should I set specific tunning for
> these two CFs? Should I check another metric?
>
> Additionally, the distribution of warning messages is not uniform
> along the cluster. Why could cassandra be doing this? What should I do
> to find out how to fix this?
>
> cassandra runs on a 6 node cluster of m1.xlarge machines (Amazon EC2)
> the java version is the following:
> java version "1.6.0_37"
> Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
>
> The cassandra system.log is resumed here (numer of messages, cassandra
> node, class that reports the message, first word of the message)
> 2013-01-26
>   5 cassNode0: GCInspector.java Heap
>   5 cassNode0: StorageService.java Flushing
> 232 cassNode2: GCInspector.java Heap
> 232 cassNode2: StorageService.java Flushing
> 104 cassNode3: GCInspector.java Heap
> 104 cassNode3: StorageService.java Flushing
>   3 cassNode4: GCInspector.java Heap
>   3 cassNode4: StorageService.java Flushing
>   3 cassNode5: GCInspector.java Heap
>   3 cassNode5: StorageService.java Flushing
>
> 2013-01-27
>   2 cassNode0: GCInspector.java Heap
>   2 cassNode0: StorageService.java Flushing
>   3 cassNode1: GCInspector.java Heap
>   3 cassNode1: StorageService.java Flushing
> 189 cassNode2: GCInspector.java Heap
> 189 cassNode2: StorageService.java Flushing
> 104 cassNode3: GCInspector.java Heap
> 104 cassNode3: StorageService.java Flushing
>   1 cassNode4: GCInspector.java Heap
>   1 cassNode4: StorageService.java Flushing
>   1 cassNode5: GCInspector.java Heap
>   1 cassNode5: StorageService.java Flushing
>
> 2013-01-28
>   2 cassNode0: GCInspector.java Heap
>   2 cassNode0: StorageService.java Flushing
>   1 cassNode1: GCInspector.java Heap
>   1 cassNode1: StorageService.java Flushing
>   1 cassNode2: AutoSavingCache.java Reducing
> 343 cassNode2: GCInspector.java Heap
> 342 cassNode2: StorageService.java Flushing
> 181 cassNode3: GCInspector.java Heap
> 181 cassNode3: StorageService.java Flushing
>   4 cassNode4: GCInspector.java Heap
>   4 cassNode4: StorageService.java Flushing
>   3 cassNode5: GCInspector.java Heap
>   3 cassNode5: StorageService.java Flushing
>
> 2013-01-29
>   2 cassNode0: GCInspector.java Heap
>   2 cassNode0: StorageService.java Flushing
>   3 cassNode1: GCInspector.java Heap
>   3 cassNode1: StorageService.java Flushing
> 156 cassNode2: GCInspector.java Heap
> 156 cassNode2: StorageService.java Flushing
>  71 cassNode3: GCInspector.java Heap
>  71 cassNode3: StorageService.java Flushing
>   2 cassNode4: GCInspector.java Heap
>   2 cassNode4: StorageService.java Flushing
>   2 cassNode5: GCInspector.java Heap
>   1 cassNode5: Memtable.java setting
>   2 cassNode5: StorageService.java Flushing
>
> --
>
> Guillermo Barbero - Backend Team
>
> Spotbros Technologies
>


Re: too many warnings of Heap is full

2013-01-30 Thread Bryan Talbot
On Wed, Jan 30, 2013 at 2:44 PM, Guillermo Barbero <
guillermo.barb...@spotbros.com> wrote:

>  WARN [MemoryMeter:1] 2013-01-30 21:37:48,079 Memtable.java (line 202)
> setting live ratio to maximum of 64.0 instead of 751.6512549537648
>
>
This looks interesting.  Doesn't this mean that the ratio of in-memory to
serialized size for that CF is 751:1, but it was forced down to a lower
"sane" value?

-Bryan


Re: Upgrade from 0.6.x to 1.2.x

2013-02-07 Thread Bryan Talbot
Wow, that's pretty ambitious, expecting an upgrade which skips 4 major
versions (0.7, 0.8, 1.0, 1.1) to work.

I think you're going to have to follow the upgrade path for each of those
intermediate steps and not upgrade in one big jump.
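
As a very rough sketch of what each hop looks like (details differ per
release -- always read NEWS.txt for the version you're moving to, and note
that the oldest releases use "nodetool scrub" where newer ones have
upgradesstables):

$> ./bin/nodetool -h localhost drain
#  stop cassandra, install the next major version, carry the config forward
#  start cassandra, then rewrite the data files in the new on-disk format:
$> ./bin/nodetool -h localhost upgradesstables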

-Bryan



On Thu, Feb 7, 2013 at 3:41 AM, Sergey Leschenko wrote:

> Hi, all
>
> I'm trying to update our old version 0.6.5 to current 1.2.1
> All nodes has been drained and stopped. Proper cassandra.yaml created,
> schema file prepared.
>
> Trying to start version 1.2.1 on the one node  (full output attached to
> email):
> ...
> ERROR 11:12:44,530 Exception encountered during startup
> java.lang.NullPointerException
> at
> org.apache.cassandra.db.SystemTable.upgradeSystemData(SystemTable.java:161)
> at
> org.apache.cassandra.db.SystemTable.finishStartup(SystemTable.java:107)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:276)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
> java.lang.NullPointerException
> at
> org.apache.cassandra.db.SystemTable.upgradeSystemData(SystemTable.java:161)
> at
> org.apache.cassandra.db.SystemTable.finishStartup(SystemTable.java:107)
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:276)
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:370)
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:413)
> Exception encountered during startup: null
>
> On the next attempts daemon started, but still with AssertionErrors
>
> Question 1 - is it possible start the new version from the first attempt?
>
>
> Then I loaded schema via cassandra-cli, and run nodetool scrub - which
> caused a big number of  warnings in log:
>OutputHandler.java (line 52) Index file contained a different key
> or row size; using key from data file
>
> storage-conf.xml from 0.6.5 has column family defined as
>
> for 1.2.1 I used
>   create column family Invoices with column_type = 'Standard' and
> comparator = 'BytesType';
>
> Question 2 - how to get rid of these warnings? Are they connected to
> column family definition?
>
> Thanks
>
> --
> Sergey
>


Re: Cluster not accepting insert while one node is down

2013-02-14 Thread Bryan Talbot
Generally data isn't written to whatever node the client connects to.  In
your case, a row is written to one of the nodes based on the hash of the
row key.  If that one replica node is down, it won't matter which
coordinator node you send the write to: with CL.ONE the write will fail.

If you want the write to succeed, you could do any one of: write with
CL.ANY, increase RF to 2+, write using a row key that hashes to an UP node.
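
For example, raising the RF (keyspace name taken from the quoted thread
below; this is only a sketch) would look something like the following, then
a repair on each node so the new replicas actually receive the existing data:

[default@unknown] use TestSpace;
[default@TestSpace] update keyspace TestSpace with strategy_options = {datacenter1 : 3};

$> ./bin/nodetool -h localhost repair TestSpace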

-Bryan



On Thu, Feb 14, 2013 at 2:06 AM, Alain RODRIGUEZ  wrote:

> I will let commiters or anyone that has knowledge on Cassandra internal
> answer this.
>
> From what I understand, you should be able to insert data on any up node
> with your configuration...
>
> Alain
>
>
> 2013/2/14 Traian Fratean 
>
>> You're right as regarding data availability on that node. And my config,
>> being the default one, is not suited for a cluster.
>> What I don't get is that my 67 node was down and I was trying to insert
>> in 66 node, as can be seen from the stacktrace. Long story short: when node
>> 67 was down I could not insert into any machine in the cluster. Not what I
>> was expecting.
>>
>> Thank you for the reply!
>> Traian.
>>
>> 2013/2/14 Alain RODRIGUEZ 
>>
>>> Hi Traian,
>>>
>>> There is your problem. You are using RF=1, meaning that each node is
>>> responsible for its range, and nothing more. So when a node goes down, do
>>> the math, you just can't read 1/5 of your data.
>>>
>>> This is very cool for performances since each node owns its own part of
>>> the data and any write or read need to reach only one node, but it removes
>>> the SPOF, which is a main point of using C*. So you have poor availability
>>> and poor consistency.
>>>
>>> An usual configuration with 5 nodes would be RF=3 and both CL (R&W) =
>>> QUORUM.
>>>
>>> This will replicate your data to 2 nodes + the natural endpoints (total
>>> of 3/5 nodes owning any data) and any read or write would need to reach at
>>> least 2 nodes before being considered as being successful ensuring a strong
>>> consistency.
>>>
>>> This configuration allow you to shut down a node (crash or configuration
>>> update/rolling restart) without degrading the service (at least allowing
>>> you to reach any data) but at cost of more data on each node.
>>>
>>> Alain
>>>
>>>
>>> 2013/2/14 Traian Fratean 
>>>
 I am using defaults for both RF and CL. As the keyspace was created
 using cassandra-cli the default RF should be 1 as I get it from below:

 [default@TestSpace] describe;
 Keyspace: TestSpace:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [datacenter1:1]

 As for the CL it the Astyanax default, which is 1 for both reads and
 writes.

 Traian.


 2013/2/13 Alain RODRIGUEZ 

> We probably need more info like the RF of your cluster and CL of your
> reads and writes. Maybe could you also tell us if you use vnodes or not.
>
> I heard that Astyanax was not running very smoothly on 1.2.0, but a
> bit better on 1.2.1. Yet, Netflix didn't release a version of Astyanax for
> C*1.2.
>
> Alain
>
>
> 2013/2/13 Traian Fratean 
>
>> Hi,
>>
>> I have a cluster of 5 nodes running Cassandra 1.2.0 . I have a Java
>> client with Astyanax 1.56.21.
>> When a node(10.60.15.67 - *diiferent* from the one in the stacktrace
>> below) went down I get TokenRandeOfflineException and no other data gets
>> inserted into *any other* node from the cluster.
>>
>> Am I having a configuration issue or this is supposed to happen?
>>
>>
>> com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor.trackError(CountingConnectionPoolMonitor.java:81)
>> -
>> com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException:
>> TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160,
>> latency=2057(2057), attempts=1]UnavailableException()
>> com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException:
>> TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160,
>> latency=2057(2057), attempts=1]UnavailableException()
>> at
>> com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165)
>>  at
>> com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
>> at
>> com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:27)
>>  at
>> com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$1.execute(ThriftSyncConnectionFactoryImpl.java:140)
>> at
>> com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69)
>>  at
>> com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:255)
>>
>

Re: heap usage

2013-02-15 Thread Bryan Talbot
Aren't bloom filters kept off heap in 1.2?
https://issues.apache.org/jira/browse/CASSANDRA-4865

Disabling bloom filters also disables tombstone removal as well, so don't
disable them if you delete anything.

https://issues.apache.org/jira/browse/CASSANDRA-5182

I believe that the index samples (by default every 128th entry) are still
kept in memory, so your JVM memory will scale with the number of rows
stored.  Additional memory is used for every keyspace and CF too, so if you
have thousands of CFs that could be an issue.
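
As a rough back-of-envelope (the per-sample cost is an assumption, not a
measured number): 1 billion rows on a node at the default interval of 128 is
about 7.8 million index samples; at somewhere around 50-100 bytes of heap
apiece that's on the order of 400-800 MB of a 4G heap before memtables,
caches and compaction get any.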

-Bryan



On Fri, Feb 15, 2013 at 8:16 AM, Edward Capriolo wrote:

> It is not going to be true for long that LCS does not require bloom
> filters.
>
> https://issues.apache.org/jira/browse/CASSANDRA-5029
>
> Apparently, without bloom filters there are issues.
>
> On Fri, Feb 15, 2013 at 7:29 AM, Blake Manders 
> wrote:
> >
> > You probably want to look at your bloom filters.  Be forewarned though,
> > they're difficult to change; changes to bloom filter settings only apply
> to
> > new SSTables, so they might not be noticeable until a few compactions
> have
> > taken place.
> >
> > If that is your issue, and your usage model fits it, a good alternative
> to
> > the slow propagation of higher miss rates is to switch to LCS (which
> doesn't
> > use bloom filters), which won't require you to make the jump to 1.2.
> >
> >
> > On Fri, Feb 15, 2013 at 4:06 AM, Reik Schatz 
> wrote:
> >>
> >> Hi,
> >>
> >> recently we are hitting some OOM: Java heap space, so I was
> investigating
> >> how the heap is used in Cassandra 1.2+
> >>
> >> We use the calculated 4G heap. Our cluster is 6 nodes, around 750 GB
> data
> >> and a replication factor of 3. Row cache is disabled. All key cache and
> >> memtable settings are left at default.
> >>
> >> Is the primary key index kept in heap memory? We have a bunch of
> keyspaces
> >> and column families.
> >>
> >> Thanks,
> >> Rik
> >
> >
> >
> >
> > --
> >
> > Blake Manders | CTO
> >
> > Cross Pixel, Inc. | 494 8th Ave, Penthouse | NYC 10001
> >
> > Website: crosspixel.net
> > Twitter: twitter.com/CrossPix
>


Re: Deletion consistency

2013-02-15 Thread Bryan Talbot
With a RF and CL of one, there is no replication so there can be no issue
with distributed deletes.  Writes (and reads) can only go to the one host
that has the data and will be refused if that node is down.  I'd guess that
your app isn't deleting records when you think that it is, or that the
delete is failing but not being detected as failed.

-Bryan



On Fri, Feb 15, 2013 at 10:21 AM, Mike  wrote:

> If you increase the number of nodes to 3, with an RF of 3, then you should
> be able to read/delete utilizing a quorum consistency level, which I
> believe will help here.  Also, make sure the time of your servers are in
> sync, utilizing NTP, as drifting time between you client and server could
> cause updates to be mistakenly dropped for being old.
>
> Also, make sure you are running with a gc_grace period that is high
> enough.  The default is 10 days.
>
> Hope this helps,
> -Mike
>
>
> On 2/15/2013 1:13 PM, Víctor Hugo Oliveira Molinar wrote:
>
>> hello everyone!
>>
>> I have a column family filled with event objects which need to be
>> processed by query threads.
>> Once each thread query for those objects(spread among columns bellow a
>> row), it performs a delete operation for each object in cassandra.
>> It's done in order to ensure that these events wont be processed again.
>> Some tests has showed me that it works, but sometimes i'm not getting
>> those events deleted. I checked it through cassandra-cli,etc.
>>
>> So, reading it 
>> (http://wiki.apache.org/**cassandra/DistributedDeletes)
>> I came to a conclusion that I may be reading old data.
>> My cluster is currently configured as: 2 nodes, RF1, CL 1.
>> In that case, what should I do?
>>
>> - Increase the consistency level for the write operations( in that case,
>> the deletions ). In order to ensure that those deletions are stored in all
>> nodes.
>> or
>> - Increase the consistency level for the read operations. In order to
>> ensure that I'm reading only those yet processed events(deleted).
>>
>> ?
>>
>> -
>> Thanks in advance
>>
>>
>>
>


Re: cassandra vs. mongodb quick question(good additional info)

2013-02-20 Thread Bryan Talbot
This calculation is incorrect btw.  10,000 GB transferred at 1.25 GB / sec
would complete in about 8,000 seconds which is just 2.2 hours and not 5.5
days.  The error is in the conversion (1hr/60secs) which is off by 2 orders
of magnitude since (1hr/3600secs) is the correct conversion.
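
Spelled out: 10,000 GB / 1.25 GB/s = 8,000 s, and 8,000 s x (1 hr / 3,600 s)
is about 2.2 hr -- or roughly 4.4 hours if only half of the 10 gigabit link
is available for the transfer.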

-Bryan


On Mon, Feb 18, 2013 at 5:00 PM, Hiller, Dean  wrote:

> Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second  (yes I
> could have divided by 8 in my head but eh…course when I saw the number, I
> went duh)
>
> So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we
> are bringing online to replace a dead node would take approximately 5
> days???
>
> This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1
> second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more
> likely 11 days if we only use 50% of the network.
>
> So bringing a new node up to speed is more like 11 days once it is
> crashed.  I think this is the main reason the 1Terabyte exists to begin
> with, right?
>
>


Re: cassandra vs. mongodb quick question(good additional info)

2013-02-20 Thread Bryan Talbot
There seem to be some data structures in cassandra which scale with the
number of rows stored and consume in-jvm memory without bound (other than
number of rows).  Even with 1.2, I think that index samples are still kept
in-jvm so you may need to tune index_interval.  Unfortunately that is a
global value so it will affect all CF and not just the big ones that need
it to be different.

There may be other issues (like during compaction) but that one pops out.
 Prior to 1.2, bloom filters would be a big problem too.

-Bryan



On Wed, Feb 20, 2013 at 12:20 PM, Hiller, Dean  wrote:

> Heh, we just discovered that mistake a few minutes ago….thanks though.  I
> am now wondering and may run a test cluster with a separate 6 nodes and
> test how compaction is on very large data sets and such.  We have tons of
> research data that sits there so I am wondering if 20T / node is now
> feasible with cassandra(I mean if mongodb has a 42T which 10gen was telling
> my colleague, I would think we can with cassandra).
>
> Is there any reasons I should know up front that 20T per node won't work.
>  We have 20 disks per node and this definitely has a different profile than
> previous cassandra systems I have setup.  We don't need really any caching
> as disk access is typically fine on reads.
>
> Thanks,
> Dean
>
> From: Bryan Talbot mailto:btal...@aeriagames.com>>
> Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <
> user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
> Date: Wednesday, February 20, 2013 1:04 PM
> To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <
> user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
> Subject: Re: cassandra vs. mongodb quick question(good additional info)
>
> This calculation is incorrect btw.  10,000 GB transferred at 1.25 GB / sec
> would complete in about 8,000 seconds which is just 2.2 hours and not 5.5
> days.  The error is in the conversion (1hr/60secs) which is off by 2 orders
> of magnitude since (1hr/3600secs) is the correct conversion.
>
> -Bryan
>
>
> On Mon, Feb 18, 2013 at 5:00 PM, Hiller, Dean  <mailto:dean.hil...@nrel.gov>> wrote:
> Google "10 gigabit in gigabytes" gives me 1.25 gigabytes/second  (yes I
> could have divided by 8 in my head but eh…course when I saw the number, I
> went duh)
>
> So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we
> are bringing online to replace a dead node would take approximately 5
> days???
>
> This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1
> second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more
> likely 11 days if we only use 50% of the network.
>
> So bringing a new node up to speed is more like 11 days once it is
> crashed.  I think this is the main reason the 1Terabyte exists to begin
> with, right?
>
>


Re: disabling bloomfilter not working? or did I do this wrong?

2013-02-22 Thread Bryan Talbot
I see from your read and write count that your nreldata CF has nearly equal
number of reads as writes.  I would expect that disabling your bloom filter
is going to hurt your read performance quite a bit.

Also, beware that disabling your bloom filter may also cause tombstoned
rows to never be deleted, so if you delete all columns explicitly or use
TTL, your data may grow more than you expect.
https://issues.apache.org/jira/browse/CASSANDRA-5182
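
One quick way to confirm the change actually took effect once upgradesstables
has rewritten the SSTables; the data path and keyspace below are placeholders
for this cluster's layout:

$ ls -lh /var/lib/cassandra/data/<keyspace>/nreldata/*-Filter.db
$ nodetool cfstats | grep -A 20 'Column Family: nreldata' | grep 'Bloom Filter Space Used'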

-Bryan




On Fri, Feb 22, 2013 at 11:59 AM, Hiller, Dean  wrote:

> Thanks, but I found out it is still running.  It looks like I have about a
> 5 hour wait left for my upgradesstables(waited 4 hours already).  I will
> check the bloomfilter after that.
>
> Out of curiosity, if I had much wider rows (i.e. < 900k) per row, will
> compaction run faster (e.g. upgradesstables) at all or would it basically
> run at the same speed?
>
> I guess what I am wondering is 9 hours a normal compaction time for 130gb
> of data?
>
> Thanks,
> Dean
>
> From: aaron morton <aa...@thelastpickle.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, February 22, 2013 10:29 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: disabling bloomfilter not working? or did I do this wrong?
>
> Bloom Filter Space Used: 2318392048
> Just to be sane do a quick check of the -Filter.db files on disk for this
> CF.
> If they are very small try a restart on the node.
>
> Number of Keys (estimate): 1249133696
> Hey a billion rows on a node, what an age we live in :)
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/02/2013, at 4:35 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:
>
> So in the cli, I ran
>
> update column family nreldata with bloom_filter_fp_chance=1.0;
>
> Then I ran
>
> nodetool upgradesstables databus5 nreldata;
>
> But my bloom filter size is still around 2gig(and I want to free up this
> heap) According to nodetool cfstats command…
>
> Column Family: nreldata
> SSTable count: 10
> Space used (live): 96841497731
> Space used (total): 96841497731
> Number of Keys (estimate): 1249133696
> Memtable Columns Count: 7066
> Memtable Data Size: 4286174
> Memtable Switch Count: 924
> Read Count: 19087150
> Read Latency: 0.595 ms.
> Write Count: 21281994
> Write Latency: 0.013 ms.
> Pending Tasks: 0
> Bloom Filter False Postives: 974393
> Bloom Filter False Ratio: 0.8
> Bloom Filter Space Used: 2318392048
> Compacted row minimum size: 73
> Compacted row maximum size: 446
> Compacted row mean size: 143
>
>
>
>


Re: Reading old data problem

2013-02-28 Thread Bryan Talbot
On Thu, Feb 28, 2013 at 5:08 PM, Víctor Hugo Oliveira Molinar <
vhmoli...@gmail.com> wrote:

> Ok guys let me try to ask it in a different way:
>
> Will repair totally ensure a data synchronism among nodes?


If there are no writes happening on the cluster then yes.  Otherwise, the
answer is "it depends" since all the normal things that lead to
inconsistencies can still happen.




>
> Extra question:
> Once I write at CL=All, will C* ensure that I can read from ANY node
> without an inconsistency? The reverse state, writing at CL=One but reading
> at CL=All will also ensure that?
>
>

You can get consistent behavior if CL.read + CL.write > RF.  So since you
have just 2 nodes and RF=2, you'd need to have at least CL.read=2 and
CL.write=1 or CL.read=1 and CL.write=2.
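
Spelled out for this two-node, RF=2 setup, as a sketch of the arithmetic only:

# R + W > RF guarantees the read set overlaps the latest write set
#   write ONE (1) + read TWO/ALL (2) -> 1 + 2 = 3 > 2   consistent reads
#   write TWO/ALL (2) + read ONE (1) -> 2 + 1 = 3 > 2   consistent reads
#   write ONE (1) + read ONE (1)     -> 1 + 1 = 2       stale reads possible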

-Bryan



>
>
> On Wed, Feb 27, 2013 at 11:24 PM, Víctor Hugo Oliveira Molinar <
> vhmoli...@gmail.com> wrote:
>
>> Hello, I need some help to manage my live cluster!
>>
>> I'm  currently running a cluster with 2 nodes, RF:2, CL:1.
>> Since I'm limited to hardware upgrade issues, I'm not able to increase my
>> ConsitencyLevel for now.
>>
>> Anyway, I ran a full repair on each node of the cluster followed by a
>> flush. Although I'm still reading old data when performing queries.
>>
>> Well it's known that I might read old data during normal operations, but
>> shouldn't it be in sync after the full anti-entropy repair?
>> What am I missing?
>>
>> Thanks in advance!
>>
>
>


Re: old data / tombstones are not deleted after ttl

2013-03-04 Thread Bryan Talbot
Those older files won't be included in a compaction until there are
min_compaction_threshold (4) files of that size.  When you get another SS
table -Data.db file that is about 12-18GB then you'll have 4 and they will
be compacted together into one new file.  At that time, if there are any
rows with only tombstones that are all older than gc_grace the row will be
removed (assuming the row exists exclusively in the 4 input SS tables).
 Columns with data that is more than TTL seconds old will be written with a
tombstone.  If the row does have column values in SS tables that are not
being compacted, the row will not be removed.
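
A hedged sketch of the related knobs; getcompactionthreshold shows the current
min/max for a CF, and a manual major compaction is the blunt way to force the
merge described above (with the usual downside of producing one very large
SSTable):

$ nodetool getcompactionthreshold <keyspace> <column_family>
$ nodetool compact <keyspace> <column_family>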


-Bryan


On Sun, Mar 3, 2013 at 11:07 PM, Matthias Zeilinger <
matthias.zeilin...@bwinparty.com> wrote:

>  Hi,
>
> I´m running Cassandra 1.1.5 and have following issue.
>
> I´m using a 10 days TTL on my CF. I can see a lot of tombstones in there,
> but they aren´t deleted after compaction.
>
> I have tried a nodetool –cleanup and also a restart of Cassandra, but
> nothing happened.
>
> total 61G
> drwxr-xr-x  2 cassandra dba  20K Mar  4 06:35 .
> drwxr-xr-x 10 cassandra dba 4.0K Dec 10 13:05 ..
> -rw-r--r--  1 cassandra dba  15M Dec 15 22:04 whatever-he-1398-CompressionInfo.db
> -rw-r--r--  1 cassandra dba  19G Dec 15 22:04 whatever-he-1398-Data.db
> -rw-r--r--  1 cassandra dba  15M Dec 15 22:04 whatever-he-1398-Filter.db
> -rw-r--r--  1 cassandra dba 357M Dec 15 22:04 whatever-he-1398-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Dec 15 22:04 whatever-he-1398-Statistics.db
> -rw-r--r--  1 cassandra dba 9.5M Feb  6 15:45 whatever-he-5464-CompressionInfo.db
> -rw-r--r--  1 cassandra dba  12G Feb  6 15:45 whatever-he-5464-Data.db
> -rw-r--r--  1 cassandra dba  48M Feb  6 15:45 whatever-he-5464-Filter.db
> -rw-r--r--  1 cassandra dba 736M Feb  6 15:45 whatever-he-5464-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Feb  6 15:45 whatever-he-5464-Statistics.db
> -rw-r--r--  1 cassandra dba 9.7M Feb 21 19:13 whatever-he-6829-CompressionInfo.db
> -rw-r--r--  1 cassandra dba  12G Feb 21 19:13 whatever-he-6829-Data.db
> -rw-r--r--  1 cassandra dba  47M Feb 21 19:13 whatever-he-6829-Filter.db
> -rw-r--r--  1 cassandra dba 792M Feb 21 19:13 whatever-he-6829-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Feb 21 19:13 whatever-he-6829-Statistics.db
> -rw-r--r--  1 cassandra dba 3.7M Mar  1 10:46 whatever-he-7578-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 4.3G Mar  1 10:46 whatever-he-7578-Data.db
> -rw-r--r--  1 cassandra dba  12M Mar  1 10:46 whatever-he-7578-Filter.db
> -rw-r--r--  1 cassandra dba 274M Mar  1 10:46 whatever-he-7578-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  1 10:46 whatever-he-7578-Statistics.db
> -rw-r--r--  1 cassandra dba 3.6M Mar  1 11:21 whatever-he-7582-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 4.3G Mar  1 11:21 whatever-he-7582-Data.db
> -rw-r--r--  1 cassandra dba 9.7M Mar  1 11:21 whatever-he-7582-Filter.db
> -rw-r--r--  1 cassandra dba 236M Mar  1 11:21 whatever-he-7582-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  1 11:21 whatever-he-7582-Statistics.db
> -rw-r--r--  1 cassandra dba 3.7M Mar  3 12:13 whatever-he-7869-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 4.3G Mar  3 12:13 whatever-he-7869-Data.db
> -rw-r--r--  1 cassandra dba 9.8M Mar  3 12:13 whatever-he-7869-Filter.db
> -rw-r--r--  1 cassandra dba 239M Mar  3 12:13 whatever-he-7869-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  3 12:13 whatever-he-7869-Statistics.db
> -rw-r--r--  1 cassandra dba 924K Mar  3 18:02 whatever-he-7953-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 1.1G Mar  3 18:02 whatever-he-7953-Data.db
> -rw-r--r--  1 cassandra dba 2.1M Mar  3 18:02 whatever-he-7953-Filter.db
> -rw-r--r--  1 cassandra dba  51M Mar  3 18:02 whatever-he-7953-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  3 18:02 whatever-he-7953-Statistics.db
> -rw-r--r--  1 cassandra dba 231K Mar  3 20:06 whatever-he-7974-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 268M Mar  3 20:06 whatever-he-7974-Data.db
> -rw-r--r--  1 cassandra dba 483K Mar  3 20:06 whatever-he-7974-Filter.db
> -rw-r--r--  1 cassandra dba  12M Mar  3 20:06 whatever-he-7974-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  3 20:06 whatever-he-7974-Statistics.db
> -rw-r--r--  1 cassandra dba 116K Mar  4 06:28 whatever-he-8002-CompressionInfo.db
> -rw-r--r--  1 cassandra dba 146M Mar  4 06:28 whatever-he-8002-Data.db
> -rw-r--r--  1 cassandra dba 646K Mar  4 06:28 whatever-he-8002-Filter.db
> -rw-r--r--  1 cassandra dba  16M Mar  4 06:28 whatever-he-8002-Index.db
> -rw-r--r--  1 cassandra dba 4.3K Mar  4 06:28 w

Re: Repair taking long time

2014-09-26 Thread Bryan Talbot
With a 4.5 TB table and just 4 nodes, repair will likely take forever for
any version.
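
A back-of-the-envelope check using the figures reported below, assuming the
validation keeps its current rate:

$ echo "scale=1; 4505336278756 / (71950433062 / 2.5)" | bc   # total hours at the observed rate
156.5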

-Bryan


On Fri, Sep 26, 2014 at 10:40 AM, Jonathan Haddad  wrote:

> Are you using Cassandra 2.0 & vnodes?  If so, repair takes forever.
> This problem is addressed in 2.1.
>
> On Fri, Sep 26, 2014 at 9:52 AM, Gene Robichaux
>  wrote:
> > I am fairly new to Cassandra. We have a 9 node cluster, 5 in one DC and
> 4 in
> > another.
> >
> >
> >
> > Running a repair on a large column family seems to be moving much slower
> > than I expect.
> >
> >
> >
> > Looking at nodetool compaction stats it indicates the Validation phase is
> > running that the total bytes is 4.5T (4505336278756).
> >
> >
> >
> > This is a very large CF. The process has been running for 2.5 hours and
> has
> > processed 71G (71950433062). That rate is about 28.4 GB per hour. At this
> > rate it will take 158 hours, just shy of 1 week.
> >
> >
> >
> > Is this reasonable? This is my first large repair and I am wondering if
> this
> > is normal for a CF of this size. Seems like a long time to me.
> >
> >
> >
> > Is it possible to tune this process to speed it up? Is there something
> in my
> > configuration that could be causing this slow performance? I am running
> > HDDs, not SSDs in a JBOD configuration.
> >
> >
> >
> >
> >
> >
> >
> > Gene Robichaux
> >
> > Manager, Database Operations
> >
> > Match.com
> >
> > 8300 Douglas Avenue I Suite 800 I Dallas, TX  75225
> >
> > Phone: 214-576-3273
> >
> >
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Re: OldGen saturation

2014-10-28 Thread Bryan Talbot
On Tue, Oct 28, 2014 at 9:02 AM, Adria Arcarons <
adria.arcar...@greenpowermonitor.com> wrote:

>  Hi,
>
> Hi



>
>
> We have about 50.000 CFs of varying size
>


>
>

>
> The writing test consists of a continuous flow of inserts. The inserts are
> done inside BATCH statements in groups of 1.000 to a single CF at a time to
> make them faster.
>



>
>
> The problem I’m experiencing is that, eventually, when the script has been
> running for almost 40mins, the heap gets saturated. OldGen gets full and
> then there is an intensive GC activity trying to free OldGen objects, but
> it can only free very little space in each pass. Then GC saturates the CPU.
> Here are the graphs obtained with VisualVM that show this behavior:
>
>
>
>
>
> My total heap size is 1GB and the the NewGen region of 256MB. The C* node
> has 4GB RAM. Intel Xeon CPU E5520 @
>


Without looking at your VM graphs, I'm going to go out on a limb here and
say that your host is woefully underpowered to host fifty-thousand column
families and batch writes of one-thousand statements.

A 1 GB java heap size is sometimes acceptable for a unit test or playing
around with but you can't actually expect it to be adequate for a load test
can you?
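
For comparison, a minimal sketch of where the heap is normally sized; the
values are illustrative only, and the common guidance in this era was roughly
one quarter of system RAM, capped at about 8 GB:

# conf/cassandra-env.sh
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="400M"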

Every CF consumes some permanent heap space for its metadata. Too many CF
are a bad thing. You probably have ten times more CF than would be
recommended as an upper limit.

-Bryan


Re: new data not flushed to sstables

2014-11-03 Thread Bryan Talbot
On Mon, Nov 3, 2014 at 7:44 AM, Sebastian Martinka <
sebastian.marti...@mercateo.com> wrote:

>  System and Keyspace Information:
>
> 4 Nodes
>
>

> CREATE KEYSPACE restore_test WITH replication = {  'class':
> 'SimpleStrategy',
>
>   'replication_factor': '3'};
>
>
>
>
>
> I assumed,  that a flush write all data in the sstables and we can use it
> for backup and restore. Did I forget something or is my understanding
> wrong?
>
>
>
I think you forgot that with N=4 and RF=3 that each node will contain
approximately 75% of the data. From a quick eyeball check of the json-dump
you provided, it looks like partition-key values are contained on 3 nodes
and are absent from 1 which is exactly as expected.
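
One hedged way to check that expectation for any given key; the table name and
key value here are placeholders:

$ nodetool getendpoints restore_test <table> <partition_key>
# with RF=3 and 4 nodes this should list 3 of the 4 nodes for every key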

-Bryan


Re: tuning concurrent_reads param

2014-11-06 Thread Bryan Talbot
On Wed, Nov 5, 2014 at 11:00 PM, Jimmy Lin  wrote:

> Sorry I have late follow up question 
>
> In the Cassandra.yaml file the concurrent_read section has the following
> comment:
>
> What does it mean by " the operations to enqueue low enough in the stack
> that the OS and drives can reorder them." ? how does it help making the
> system healthy?
>

The operating system, disk controllers, and disks themselves can merge and
reorder requests to optimize performance.

Here's a relevant page with some details if you're interested in more
http://www.makelinux.net/books/lkd2/ch13lev1sec5
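
For reference, a sketch of the setting itself; the shipped comment in this era
suggested roughly 16 * number_of_data_drives as a starting point, but verify
against the cassandra.yaml bundled with your version:

# cassandra.yaml
concurrent_reads: 32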



> What really happen if we increase it to a too high value? (maybe affecting
> other read or write operation as it eat up all disk IO resource?)
>


Yes

-Bryan


Re: using cssandra cql with php

2014-03-04 Thread Bryan Talbot
I think the options for using CQL from PHP pretty much don't exist. Those
that do are very old, haven't been updated in months, and don't support
newer CQL features. Also I don't think any of them use the binary protocol
but use thrift instead.

From what I can tell, you'll be stuck using old CQL features from
unmaintained client drivers -- probably better to not be using CQL and PHP
together since mixing them seems pretty bad right now.


-Bryan



On Sun, Jan 12, 2014 at 11:27 PM, Jason Wee  wrote:

> Hi,
>
> operating system should not be a matter right? You just need the cassandra
> client downloaded and use it to access cassandra node. PHP?
> http://wiki.apache.org/cassandra/ClientOptions perhaps you can package
> cassandra pdo driver into rpm?
>
> Jason
>
>
> On Mon, Jan 13, 2014 at 3:02 PM, Tim Dunphy  wrote:
>
>> Hey all,
>>
>>  I'd like to be able to make calls to the cassandra database using PHP.
>> I've taken a look around but I've only found solutions out there for Ubuntu
>> and other distros. But my environment is CentOS.  Are there any packages
>> out there I can install that would allow me to use CQL in my PHP code?
>>
>> Thanks
>> Tim
>>
>> --
>> GPG me!!
>>
>> gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
>>
>>
>


Re: Cassandra 2.0.7 always failes due to 'too may open files' error

2014-05-05 Thread Bryan Talbot
Running

#> cat /proc/$(cat /var/run/cassandra.pid)/limits

as root or your cassandra user will tell you what limits it's actually
running with.
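
If the effective limit turns out to be low, one common fix is a pam limits
entry for the user that runs Cassandra; the file name and values below are
illustrative and depend on your distro and packaging:

# /etc/security/limits.d/cassandra.conf
cassandra  -  nofile   100000
cassandra  -  memlock  unlimited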




On Sun, May 4, 2014 at 10:12 PM, Yatong Zhang  wrote:

> I am running 'repair' when the error occurred. And just a few days before
> I changed the compaction strategy to 'leveled'. don know if this helps
>
>
> On Mon, May 5, 2014 at 1:10 PM, Yatong Zhang  wrote:
>
>> Cassandra is running as root
>>
>> [root@storage5 ~]# ps aux | grep java
>>> root  1893 42.0 24.0 7630664 3904000 ? Sl   10:43  60:01 java
>>> -ea -javaagent:/mydb/cassandra/bin/../lib/jamm-0.2.5.jar
>>> -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42 -Xms3959M -Xmx3959M -Xmn400M
>>> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103
>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -Dlog4j.configuration=log4j-server.properties
>>> -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/cassandra.pid
>>> -cp
>>> /mydb/cassandra/bin/../conf:/mydb/cassandra/bin/../build/classes/main:/mydb/cassandra/bin/../build/classes/thrift:/mydb/cassandra/bin/../lib/antlr-3.2.jar:/mydb/cassandra/bin/../lib/apache-cassandra-2.0.7.jar:/mydb/cassandra/bin/../lib/apache-cassandra-clientutil-2.0.7.jar:/mydb/cassandra/bin/../lib/apache-cassandra-thrift-2.0.7.jar:/mydb/cassandra/bin/../lib/commons-cli-1.1.jar:/mydb/cassandra/bin/../lib/commons-codec-1.2.jar:/mydb/cassandra/bin/../lib/commons-lang3-3.1.jar:/mydb/cassandra/bin/../lib/compress-lzf-0.8.4.jar:/mydb/cassandra/bin/../lib/concurrentlinkedhashmap-lru-1.3.jar:/mydb/cassandra/bin/../lib/disruptor-3.0.1.jar:/mydb/cassandra/bin/../lib/guava-15.0.jar:/mydb/cassandra/bin/../lib/high-scale-lib-1.1.2.jar:/mydb/cassandra/bin/../lib/jackson-core-asl-1.9.2.jar:/mydb/cassandra/bin/../lib/jackson-mapper-asl-1.9.2.jar:/mydb/cassandra/bin/../lib/jamm-0.2.5.jar:/mydb/cassandra/bin/../lib/jbcrypt-0.3m.jar:/mydb/cassandra/bin/../lib/jline-1.0.jar:/mydb/cassandra/bin/../lib/json-simple-1.1.jar:/mydb/cassandra/bin/../lib/libthrift-0.9.1.jar:/mydb/cassandra/bin/../lib/log4j-1.2.16.jar:/mydb/cassandra/bin/../lib/lz4-1.2.0.jar:/mydb/cassandra/bin/../lib/metrics-core-2.2.0.jar:/mydb/cassandra/bin/../lib/netty-3.6.6.Final.jar:/mydb/cassandra/bin/../lib/reporter-config-2.1.0.jar:/mydb/cassandra/bin/../lib/servlet-api-2.5-20081211.jar:/mydb/cassandra/bin/../lib/slf4j-api-1.7.2.jar:/mydb/cassandra/bin/../lib/slf4j-log4j12-1.7.2.jar:/mydb/cassandra/bin/../lib/snakeyaml-1.11.jar:/mydb/cassandra/bin/../lib/snappy-java-1.0.5.jar:/mydb/cassandra/bin/../lib/snaptree-0.1.jar:/mydb/cassandra/bin/../lib/super-csv-2.1.0.jar:/mydb/cassandra/bin/../lib/thrift-server-0.3.3.jar
>>> org.apache.cassandra.service.CassandraDaemon
>>>
>>
>>
>>
>> On Mon, May 5, 2014 at 1:02 PM, Philip Persad wrote:
>>
>>> Have you tried running "ulimit -a" as the Cassandra user instead of as
>>> root? It is possible that your configured a high file limit for root but
>>> not for the user running the Cassandra process.
>>>
>>>
>>> On Sun, May 4, 2014 at 6:07 PM, Yatong Zhang wrote:
>>>
 [root@storage5 ~]# lsof -n | grep java | wc -l
> 5103
> [root@storage5 ~]# lsof | wc -l
> 6567


 It's mentioned in previous mail:)


 On Mon, May 5, 2014 at 9:03 AM, nash  wrote:

> The lsof command or /proc can tell you how many open files it has. How
> many is it?
>
> --nash
>


>>>
>>
>


Failed to mkdirs $HOME/.cassandra

2014-05-15 Thread Bryan Talbot
How should the nodetool command be run as the user "nobody"?

The nodetool command fails with an exception if it cannot create a
.cassandra directory in the current user's home directory.

I'd like to schedule some nodetool commands to run with least privilege as
cron jobs. I'd like to run them as the "nobody" user -- which typically has
"/" as the home directory -- since that's what the user is typically used
for (minimum privileges).

None of the methods described in this JIRA actually seem to work (with
2.0.7 anyway) https://issues.apache.org/jira/browse/CASSANDRA-6475

Testing as a normal user with no write permissions to the home directory
(to simulate the nobody user)

[vagrant@local-dev ~]$ nodetool version
ReleaseVersion: 2.0.7
[vagrant@local-dev ~]$ rm -rf .cassandra/
[vagrant@local-dev ~]$ chmod a-w .

[vagrant@local-dev ~]$ nodetool flush my_ks my_cf
Exception in thread "main" FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ HOME=/tmp nodetool flush my_ks my_cf
Exception in thread "main" FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ env HOME=/tmp nodetool flush my_ks my_cf
Exception in thread "main" FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ env user.home=/tmp nodetool flush my_ks my_cf
Exception in thread "main" FSWriteError in /home/vagrant/.cassandra
at
org.apache.cassandra.io.util.FileUtils.createDirectory(FileUtils.java:305)
at
org.apache.cassandra.utils.FBUtilities.getToolsOutputDirectory(FBUtilities.java:690)
at
org.apache.cassandra.tools.NodeCmd.printHistory(NodeCmd.java:1504)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1204)
Caused by: java.io.IOException: Failed to mkdirs /home/vagrant/.cassandra
... 4 more

[vagrant@local-dev ~]$ nodetool -Duser.home=/tmp flush my_ks my_cf
Unrecognized option: -Duser.home=/tmp
usage: java org.apache.cassandra.tools.NodeCmd --host  
...


Re: Query first 1 columns for each partitioning keys in CQL?

2014-05-19 Thread Bryan Talbot
I think there are several issues in your schema and queries.

First, the schema can't efficiently return the single newest post for every
author. It can efficiently return the newest N posts for a particular
author.

On Fri, May 16, 2014 at 11:53 PM, 後藤 泰陽  wrote:

>
> But I consider LIMIT to be a keyword that limits the number of results from the
> WHOLE result set retrieved by the SELECT statement.
>


This is happening due to the incorrect use of minTimeuuid() function. All
of your created_at values are equal so you're essentially getting 2 (order
not defined) values that have the lowest created_at value.

The minTimeuuid() function is meant to be used in the WHERE clause of a
SELECT statement, often together with maxTimeuuid(), to do BETWEEN-style queries
on timeuuid values.
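
A hedged sketch of that intended usage against the posts table defined below,
selecting one author's entries inside a time window:

cqlsh:blog_test> select * from posts where author = 'john'
    and created_at > minTimeuuid('2013-02-01 00:00+0000')
    and created_at < maxTimeuuid('2013-03-01 00:00+0000');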




> The result with SELECT .. LIMIT is below. Unfortunately, this is not what I
> wanted.
> I wanted the latest posts of each author. (Now I doubt whether CQL3 can
> represent it)
>
> cqlsh:blog_test> create table posts(
>  ... author ascii,
>  ... created_at timeuuid,
>  ... entry text,
>  ... primary key(author,created_at)
>  ... )WITH CLUSTERING ORDER BY (created_at DESC);
> cqlsh:blog_test>
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('john',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> john');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-02-02 10:00+'),'This is an old entry by
> mike');
> cqlsh:blog_test> insert into posts(author,created_at,entry) values
> ('mike',minTimeuuid('2013-03-03 10:00+'),'This is a new entry by
> mike');
> cqlsh:blog_test> select * from posts limit 2;
>
>  author | created_at   | entry
>
> +--+--
>mike | 1c4d9000-83e9-11e2-8080-808080808080 |  This is a new entry by
> mike
>mike | 4e52d000-6d1f-11e2-8080-808080808080 | This is an old entry by
> mike
>
>
>
>


To get most recent posts by a particular author, you'll need statements
more like this:

cqlsh:test> insert into posts(author,created_at,entry) values ('john',now(),'This is an old entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values ('john',now(),'This is a new entry by john');
cqlsh:test> insert into posts(author,created_at,entry) values ('mike',now(),'This is an old entry by mike');
cqlsh:test> insert into posts(author,created_at,entry) values ('mike',now(),'This is a new entry by mike');

and then you can get posts by 'john' ordered by newest to oldest as:

cqlsh:test> select author, created_at, dateOf(created_at), entry from posts
where author = 'john' limit 2 ;

 author | created_at   | dateOf(created_at)   |
entry
+--+--+--
   john | 7cb1ac30-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:36-0700 |
 This is a new entry by john
   john | 74bb6750-df85-11e3-bb46-4d2d68f17aa6 | 2014-05-19 11:43:23-0700 |
This is an old entry by john


-Bryan


Re: Best partition type for Cassandra with JBOD

2014-05-19 Thread Bryan Talbot
For XFS, using noatime and nodiratime isn't really useful either.

http://xfs.org/index.php/XFS_FAQ#Q:_Is_using_noatime_or.2Fand_nodiratime_at_mount_time_giving_any_performance_benefits_in_xfs_.28or_not_using_them_performance_decrease.29.3F




On Sat, May 17, 2014 at 7:52 AM, James Campbell <
ja...@breachintelligence.com> wrote:

>  Thanks for the thoughts!
> On May 16, 2014 4:23 PM, Ariel Weisberg  wrote:
>  Hi,
>
> Recommending nobarrier (mount option barrier=0) when you don't know if a
> non-volatile cache in play is probably not the way to go. A non-volatile
> cache will typically ignore write barriers if a given block device is
> configured to cache writes anyways.
>
> I am also skeptical you will see a boost in performance. Applications that
> want to defer and batch writes won't emit write barriers frequently and
> when they do it's because the data has to be there. Filesystems depend on
> write barriers although it is surprisingly hard to get a reordering that is
> really bad because of the way journals are managed.
>
> Cassandra uses log structured storage and supports asynchronous periodic
> group commit so it doesn't need to emit write barriers frequently.
>
> Setting read ahead to zero on an SSD is necessary to get the maximum
> number of random reads, but will also disable prefetching for sequential
> reads. You need a lot less prefetching with an SSD due to the much faster
> response time, but it's still many microseconds.
>
> Someone with more Cassandra specific knowledge can probably give better
> advice as to when a non-zero read ahead make sense with Cassandra. This is
> something may be workload specific as well.
>
> Regards,
>  Ariel
>
> On Fri, May 16, 2014, at 01:55 PM, Kevin Burton wrote:
>
> That and nobarrier… and probably noop for the scheduler if using SSD and
> setting readahead to zero...
>
>
>  On Fri, May 16, 2014 at 10:29 AM, James Campbell <
> ja...@breachintelligence.com> wrote:
>
>  Hi all—
>
>
>
> What partition type is best/most commonly used for a multi-disk JBOD setup
> running Cassandra on CentOS 64bit?
>
>
>
> The datastax production server guidelines recommend XFS for data
> partitions, saying, “Because Cassandra can use almost half your disk space
> for a single file, use XFS when using large disks, particularly if using a
> 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and
> essentially unlimited on 64-bit.”
>
>
>
> However, the same document also notes that “Maximum recommended capacity
> for Cassandra 1.2 and later is 3 to 5TB per node,” which makes me think
> >16TB file sizes would be irrelevant (especially when not using RAID to
> create a single large volume).  What has been the experience of this group?
>
>
>
> I also noted that the guidelines don’t mention setting noatime and
> nodiratime flags in the fstab for data volumes, but I wonder if that’s a
> common practice.
>
> James
>
>
>
>
> --
>
>
>  Founder/CEO Spinn3r.com
>  Location: *San Francisco, CA*
>  Skype: *burtonator*
>  blog: http://burtonator.wordpress.com
>  … or check out my Google+ 
> profile<https://plus.google.com/102718274791889610666/posts>
>  <http://spinn3r.com>
>  War is peace. Freedom is slavery. Ignorance is strength. Corporations
> are people.
>
>
>


-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: Index with same Name but different keyspace

2014-05-19 Thread Bryan Talbot
On Mon, May 19, 2014 at 6:39 AM, mahesh rajamani
wrote:

> Sorry I just realized the table name in 2 schema are slightly different,
> but still i am not sure why i should not use same index name across
> different schema. Below is the instruction to reproduce.
>
>
> Created 2 keyspace using cassandra-cli
>
>
> [default@unknown] create keyspace keyspace1 with placement_strategy =
> 'org.apache.cassandra.locator.SimpleStrategy' and
> strategy_options={replication_factor:1};
>
> [default@unknown] create keyspace keyspace2 with placement_strategy =
> 'org.apache.cassandra.locator.SimpleStrategy' and
> strategy_options={replication_factor:1};
>
>
> Create table index using cqlsh as below:
>
>
> cqlsh> use keyspace1;
>
> cqlsh:keyspace1> CREATE TABLE table1 (version text, flag boolean, primary
> key (version));
>
> cqlsh:keyspace1> create index sversionindex on table1(flag);
>
> cqlsh:keyspace1> use keyspace2;
>
> cqlsh:keyspace2> CREATE TABLE table2 (version text, flag boolean, primary
> key (version));
>
> cqlsh:keyspace2> create index sversionindex on table2(flag);
>
> *Bad Request: Duplicate index name sversionindex*
>
>

Since index name is optional in the create index statement, you could just
omit it and let the system give it a unique name for you.
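
For example, a minimal sketch of the unnamed form (the server then derives an
index name for you):

cqlsh:keyspace2> create index on table2(flag);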

-Bryan


Re: Timeseries data

2013-03-27 Thread Bryan Talbot
In the worst case, that is possible, but compaction strategies try to
minimize the number of SSTables that a row appears in, so a row being in ALL
SSTables is not likely for most cases.
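
One hedged way to see how many SSTables reads are actually touching for a
given CF:

$ nodetool cfhistograms <keyspace> <column_family>
# the SSTables column is a histogram of sstables consulted per read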

-Bryan



On Wed, Mar 27, 2013 at 12:17 PM, Kanwar Sangha  wrote:

>  Hi – I have a query on Read with Cassandra. We are planning to have
> dynamic column family and each column would be on based a timeseries. 
>
>
> Inserting data — key => ‘xxx′, {column_name => TimeUUID(now),
> :column_value => ‘value’ }, {column_name => TimeUUID(now), :column_value =>
> ‘value’ },..
>
>
> Now this key might be spread across multiple SSTables over a period of
> days. When we do a READ query to fetch say a slice of data from this row
> based on time X->Y , would it need to get data from ALL sstables ? 
>
>
> Thanks,
>
> Kanwar
>
>


Re: Cassandra services down frequently [Version 1.1.4]

2013-04-04 Thread Bryan Talbot
On Thu, Apr 4, 2013 at 1:27 AM,  wrote:

>
> After some time (1 hour / 2 hour) cassandra shut services on one or two
> nodes with follwoing errors;
>


Wonder what the workload and schema is like ...

We can see from below that you've tweaked and disabled many of the memory
"safety valve" and other memory related settings.  Those could be causing
issues too.
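
For comparison, a sketch of roughly what those safety valves ship as in a
1.1-era cassandra.yaml; verify against the defaults bundled with your exact
version:

flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6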



> hinted_handoff_throttle_delay_in_ms: 0
> flush_largest_memtables_at: 1.0
> reduce_cache_sizes_at: 1.0
> reduce_cache_capacity_to: 0.6
> rpc_keepalive: true
> rpc_server_type: sync
> rpc_min_threads: 16
> rpc_max_threads: 2147483647
> in_memory_compaction_limit_in_mb: 256
> compaction_throughput_mb_per_sec: 16
> rpc_timeout_in_ms: 15000
> dynamic_snitch_badness_threshold: 0.0
>


Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-26 Thread Bryan Talbot
I believe that "nodetool rebuild" is used to add a new datacenter, not just
a new host to an existing cluster.  Is that what you ran to add the node?
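
For adding a node to an existing data center on 1.2 with vnodes, the usual
path is to let the new node bootstrap itself rather than run rebuild; a hedged
sketch of the relevant cassandra.yaml pieces on the new node:

num_tokens: 256
# leave initial_token unset; cluster_name, seed list and snitch must match
# the existing cluster, then start the node and let it bootstrap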

-Bryan



On Fri, Apr 26, 2013 at 1:27 PM, John Watson  wrote:

> Small relief we're not the only ones that had this issue.
>
> We're going to try running a shuffle before adding a new node again...
> maybe that will help
>
> - John
>
>
> On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral <
> fsob...@igcorp.com.br> wrote:
>
>> I am using the same version and observed something similar.
>>
>> I've added a new node, but the instructions from Datastax did not work
>> for me. Then I ran "nodetool rebuild" on the new node. After finished this
>> command, it contained two times the load of the other nodes. Even when I
>> ran "nodetool cleanup" on the older nodes, the situation was the same.
>>
>> The problem only seemed to disappear when "nodetool repair" was applied
>> to all nodes.
>>
>> Regards,
>> Francisco Sobral.
>>
>>
>>
>>
>> On Apr 25, 2013, at 4:57 PM, John Watson  wrote:
>>
>> After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running
>> upgradesstables, I figured it would be safe to start adding nodes to the
>> cluster. Guess not?
>>
>> It seems when new nodes join, they are streamed *all* sstables in the
>> cluster.
>>
>>
>> https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png
>>
>> The gray the line machine ran out disk space and for some reason cascaded
>> into errors in the cluster about 'no host id' when trying to store hints
>> for it (even though it hadn't joined yet).
>> The purple line machine, I just stopped the joining process because the
>> main cluster was dropping mutation messages at this point on a few nodes
>> (and it still had dozens of sstables to stream.)
>>
>> I followed this:
>> http://www.datastax.com/docs/1.2/operations/add_replace_nodes
>>
>> Is there something missing in that documentation?
>>
>> Thanks,
>>
>> John
>>
>>
>>
>


Re: How much heap does Cassandra 1.1.11 really need ?

2013-05-03 Thread Bryan Talbot
It's true that a 16GB heap is generally not a good idea; however, it's not
clear from the data provided what problem you're trying to solve.

What is it that you don't like about the default settings?

-Bryan



On Fri, May 3, 2013 at 4:27 AM, Oleg Dulin  wrote:

> Here is my question. It can't possibly be a good set up to use 16gig heap
> space, but this is the best I can do. Setting it to default never worked
> well for me, setting it to 8g doesn't work well either. It can't keep up
> with flushing memtables. It is possibly that someone at some point may have
> broken something in the config files. If I were to look for hints there,
> what should I look at ?
>
> Look at my gc log from Cassandra:
>
> Starts off like this:
>
> 2013-04-29T08:53:44.548-0400: 5.386: [GC 1677824K->11345K(16567552K),
> 0.0509880 secs]
>2 2013-04-29T08:53:47.701-0400: 8.539: [GC 1689169K->42027K(16567552K),
> 0.1269180 secs]
>3 2013-04-29T08:54:05.361-0400: 26.199: [GC
> 1719851K->231763K(16567552K), 0.1436070 secs]
>4 2013-04-29T08:55:44.797-0400: 125.635: [GC
> 1909587K->1480096K(16567552K), 1.2626270 secs]
>5 2013-04-29T08:58:44.367-0400: 305.205: [GC
> 3157920K->2358588K(16567552K), 1.1198150 secs]
>6 2013-04-29T09:01:12.167-0400: 453.005: [GC
> 4036412K->3634298K(16567552K), 1.0098650 secs]
>7 2013-04-29T09:03:35.204-0400: 596.042: [GC
> 5312122K->4339703K(16567552K), 0.4597180 secs]
>8 2013-04-29T09:04:51.562-0400: 672.400: [GC
> 6017527K->4956381K(16567552K), 0.5361800 secs]
>9 2013-04-29T09:04:59.205-0400: 680.043: [GC
> 6634205K->5131825K(16567552K), 0.1741690 secs]
>   10 2013-04-29T09:05:06.638-0400: 687.476: [GC
> 6809649K->5027933K(16567552K), 0.0607470 secs]
>   11 2013-04-29T09:05:13.908-0400: 694.747: [GC
> 6705757K->5012439K(16567552K), 0.0624410 secs]
>   12 2013-04-29T09:05:20.909-0400: 701.747: [GC
> 6690263K->5039538K(16567552K), 0.0618750 secs]
>   13 2013-04-29T09:06:35.914-0400: 776.752: [GC
> 6717362K->5819204K(16567552K), 0.5738550 secs]
>   14 2013-04-29T09:08:05.589-0400: 866.428: [GC
> 7497028K->6678597K(16567552K), 0.6781900 secs]
>   15 2013-04-29T09:08:12.458-0400: 873.296: [GC
> 8356421K->6865736K(16567552K), 0.1423040 secs]
>   16 2013-04-29T09:08:18.690-0400: 879.529: [GC
> 8543560K->6742902K(16567552K), 0.0516470 secs]
>   17 2013-04-29T09:08:24.914-0400: 885.752: [GC
> 8420726K->6725877K(16567552K), 0.0517290 secs]
>   18 2013-04-29T09:08:31.008-0400: 891.846: [GC
> 8403701K->6741781K(16567552K), 0.0532540 secs]
>   19 2013-04-29T09:08:37.201-0400: 898.039: [GC
> 8419605K->6759614K(16567552K), 0.0563290 secs]
>   20 2013-04-29T09:08:43.493-0400: 904.331: [GC
> 8437438K->6772147K(16567552K), 0.0569580 secs]
>   21 2013-04-29T09:08:49.757-0400: 910.595: [GC
> 8449971K->6776883K(16567552K), 0.0558070 secs]
>   22 2013-04-29T09:08:55.973-0400: 916.812: [GC
> 8454707K->6789404K(16567552K), 0.0577230 secs]
>
> ……
>
>
> look what it is today:
>
> 41536 2013-05-03T07:17:13.519-0400: 339814.357: [GC
> 9178946K->9176740K(16567552K), 0.0265830 secs]
> 41537 2013-05-03T07:17:19.556-0400: 339820.394: [GC
> 10854564K->9178449K(16567552K), 0.0253180 secs]
> 41538 2013-05-03T07:17:24.390-0400: 339825.228: [GC
> 10856273K->9179073K(16567552K), 0.0266450 secs]
> 41539 2013-05-03T07:17:30.729-0400: 339831.567: [GC
> 10856897K->9178629K(16567552K), 0.0261150 secs]
> 41540 2013-05-03T07:17:35.584-0400: 339836.422: [GC
> 10856453K->9178586K(16567552K), 0.0250870 secs]
> 41541 2013-05-03T07:17:38.514-0400: 339839.352: [GC
> 10856410K->9179314K(16567552K), 0.0258120 secs]
> 41542 2013-05-03T07:17:43.200-0400: 339844.038: [GC
> 10857138K->9180160K(16567552K), 0.0250150 secs]
> 41543 2013-05-03T07:17:46.566-0400: 339847.404: [GC
> 10857984K->9179071K(16567552K), 0.0264420 secs]
> 41544 2013-05-03T07:17:52.913-0400: 339853.751: [GC
> 10856895K->9179870K(16567552K), 0.0262430 secs]
> 41545 2013-05-03T07:17:58.303-0400: 339859.141: [GC
> 10857694K->9179209K(16567552K), 0.0255130 secs]
> 41546 2013-05-03T07:18:03.427-0400: 339864.265: [GC
> 10857033K->9178316K(16567552K), 0.0263140 secs]
> 41547 2013-05-03T07:18:11.657-0400: 339872.495: [GC
> 10856140K->9178351K(16567552K), 0.0265340 secs]
> 41548 2013-05-03T07:18:17.429-0400: 339878.267: [GC
> 10856175K->9179067K(16567552K), 0.0254820 secs]
> 41549 2013-05-03T07:18:21.251-0400: 339882.089: [GC
> 10856891K->9179680K(16567552K), 0.0264210 secs]
> 41550 2013-05-03T07:18:25.062-0400: 339885.900: [GC
> 10857504K->9178985K(16567552K), 0.0267200 secs]
>
>
>
>
> --
> Regards,
> Oleg Dulin
> NYC Java Big Data Engineer
> http://www.olegdulin.com/
>
>
>


Re: Cassandra running High Load with no one using the cluster

2013-05-06 Thread Bryan Talbot
On Sat, May 4, 2013 at 9:22 PM, Aiman Parvaiz  wrote:

>
> When starting this cluster we set
> > JVM_OPTS="$JVM_OPTS -Xss1000k"
>
>
>

Why did you increase the stack size to 5.5 times the recommended value?
 Since each thread now uses 1000KB minimum just for its stack, a large
number of threads will use a large amount of memory.
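
For comparison, the stack size is normally set in conf/cassandra-env.sh with a
much smaller value; the exact default varies by version and JVM (roughly 180k
to 256k in this era), so treat the number below as illustrative:

JVM_OPTS="$JVM_OPTS -Xss256k"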

-Bryan


Re: index_interval

2013-05-10 Thread Bryan Talbot
If off-heap memory (for index samples, bloom filters, row caches, key
caches, etc) is exhausted, will cassandra experience a memory allocation
error and quit?  If so, are there plans to make the off-heap usage more
dynamic to allow less used pages to be replaced with "hot" data and the
paged-out / "cold" data read back in again on demand?

-Bryan



On Wed, May 8, 2013 at 4:24 PM, Jonathan Ellis  wrote:

> index_interval won't be going away, but you won't need to change it as
> often in 2.0: https://issues.apache.org/jira/browse/CASSANDRA-5521
>
> On Mon, May 6, 2013 at 12:27 PM, Hiller, Dean 
> wrote:
> > I heard a rumor that index_interval is going away?  What is the
> replacement for this?  (we have been having to play with this setting a lot
> lately as too big and it gets slow yet too small and cassandra uses way too
> much RAM…we are still trying to find the right balance with this setting).
> >
> > Thanks,
> > Dean
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
>


Re: index_interval

2013-05-13 Thread Bryan Talbot
So will cassandra provide a way to limit its off-heap usage to avoid
unexpected OOM kills?  I'd much rather have performance degrade when 100%
of the index samples no longer fit in memory rather than the process being
killed with no way to stabilize it without adding hardware or removing data.

-Bryan


On Fri, May 10, 2013 at 7:44 PM, Edward Capriolo wrote:

> If you use your off heap memory linux has an OOM killer, that will kill a
> random tasks.
>
>
> On Fri, May 10, 2013 at 11:34 AM, Bryan Talbot wrote:
>
>> If off-heap memory (for index samples, bloom filters, row caches, key
>> caches, etc) is exhausted, will cassandra experience a memory allocation
>> error and quit?  If so, are there plans to make the off-heap usage more
>> dynamic to allow less used pages to be replaced with "hot" data and the
>> paged-out / "cold" data read back in again on demand?
>>
>>
>>


Re: index_interval

2013-05-13 Thread Bryan Talbot
Maybe I should ask the question a different way.

Currently, if all index samples do not fit in the java heap the jvm will
eventually OOM and the process will crash.  The proposed change sounds like
it will move the index samples to off-heap storage, but if that cannot
hold all samples, the process will be killed.

Can the index sample storage be treated more like key cache or row cache
where the total space used can be limited to something less than all
available system ram, and space is recycled using an LRU (or configurable)
algorithm?

-Bryan



On Mon, May 13, 2013 at 9:10 PM, Bryan Talbot wrote:

> So will cassandra provide a way to limit its off-heap usage to avoid
> unexpected OOM kills?  I'd much rather have performance degrade when 100%
> of the index samples no longer fit in memory rather than the process being
> killed with no way to stabilize it without adding hardware or removing data.
>
> -Bryan
>
>
> On Fri, May 10, 2013 at 7:44 PM, Edward Capriolo wrote:
>
>> If you use your off heap memory linux has an OOM killer, that will kill a
>> random tasks.
>>
>>
>> On Fri, May 10, 2013 at 11:34 AM, Bryan Talbot wrote:
>>
>>> If off-heap memory (for index samples, bloom filters, row caches, key
>>> caches, etc) is exhausted, will cassandra experience a memory allocation
>>> error and quit?  If so, are there plans to make the off-heap usage more
>>> dynamic to allow less used pages to be replaced with "hot" data and the
>>> paged-out / "cold" data read back in again on demand?
>>>
>>>
>>>


Re: SSTable size versus read performance

2013-05-16 Thread Bryan Talbot
512 sectors for read-ahead.  Are your new fancy SSD drives using large
sectors?  If your read-ahead is really reading 512 x 4KB per random IO,
then that 2 MB per read seems like a lot of extra overhead.
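
A sketch of checking and lowering it per device; note that blockdev reports
read-ahead in 512-byte sectors on most Linux kernels, and the device name
below is a placeholder:

$ sudo blockdev --getra /dev/sdX
512
$ sudo blockdev --setra 16 /dev/sdX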

-Bryan




On Thu, May 16, 2013 at 12:35 PM, Keith Wright  wrote:

> We actually have it set to 512.  I have tried decreasing my SSTable size
> to 5 MB and changing the chunk size to 8 kb
>
> From: Igor 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, May 16, 2013 1:55 PM
>
> To: "user@cassandra.apache.org" 
> Subject: Re: SSTable size versus read performance
>
> My 5 cents: I'd check blockdev --getra for data drives - too high values
> for readahead (default to 256 for debian) can hurt read performance.
>
>


Re: update does not apply to any replica if consistency = ALL and one replica is down

2013-05-17 Thread Bryan Talbot
I think you're conflating "may" with "must".  That article says that
updates "may" still be applied to some replicas when there is a failure and
I believe that still is the case.  However, if the coordinator knows that
the CL can't be met before even attempting the write, I don't think it will
attempt the write.

-Bryan



On Fri, May 17, 2013 at 1:48 AM, Sergey Naumov  wrote:

> As described here (
> http://maxgrinev.com/2010/07/12/update-idempotency-why-it-is-important-in-cassandra-applications-2/),
> if consistency level couldn't be met, updates are applied anyway on
> functional replicas, and they could be propagated later to other replicas
> using repair mechanisms or by issuing the same request later, as update
> operations are idempotent in Cassandra.
>
> But... on my configuration (Cassandra 1.2.4, python CQL 1.0.4, DC1 - 3
> nodes, DC2 - 3 nodes, DC3 - 1 node, RF={DC1:3, DC2:2, DC3:1}, Random
> Partitioner, GossipingPropertyFileSnitch, one node in DC1 is deliberately
> down - and, as RF for DC1 is 3, this down node is a replica node for 100%
> of records),  when I try to insert one record with consistency level of
> ALL, this insert does not appear on any replica (-s30 - is a serial of
> UUID1: 001e--1000--x (30 is 1e in hex), -n1 mean
> that we will insert/update a single record with first id from this series -
> 001e--1000--):
> *write with consistency ALL:*
> cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cALL
> Traceback (most recent call last):
>   File "./aux/fastinsert.py", line 54, in insert
> curs.execute(cmd, consistency_level=p.conlvl)
> OperationalError: Unable to complete request: one or more nodes were
> unavailable.
> Last record UUID is 001e--1000--*
>
> *
> about 10 seconds passed...
> *
> read with consistency ONE:*
> cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cONE
> Total records read: *0*
> Last record UUID is 001e--1000--
> *read with consistency QUORUM:*
> cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
> Total records read: *0*
> Last record UUID is 001e--1000--
> *write with consistency QUORUM:*
> cassandra@host11:~/Cassandra$ ./insert.sh -s30 -n1 -cQUORUM
> Last record UUID is 001e--1000--
> *read with consistency QUORUM:*
> cassandra@host11:~/Cassandra$ ./select.sh -s30 -n1 -cQUORUM
> Total records read: *1*
> Last record UUID is 001e--1000--
>
> Is it a new feature of Cassandra that it does not perform a write to any
> replica if consistency couldn't be satisfied? If so, then is it true for
> all cases, for example when returned error is "OperationalError: Request
> did not complete within rpc_timeout"?
>
> Thanks in advance,
> Sergey Naumov.
>


Re: In a multiple data center setup, do all the data centers have complete data irrespective of RF?

2013-05-20 Thread Bryan Talbot
Option #3 since it depends on the placement strategy and not the
partitioner.

-Bryan



On Mon, May 20, 2013 at 6:24 AM, Pinak Pani <
nishant.has.a.quest...@gmail.com> wrote:

> I just wanted to verify the fact that if I happen to setup a multi
> data-center Cassandra setup, will each data center have the complete
> data-set with it?
>
> Say, I have two data-center each with two nodes, and a partitioner that
> ranges from 0 to 100. Initial token assigned this way
>
> DC1:N1 = 00
> DC2:N1 = 25
> DC1:N2 = 50
> DC2:N2 = 75
>
> where DCX is data center X, NX is node X. *Which one the following
> options is true?*
>
> *Option #1: *DC1 and DC2, each will hold complete dataset with keys
> bucketed as follows
> DC1:N1 = (50, 00] => 50 keys
> DC1:N2 = (00, 50] => 50 keys
> 
> Complete data set mirrored at DC1
>
> DC2:N1 = (75, 25] => 50 keys
> DC2:N2 = (25, 75] => 50 keys
> 
> Complete data set mirrored at DC2
>
> *Option #2: *DC1 and DC2, each will hold 50% of the data with keys
> bucketed as follows (much the same way in a single C setup)
> DC1:N1 = (75, 00] => 25 keys
> DC2:N1 = (00, 25] => 25 keys
> DC1:N2 = (25, 50] => 25 keys
> DC2:N2 = (50, 75] => 25 keys
> 
> data is divided into the two data centers.
>
> Thanks,
> PP
>


Re: In a multiple data center setup, do all the data centers have complete data irrespective of RF?

2013-05-20 Thread Bryan Talbot
On Mon, May 20, 2013 at 10:01 AM, Pinak Pani <
nishant.has.a.quest...@gmail.com> wrote:

> Assume NetworkTopologyStrategy. So, I wanted to know whether a data-center
> will contain all the keys?
>
> This is the case:
>
> CREATE KEYSPACE appKS
>   WITH placement_strategy = 'NetworkTopologyStrategy'
>   AND strategy_options={DC1:3, DC2:3};
>
> Does DC1 and DC2 each contain complete database corpus? That is, if DC1
> blows, will I get all the data from DC2? Assume RF = 1.
>


Your config sample isn't RF=1 it's RF=3.  That's what the DC1:3 and DC2:3
mean -- set RF=3 for DC1 and RF=3 for DC2 for all rows of all CFs in this
keyspace.





>
> Sorry, for the very elementary question. This is the post that made me ask
> this question:
> http://www.onsip.com/blog/2011/07/15/intro-to-cassandra-and-networktopologystrategy
>
> It says,
>
> "NTS creates an iterator for EACH datacenter and places writes discretely
> for each. The result is that NTS basically breaks each datacenter into it's
> own logical ring when it places writes."
>

A lot of things change in fast moving projects in 2 years, so you'll have
to take anything written 2 years ago with a grain of salt and figure out if
it's still true with whatever version you're using.




>
> That seems to mean that each data-center behaves as an independent ring
> with initial_token. So, If I have 2 data centers and NTS, I am basically
> mirroring the database. Right?
>
>

Depending on how you've configured your placement strategy, but if you're
using DC1:3 and DC2:3 like you have above, then yes, you'd expect to have 3
copies of every row in both data centers for that keyspace.

-Bryan


Re: data clean up problem

2013-05-28 Thread Bryan Talbot
I think what you're asking for (efficient removal of TTL'd write-once data)
is already in the works but not until 2.0 it seems.

https://issues.apache.org/jira/browse/CASSANDRA-5228

-Bryan



On Tue, May 28, 2013 at 1:26 PM, Hiller, Dean  wrote:

> Oh and yes, astyanax uses client side response latency and cassandra does
> the same as a client of the other nodes.
>
> Dean
>
> On 5/28/13 2:23 PM, "Hiller, Dean"  wrote:
>
> >Actually, we did a huge investigation into this on astyanax and cassandra.
> > Astyanax if I remember worked if configured correctly but cassandra did
> >not, so we patched cassandra… for some reason cassandra, once on the
> >co-ordinator which had one copy of the data, would wait for both other nodes
> >to respond even though we are CL=QUORUM on RF=3… we put in a patch for that
> >which my teammate is still supposed to submit.  Cassandra should only wait
> >for one node… at least I think that is how I remember it. We have it in our
> >backlog to get the patch into cassandra.
> >
> >Previously one slow node would slow down our website but this no longer
> >happens to us such that when compaction kicks off on a single node, our
> >cluster keeps going strong.
> >
> >Dean
> >
> >On 5/28/13 2:12 PM, "Dwight Smith"  wrote:
> >
> >>How do you determine the slow node, client side response latency?
> >>
> >>-Original Message-
> >>From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
> >>Sent: Tuesday, May 28, 2013 1:10 PM
> >>To: user@cassandra.apache.org
> >>Subject: Re: data clean up problem
> >>
> >>How much disk used on each node?  We run the suggested < 300G per node as
> >>above that compactions can have trouble keeping up.
> >>
> >>Ps. We run compactions during peak hours just fine because our client
> >>reroutes to the 2 of 3 nodes not running compactions based on seeing the
> >>slow node so performance stays fast.
> >>
> >>The easy route is to of course double your cluster and halve the data
> >>sizes per node so compaction can keep up.
> >>
> >>Dean
> >>
> >>From: cem <cayiro...@gmail.com>
> >>Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> >>Date: Tuesday, May 28, 2013 1:45 PM
> >>To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> >>Subject: Re: data clean up problem
> >>
> >>Thanks for the answer.
> >>
> >>Sorry for the misunderstanding. I tried to say I don't send delete
> >>requests from the client, so it is safe to set gc_grace to 0. TTL is used for
> >>data clean up. I am not running a manual compaction. I tried that once
> >>but it took a lot of time to finish and I will not have this amount of
> >>off-peak time in production to run this. I even set the compaction
> >>throughput to unlimited and it didn't help that much.
> >>
> >>Disk size just keeps on growing but I know that there is enough space to
> >>store 1 day data.
> >>
> >>What do you think about time range partitioning? Creating a new column
> >>family for each partition and dropping it when you know that all records are
> >>expired.
> >>
> >>I have 5 nodes.
> >>
> >>Cem.
> >>
> >>
> >>
> >>
> >>On Tue, May 28, 2013 at 9:37 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
> >>Also, how many nodes are you running?
> >>
> >>From: cem <cayiro...@gmail.com>
> >>Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> >>Date: Tuesday, May 28, 2013 1:17 PM
> >>To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> >>Subject: Re: data clean up problem
> >>
> >>Thanks for the answer but it is already set to 0 since I don't do any
> >>delete.
> >>
> >>Cem
> >>
> >>
> >>On Tue, May 28, 2013 at 9:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>You need to change the gc_grace time of the column family. It defaults to
> >>10 days. By default the tombstones will not go away for 10 days.
> >>
> >>
> >>On Tue, May 28, 2013 at 2:46 PM, cem <cayiro...@gmail.com> wrote:
> >>Hi Experts,
> >>
> >>
> >>We have a general problem about cleaning up data from the disk. I need to
> >>free the disk space after the retention period and the customer wants to
> >>dimension the disk space based on that.
> >>
> >>After running multiple performance tests with TTL of 1 day we saw that
> >>the compaction couldn't keep up with the request rate. Disks were getting
> >>full after 3 days. There were also a lot of sstables that are older th

Re: Cassandra performance decreases drastically with increase in data size.

2013-05-30 Thread Bryan Talbot
One or more of these might be effective depending on your particular usage

- remove data (rows especially)
- add nodes
- add ram (has limitations)
- reduce bloom filter space used by increasing fp chance
- reduce row and key cache sizes
- increase index sample ratio
- reduce compaction concurrency and throughput
- upgrade to cassandra 1.2 which does some of these things for you
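
A few of these can be adjusted at runtime; a hedged sketch with illustrative
values, using the nodetool syntax of the 1.1/1.2 era:

$ nodetool setcompactionthroughput 8    # MB/s; lower means less compaction I/O
$ nodetool setcachecapacity 50 0        # key cache 50 MB, row cache disabled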


-Bryan



On Thu, May 30, 2013 at 2:31 PM, srmore  wrote:

> You are right, it looks like I am doing a lot of GC. Is there any
> short-term solution for this other than bumping up the heap ? because, even
> if I increase the heap I will run into the same issue. Only the time before
> I hit OOM will be lengthened.
>
> It will be while before we go to latest and greatest Cassandra.
>
> Thanks !
>
>
>
> On Thu, May 30, 2013 at 12:05 AM, Jonathan Ellis wrote:
>
>> Sounds like you're spending all your time in GC, which you can verify
>> by checking what GCInspector and StatusLogger say in the log.
>>
>> Fix is increase your heap size or upgrade to 1.2:
>> http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
>>
>> On Wed, May 29, 2013 at 11:32 PM, srmore  wrote:
>> > Hello,
>> > I am observing that my performance is drastically decreasing when my
>> data
>> > size grows. I have a 3 node cluster with 64 GB of ram and my data size
>> is
>> > around 400GB on all the nodes. I also see that when I re-start
>> Cassandra the
>> > performance goes back to normal and then again starts decreasing after
>> some
>> > time.
>> >
>> > Some hunting landed me to this page
>> > http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks
>> > about the large data sets and explains that it might be because I am
>> going
>> > through multiple layers of OS cache, but does not tell me how to tune
>> it.
>> >
>> > So, my question is, are there any optimizations that I can do to handle
>> > these large datatasets ?
>> >
>> > and why does my performance go back to normal when I restart Cassandra ?
>> >
>> > Thanks !
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder, http://www.datastax.com
>> @spyced
>>
>
>


Re: Multiple JBOD data directory

2013-06-05 Thread Bryan Talbot
If you're using cassandra 1.2 then you have a choice specified in the yaml


# policy for data disk failures:
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
#   can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
#  remaining available sstables.  This means you WILL see obsolete
#  data at CL.ONE!
# ignore: ignore fatal errors a


-Bryan



On Wed, Jun 5, 2013 at 6:11 AM, Christopher Wirt wrote:

> I would hope so. Just trying to get some confirmation from someone with
> production experience.
>
> Thanks for your reply
>
> *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
> *Sent:* 05 June 2013 13:31
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multiple JBOD data directory
>
> Though I am a newbie, I just had a thought regarding your question 'How
> will it handle requests for data which is unavailable?': wouldn't the data
> be served in that case from other nodes where it has been replicated?
>
> Regards,
>
> Shahab
>
> On Wed, Jun 5, 2013 at 5:32 AM, Christopher Wirt wrote:
>
> Hello,
>
> We’re thinking about using multiple data directories, each with its own
> disk, and are currently testing this against a RAID0 config.
>
> I’ve seen that there is failure handling with multiple JBOD.
>
> e.g.
>
> We have two data directories mounted to separate drives:
>
> /disk1
>
> /disk2
>
> One of the drives fails.
>
> Will Cassandra continue to work?
>
> How will it handle requests for data which is unavailable?
>
> If I want to add an additional drive, what is the best way to go about
> redistributing the data?
>
> Thanks,
>
> Chris
>


Re: Multiple JBOD data directory

2013-06-05 Thread Bryan Talbot
... sorry, message got cut off


# policy for data disk failures:
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
#   can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
#  remaining available sstables.  This means you WILL see obsolete
#  data at CL.ONE!
# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Cassandra
disk_failure_policy: stop
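
For the JBOD setup being tested, the relevant bits of cassandra.yaml would
look roughly like this (the mount points and paths are just examples):

# one data directory per physical disk
data_file_directories:
    - /disk1/cassandra/data
    - /disk2/cassandra/data

# keep the commitlog on its own device if possible
commitlog_directory: /var/lib/cassandra/commitlog

# what to do when one of the data disks fails (1.2+)
disk_failure_policy: best_effort

With best_effort the node keeps serving whatever sstables remain on the
healthy disks, so requests for the missing data have to be satisfied by
replicas (or may return stale or missing data at CL.ONE, as the comment
above warns).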





On Wed, Jun 5, 2013 at 2:59 PM, Bryan Talbot  wrote:

> If you're using cassandra 1.2 then you have a choice specified in the yaml
>
>
> # policy for data disk failures:
> # stop: shut down gossip and Thrift, leaving the node effectively dead, but
> #   can still be inspected via JMX.
> # best_effort: stop using the failed disk and respond to requests based on
> #  remaining available sstables.  This means you WILL see obsolete
> #  data at CL.ONE!
> # ignore: ignore fatal errors a
>
>
> -Bryan
>
>
>
> On Wed, Jun 5, 2013 at 6:11 AM, Christopher Wirt wrote:
>
>> I would hope so. Just trying to get some confirmation from someone with
>> production experience.
>>
>> Thanks for your reply
>>
>> *From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
>> *Sent:* 05 June 2013 13:31
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Multiple JBOD data directory
>>
>> Though I am a newbie, I just had a thought regarding your question 'How
>> will it handle requests for data which is unavailable?': wouldn't the data
>> be served in that case from other nodes where it has been replicated?
>>
>> Regards,
>>
>> Shahab
>>
>> On Wed, Jun 5, 2013 at 5:32 AM, Christopher Wirt wrote:
>>
>> Hello,
>>
>> We’re thinking about using multiple data directories, each with its own
>> disk, and are currently testing this against a RAID0 config.
>>
>> I’ve seen that there is failure handling with multiple JBOD.
>>
>> e.g.
>>
>> We have two data directories mounted to separate drives:
>>
>> /disk1
>>
>> /disk2
>>
>> One of the drives fails.
>>
>> Will Cassandra continue to work?
>>
>> How will it handle requests for data which is unavailable?
>>
>> If I want to add an additional drive, what is the best way to go about
>> redistributing the data?
>>
>> Thanks,
>>
>> Chris
>>
>
>


Re: [Cassandra] Conflict resolution in Cassandra

2013-06-06 Thread Bryan Talbot
For generic questions like this, google is your friend:
http://lmgtfy.com/?q=cassandra+conflict+resolution

-Bryan


On Thu, Jun 6, 2013 at 11:23 AM, Emalayan Vairavanathan <
svemala...@yahoo.com> wrote:

> Hi All,
>
> Can someone tell me about the conflict resolution mechanisms provided by
> Cassandra?
>
> More specifically, does Cassandra provide a way to define
> application-specific conflict resolution mechanisms (on a per-row /
> per-column basis)? Or does it automatically manage conflicts based on
> some synchronization algorithm?
>
>
> Thank you
> Emalayan
>
>
>


Re: Compaction not running

2013-06-18 Thread Bryan Talbot
Manual compaction for LCS doesn't really do much.  It certainly doesn't
compact all those little files into bigger files.  What makes you think
that compactions are not occurring?
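
A couple of quick ways to check (standard nodetool; the keyspace/CF names
and paths below are just placeholders):

# anything currently compacting, or pending tasks queued?
nodetool -h localhost compactionstats

# per-CF sstable counts
nodetool -h localhost cfstats

# or count the data files on disk directly
ls /var/lib/cassandra/data/MyKeyspace/MyCF/*-Data.db | wc -l

# and see whether compactions are being logged at all
grep -i compact /var/log/cassandra/system.log | tail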

-Bryan



On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter wrote:

> On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter 
> wrote:
>
>> On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli wrote:
>>
>>> On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter 
>>> wrote:
>>> > We are running a test system with Leveled compaction on Cassandra-1.2.4.
>>> > While doing an initial load of the data one of the nodes ran out of file
>>> > descriptors and since then it hasn't been automatically compacting.
>>>
>>> You have (at least) two options :
>>>
>>> 1) increase file descriptors available to Cassandra with ulimit, if
>>> possible
>>> 2) increase the size of your sstables with levelled compaction, such
>>> that you have fewer of them
>>>
>>
>> Oops, I wasn't clear enough.
>>
>> I have increased the number of file descriptors and no longer have a file
>> descriptor issue. However the node still doesn't compact automatically. If
>> I run a 'nodetool compact' it will do a small amount of compaction and then
>> stop. The column family is using LCS.
>>
>
> Any ideas on this? Compaction is still not automatically running for one
> of my nodes.
>
> thanks
>
>
>>
>> cheers
>>
>>
>>>
>>> =Rob
>>>
>>
>>
>>
>> --
>>
>> *Franc Carter* | Systems architect | Sirca Ltd
>>  
>>
>> franc.car...@sirca.org.au | www.sirca.org.au
>>
>> Tel: +61 2 8355 2514
>>
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>>
>>
>
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  
>
> franc.car...@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 8355 2514
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>


Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread Bryan Talbot
bloom_filter_fp_chance = 0.7 is probably way too large to be effective;
you'll likely have issues compacting deleted rows and get poor read
performance with a value that high.  I'd guess that anything larger than
0.1 might as well be 1.0.
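
Whatever value you settle on, the mechanics of changing it and making it
take effect look roughly like this (cassandra-cli syntax; "ks1"/"cf1" are
placeholder names, and filters for existing sstables only change once those
sstables are rewritten):

# from cassandra-cli
use ks1;
update column family cf1 with bloom_filter_fp_chance = 0.1;

# then, on each node, rebuild the sstables so the new filters are written
nodetool -h localhost upgradesstables ks1 cf1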

-Bryan



On Fri, Jun 21, 2013 at 5:58 AM, srmore  wrote:

>
> On Fri, Jun 21, 2013 at 2:53 AM, aaron morton wrote:
>
>> > nodetool -h localhost flush didn't do much good.
>>
>> Do you have 100's of millions of rows ?
>> If so see recent discussions about reducing the bloom_filter_fp_chance
>> and index_sampling.
>>
> Yes, I have 100's of millions of rows.
>
>
>>
>> If this is an old schema you may be using the very old setting of
>> 0.000744 which creates a lot of bloom filters.
>>
> The bloom_filter_fp_chance value was changed from the default to 0.1. I looked
> at the filters and they are about 2.5G on disk and I have around 8G of heap.
> I will try increasing the value to 0.7 and report my results.
>
> It also appears to be a case of hard GC failure (as Rob mentioned), as the
> heap is never released even after 24+ hours of idle time; the JVM needs to
> be restarted to reclaim the heap.
>
> Cheers
>>
>>-
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/06/2013, at 6:36 AM, Wei Zhu  wrote:
>>
>> If you want, you can try to force the GC through Jconsole.
>> Memory->Perform GC.
>>
>> It theoretically triggers a full GC, but when it actually happens depends
>> on the JVM.
>>
>> -Wei
>>
>> --
>> *From: *"Robert Coli" 
>> *To: *user@cassandra.apache.org
>> *Sent: *Tuesday, June 18, 2013 10:43:13 AM
>> *Subject: *Re: Heap is not released and streaming hangs at 0%
>>
>> On Tue, Jun 18, 2013 at 10:33 AM, srmore  wrote:
>> > But then shouldn't the JVM GC it eventually? I can still see Cassandra
>> > alive and kicking, but it looks like the heap is locked up even after the
>> > traffic has long stopped.
>>
>> No, when GC system fails this hard it is often a permanent failure
>> which requires a restart of the JVM.
>>
>> > nodetool -h localhost flush didn't do much good.
>>
>> This adds support to the idea that your heap is too full, and not full
>> of memtables.
>>
>> You could try nodetool -h localhost invalidatekeycache, but that
>> probably will not free enough memory to help you.
>>
>> =Rob
>>
>>
>>
>