Row caching memory usage in Cassandra 1.0.x

2012-10-22 Thread Josh
Hi, I'm hoping to get some help tuning our 1.0.x cluster w.r.t. row caching.

We're using the Netflix Priam client, so unfortunately upgrading to 1.1.x is
out of the question for now. But until we find a way around that, is there any
way to determine where the 'sweet spot' is between heap size, row cache size,
and leaving the rest of the RAM available to the OS?

We're using the Oracle JVM with JNA so we can do off-heap row caching, but
I'm not sure how to tell how much RAM it's using, so I'm not comfortable
increasing it further. (We currently have it set to 100,000 rows and are
already seeing ~85% hit rates, so we've stopped raising it for now.)
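In lieu of a direct metric (1.0.x doesn't expose the off-heap cache footprint), a back-of-envelope estimate from row count and average serialized row size can at least bound it. A sketch only; the per-row overhead constant below is an assumption, not a measured Cassandra value:

```python
def row_cache_offheap_estimate(rows_cached, avg_row_bytes, overhead_per_row=64):
    """Rough off-heap footprint of a serialized row cache, in bytes.

    overhead_per_row is an assumed per-entry bookkeeping cost, not a
    measured Cassandra value; treat the result as an order-of-magnitude
    bound, not an exact figure.
    """
    return rows_cached * (avg_row_bytes + overhead_per_row)

# 100,000 cached rows at ~2 KiB each -> roughly 0.2 GiB off heap
print("%.2f GiB" % (row_cache_offheap_estimate(100_000, 2048) / 2**30))
```

The average serialized row size would have to come from your own data, e.g. the mean row size reported by nodetool cfstats.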

Thanks for any advice,

-Josh





Re: Proxy instances?

2010-04-02 Thread Josh
Is the notion here that you'd run all writes/reads through that node
and let it decide where to get the data from?

I've been working on a C# client library, and I've been picking a node
at random from the cluster and letting it figure things out.  Would a
setup like this be better, keeping the traffic load off of the
data-storing instances?  Or would it be better to point at a load balancer
that does it?  Or is the carnival approach (Pick a node!  Any node!)
better?

On Thu, Apr 1, 2010 at 6:19 PM, David King  wrote:
> Is it possible to have Cassandra instances that serve only as proxies to the 
> rest of the cluster, but have no storage themselves? Maybe with a keyspace 
> length of 0?



-- 
josh
@schulz
http://schulzone.org


RE: C* files getting stuck

2016-06-30 Thread Josh Smith
I have also faced this issue.  Rebooting the instance has been our fix so far.  
I am very interested if anyone else has a solution.  I was unable to get a 
definitive answer from Datastax during the last Cassandra Summit.

From: Amit Singh F [mailto:amit.f.si...@ericsson.com]
Sent: Thursday, June 30, 2016 7:02 AM
To: user@cassandra.apache.org
Subject: RE: C* files getting stuck

Hi All,

Please check if anybody has faced the below issue and, if so, what can best be
done to avoid it.
Thanks in advance.

Regards
Amit Singh

From: Amit Singh F [mailto:amit.f.si...@ericsson.com]
Sent: Wednesday, June 29, 2016 3:52 PM
To: user@cassandra.apache.org
Subject: C* files getting stuck


Hi All

We are running Cassandra 2.0.14 and disk usage is very high. On investigating
further, we found that there are around 4-5 files (~150 GB) in a stuck
(deleted-but-still-open) state.

Command Fired : lsof /var/lib/cassandra | grep -i deleted

Output :

java 12158 cassandra 308r REG 8,16 34396638044 12727268 /var/lib/cassandra/data/mykeyspace/mycolumnfamily/mykeyspace-mycolumnfamily-jb-16481-Data.db (deleted)
java 12158 cassandra 327r REG 8,16 101982374806 12715102 /var/lib/cassandra/data/mykeyspace/mycolumnfamily/mykeyspace-mycolumnfamily-jb-126861-Data.db (deleted)
java 12158 cassandra 339r REG 8,16 12966304784 12714010 /var/lib/cassandra/data/mykeyspace/mycolumnfamily/mykeyspace-mycolumnfamily-jb-213548-Data.db (deleted)
java 12158 cassandra 379r REG 8,16 15323318036 12714957 /var/lib/cassandra/data/mykeyspace/mycolumnfamily/mykeyspace-mycolumnfamily-jb-182936-Data.db (deleted)

We are not able to see these files in any directory. This is somewhat similar
to https://issues.apache.org/jira/browse/CASSANDRA-6275, which is marked fixed,
but the issue is still present on a later version. Also, no compaction-related
errors are reported in the logs.
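Since the files no longer appear in any directory, the space they pin can be estimated from the same lsof output. This one-liner assumes lsof's seventh column is the size in bytes and that deleted entries end with "(deleted)", which matches the output above; adjust the path for your data directory:

```shell
# Sum the sizes (lsof column 7, bytes) of files that are deleted on disk
# but still held open by the Cassandra process.
lsof -nP /var/lib/cassandra 2>/dev/null \
  | awk '/\(deleted\)$/ { sum += $7 } END { printf "%.1f GiB held open after deletion\n", sum / 1073741824 }'
```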

So could anyone please suggest how to counter this? Restarting Cassandra is
one workaround, but the issue keeps recurring, and restarting a production
machine so frequently is not recommended.

We also know that this version is no longer supported, but there is a high
probability that the issue can occur in later versions too.
Regards
Amit Singh


Problems with schema creation

2016-09-19 Thread Josh Smith
I have an automated tool we created which will create a keyspace and its
tables, and add indexes in Solr.  But when I run the tool, even for a new
keyspace, I end up with ghost tables with the name “”.  If I look in
system_schema.tables I see a bunch of tables all named
(\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00). Am I
creating the tables and schema too fast, or is something else wrong? Has anyone
else run into this problem before? I have searched the mailing list and Google
but have not found anything.  I am currently running DSE 5.0 (C* 3.0.2) on 5
m4.4xl nodes.  Any help would be appreciated.

Josh Smith


JVM safepoints, mmap, and slow disks

2016-10-07 Thread Josh Snyder

I don't imagine there's an easy solution here. I plan to go ahead with
mitigation #1: "don't tolerate block devices that are slow", but I'd appreciate
any approach that doesn't require my hardware to be flawless all the time.

Josh

[1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
[2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop


Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
On Sat, Oct 8, 2016 at 9:02 PM, Ariel Weisberg  wrote:
...

> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652

That StackOverflow headline is interesting. Based on my reading of Hotspot's
code, it looks like sun.misc.unsafe is used under the hood to perform mmapped
I/O. I need to learn more about Hotspot's implementation before I can comment
further.

> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> with out prefetching though.

Not sure what you mean here. Aren't there going to be cache and TLB misses for
any I/O, whether via mmap or syscall?

> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.

The approaches I've seen just involve something in userspace going through and
touching every desired page. It works, especially if you touch pages in
parallel.

Thanks for the pointers. If I get anywhere with them, I'll be sure to
let you know.

Josh

> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>>
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>>>
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR out performed pread variants - in no cases have we noticed 
>>> long time to safe point (then again our IO is lightning fast).
>>>
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
>>>>
>>>> Linux automatically uses free memory as cache.  It's not swap.
>>>>
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin  
>>>> wrote:
>>>>> __
>>>>> Sorry, I don't catch something. What page (memory) cache can exist if 
>>>>> there is no swap file.
>>>>> Where are those page written/read?
>>>>>
>>>>>
>>>>> Best regards, Vladimir Yudovin,
>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>>> Azure and SoftLayer. Launch your cluster in minutes.*
>>>>>
>>>>>  On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg* wrote 
>>>>>> Hi,
>>>>>>
>>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using 
>>>>>> free memory a file cache. It uses free (and some of the time not so 
>>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>>
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>>
>>>>>> When Linux decides it needs free memory it can either evict stuff from 
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous 
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>>
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing 
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do and it 
>>>>>> does have an impact on the performance of the page cache even if you 
>>>>>> have swap disabled?
>>>>>>
>>>>>> Ariel
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>>> >Page cache is data pending flush to disk and data cached from disk.
>>>>>>>
>>>>>>> Do you mean file cache?
>>>>>>>
>>>>>>>
>>>>>>> Best regards, Vladimir Yudovin,
>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra 
>>>>>>> on Azure and SoftLayer.
>>>>>>> Launch your cluster in minutes.*
>>>>>>>
>>>>>>>
>>>>>>>  On Sat, 08 Oct 2016 13:40:19 -0400 *Arie

Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
Do you know if there are any publicly available benchmarks on disk_access_mode,
preferably after the fix from CASSANDRA-10249?

If it turns out that syscall I/O is not significantly slower, I'd consider
switching. If I don't know the costs, I think I'd prefer to stick with the
devil I know how to mitigate (i.e., by policing my block devices) rather than
switching to the devil that is non-standard and undocumented. :)
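For reference, the undocumented knob under discussion looks like this in cassandra.yaml (the name and accepted values are assumed from that era's codebase, so verify against your version before relying on it):

```yaml
# cassandra.yaml -- controls how SSTable data files are read.
# "auto" (the default) mmaps data files on 64-bit JVMs; "standard" forces
# buffered/pread I/O, which avoids page faults while reaching a safepoint.
disk_access_mode: standard
```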

I may have time to do some benchmarking myself. If so, I'll be sure to inform
the list.

Josh

On Sun, Oct 9, 2016 at 2:39 AM, Benedict Elliott Smith
 wrote:
> The biggest problem with pread was the issue of over reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time to safe point issues it's very likely a worthwhile switch
> to flip.
>
>
> On Sunday, 9 October 2016, Graham Sanderson  wrote:
>>
>> I was using the term “touch” loosely to hopefully mean pre-fetch, though I
>> suspect (I think intel has been de-emphasizing) you can still do a sensible
>> prefetch instruction in native code. Even if not you are still better
>> blocking in JNI code - I haven’t looked at the link to see if the correct
>> barriers are enforced by the sun-misc-unsafe method.
>>
>> I do suspect that you’ll see up to about 5-10% sys call overhead if you
>> hit pread.
>>
>> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg  wrote:
>> >
>> > Hi,
>> >
>> > This is starting to get into dev list territory.
>> >
>> > Interesting idea to touch every 4K page you are going to read.
>> >
>> > You could use this to minimize the cost.
>> >
>> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>> >
>> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
>> > with out prefetching though.
>> >
>> > There is a system call to page the memory in which might be better for
>> > larger reads. Still no guarantee things stay cached though.
>> >
>> > Ariel
>> >
>> >
>> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> >> I haven’t studied the read path that carefully, but there might be a
>> >> spot at the C* level rather than JVM level where you could effectively do 
>> >> a
>> >> JNI touch of the mmap region you’re going to need next.
>> >>
>> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>> >>>
>> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
>> >>> threads don’t have to reach safepoints at the same time. That said we 
>> >>> make
>> >>> heavy use of Cassandra (with off heap memtables - not directly related 
>> >>> but
>> >>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>> >>> because
>> >>> it FAR out performed pread variants - in no cases have we noticed long 
>> >>> time
>> >>> to safe point (then again our IO is lightning fast).
>> >>>
>> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad 
>> >>>> wrote:
>> >>>>
>> >>>> Linux automatically uses free memory as cache.  It's not swap.
>> >>>>
>> >>>> http://www.tldp.org/LDP/lki/lki-4.html
>> >>>>
>> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin
>> >>>>  wrote:
>> >>>>> __
>> >>>>> Sorry, I don't catch something. What page (memory) cache can exist
>> >>>>> if there is no swap file.
>> >>>>> Where are those page written/read?
>> >>>>>
>> >>>>>
>> >>>>> Best regards, Vladimir Yudovin,
>> >>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>> Cassandra on Azure and SoftLayer. Launch your cluster in minutes.*
>> >>>>>
>> >>>>>  On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg* wrote 
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains
>> >>>>>> using free memory a

Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
That's a great idea. Even if the results were immediately thrown away,
pre-reading in a JNI method would eliminate cache misses with very high
probability. The only thing I'd worry about is the increased overhead of JNI
interfering with the fast path (cache hits). I don't have enough knowledge on
the read path or about JNI latency to comment on whether this concern is "real"
or not.

Josh

On Sat, Oct 8, 2016 at 5:21 PM, Graham Sanderson  wrote:
> I haven’t studied the read path that carefully, but there might be a spot at
> the C* level rather than JVM level where you could effectively do a JNI
> touch of the mmap region you’re going to need next.
>
> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>
> We don’t use Azul’s Zing, but it does have the nice feature that all threads
> don’t have to reach safepoints at the same time. That said we make heavy use
> of Cassandra (with off heap memtables - not directly related but allows us a
> lot more GC headroom) and SOLR where we switched to mmap because it FAR out
> performed pread variants - in no cases have we noticed long time to safe
> point (then again our IO is lightning fast).
>
> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
>
> Linux automatically uses free memory as cache.  It's not swap.
>
> http://www.tldp.org/LDP/lki/lki-4.html
>
> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin 
> wrote:
>>
>> Sorry, I don't catch something. What page (memory) cache can exist if
>> there is no swap file.
>> Where are those page written/read?
>>
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>>
>>
>>  On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg
>> wrote 
>>
>> Hi,
>>
>> Nope I mean page cache. Linux doesn't call the cache it maintains using
>> free memory a file cache. It uses free (and some of the time not so free!)
>> memory to buffer writes and to cache recently written/read data.
>>
>> http://www.tldp.org/LDP/lki/lki-4.html
>>
>> When Linux decides it needs free memory it can either evict stuff from the
>> page cache, flush dirty pages and then evict, or swap anonymous memory out.
>> When you disable swap you only disable the last behavior.
>>
>> Maybe we are talking at cross purposes? What I meant is that increasing
>> the heap size to reduce GC frequency is a legitimate thing to do and it does
>> have an impact on the performance of the page cache even if you have swap
>> disabled?
>>
>> Ariel
>>
>>
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>
>> >Page cache is data pending flush to disk and data cached from disk.
>>
>> Do you mean file cache?
>>
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>>
>>  On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg 
>> wrote 
>>
>> Hi,
>>
>> Page cache is in use even if you disable swap. Swap is anonymous memory,
>> and whatever else the Linux kernel supports paging out. Page cache is data
>> pending flush to disk and data cached from disk.
>>
>> Given how bad the GC pauses are in C* I think it's not the high pole in
>> the tent. Until key things are off heap and C* can run with CMS and get 10
>> millisecond GCs all day long.
>>
>> You can go through tuning and hardware selection try to get more
>> consistent IO pauses and remove outliers as you mention and as a user I
>> think this is your best bet. Generally it's either bad device or filesystem
>> behavior if you get page faults taking more than 200 milliseconds O(G1 gc
>> collection).
>>
>> I think a JVM change to allow safe points around memory mapped file access
>> is really unlikely although I agree it would be great. I think the best hack
>> around it is to code up your memory mapped file access into JNI methods and
>> find some way to get that to work. Right now if you want to create a safe
>> point a JNI method is the way to do it. The problem is that JNI methods and
>> POJOs don't get along well.
>>
>> If you think about it the reason non-memory mapped IO works well is that
>> it's all JNI methods so they don't impact time to safe point. I think there
>> is a tradeoff between tolerance for outliers and performance.
>>
>> I don't know the state of the non-memory mapped path and how reliable that
>> is. If it were reliable and I couldn&

Schema Changes

2016-11-15 Thread Josh Smith
Would someone please explain how schema changes happen?

Here are some ring details: we have 5 nodes in one DC and 5 nodes in another
DC across the country.

Here is our problem: we have a tool which automates our schema creation. Our
schema consists of 7 keyspaces with 21 tables in each keyspace, so a total of
147 tables are created at initial provisioning.  During this schema creation
we end up with system_schema keyspace corruption, which we have found is due
to schema version disagreement. To combat this we set up a wait until there is
only one version across the system.local and system.peers tables.

The way I understand it, schema changes are made on the local node only and
are then propagated through either Thrift or Gossip; I could not find a
definitive answer online as to which is the carrier. So if I make all of the
schema changes on one node, it should propagate the changes to the other
nodes one at a time. This is how I used to think schema changes were
propagated, but we still get schema disagreement when changing the schema on
only one node. Is the only option to introduce a wait after every table
creation?  Should we be looking at another table besides system.local and
system.peers? Any help would be appreciated.
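The "wait until a single schema version" approach can be sketched as a small polling loop. This is a hedged sketch: the version-fetching step is injected as a callable (in practice it would select the schema version from system.local and system.peers), and some drivers also expose a built-in schema-agreement wait worth checking for before rolling your own.

```python
import time

def wait_for_schema_agreement(fetch_versions, timeout=30.0, interval=0.5):
    """Poll until every node reports the same schema version.

    fetch_versions is a callable returning the schema versions currently
    visible across system.local and system.peers (injected here so the
    logic is testable without a cluster). Returns the agreed version, or
    raises TimeoutError if the nodes still disagree at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        versions = set(fetch_versions())
        if len(versions) == 1:
            return versions.pop()
        if time.monotonic() >= deadline:
            raise TimeoutError("schema still disagrees: %s" % sorted(versions))
        time.sleep(interval)
```

Running this between each DDL statement is the programmatic equivalent of the "wait after every table creation" workaround mentioned above.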

Josh Smith


inconsistent results

2017-02-14 Thread Josh England
I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got a
situation where the same query sometimes returns 2 records (correct), and
sometimes only returns 1 record (incorrect).  I've ruled out the
application and the indexing since this is reproducible directly from a
cqlsh shell with a simple select statement.  What is the best way to debug
what is happening here?
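One way to check whether the replicas themselves disagree is to re-run the query from cqlsh at a stronger consistency level with tracing enabled: if results become stable at ALL, and the trace shows different replicas returning different data at lower levels, the replicas have diverged. A sketch of the session (cqlsh shell commands, not CQL; verify syntax against your version):

```
cqlsh> CONSISTENCY ALL;
cqlsh> TRACING ON;
cqlsh> SELECT * FROM table WHERE primary_key='foo';
```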

-JE


Re: inconsistent results

2017-02-14 Thread Josh England
All client interactions are from python (python-driver 3.7.1) using default
consistency (LOCAL_ONE I think).  Should I try repairing all nodes to make
sure all data is consistent?
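Why LOCAL_ONE can return differing results is just replica-overlap arithmetic: a read is only guaranteed to intersect a prior write when R + W > RF. A tiny illustration (the RF=3 figure is an assumption, since the thread doesn't state the keyspace's replication factor):

```python
def overlap_guaranteed(write_replicas, read_replicas, replication_factor):
    """True when a read must see the latest write: R + W > RF.

    With ONE writes and ONE reads at RF=3, 1 + 1 <= 3, so a read may land
    on a replica the write hasn't reached yet -- consistent with sometimes
    seeing 1 row instead of 2.
    """
    return read_replicas + write_replicas > replication_factor

print(overlap_guaranteed(1, 1, 3))  # ONE/ONE at RF=3       -> False
print(overlap_guaranteed(2, 2, 3))  # QUORUM/QUORUM at RF=3 -> True
```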

-JE


On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin 
wrote:

> What consistency levels are you using for reads/writes?
>
> Sent from my iPhone
>
> > On 14 Feb 2017, at 22:27, Josh England  wrote:
> >
> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got a
> situation where the same query sometimes returns 2 records (correct), and
> sometimes only returns 1 record (incorrect).  I've ruled out the
> application and the indexing since this is reproducible directly from a
> cqlsh shell with a simple select statement.  What is the best way to debug
> what is happening here?
> >
> > -JE
> >
>


Re: inconsistent results

2017-02-14 Thread Josh England
Super simple:
select * from table WHERE primary_key='foo';

-JE


On Tue, Feb 14, 2017 at 1:38 PM, sfesc...@gmail.com 
wrote:

> What is your query? I've seen this once when using secondary indices as it
> has to reach out to all nodes for the answer. If a node doesn't respond in
> time those records seem to get dropped.
>
> On Tue, Feb 14, 2017 at 1:37 PM Josh England  wrote:
>
>> All client interactions are from python (python-driver 3.7.1) using
>> default consistency (LOCAL_ONE I think).  Should I try repairing all nodes
>> to make sure all data is consistent?
>>
>> -JE
>>
>>
>> On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin 
>> wrote:
>>
>> What consistency levels are you using for reads/writes?
>>
>> Sent from my iPhone
>>
>> > On 14 Feb 2017, at 22:27, Josh England  wrote:
>> >
>> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got
>> a situation where the same query sometimes returns 2 records (correct), and
>> sometimes only returns 1 record (incorrect).  I've ruled out the
>> application and the indexing since this is reproducible directly from a
>> cqlsh shell with a simple select statement.  What is the best way to debug
>> what is happening here?
>> >
>> > -JE
>> >
>>
>>
>>


Re: inconsistent results

2017-02-14 Thread Josh England
I'll try the repair.  Using quorum tends to lead to too many timeout
problems, though.  :(

-JE


On Tue, Feb 14, 2017 at 1:39 PM, Oskar Kjellin 
wrote:

> Repair might help. But you will end up in this situation again unless you
> read/write using quorum (may be local)
>
> Sent from my iPhone
>
> On 14 Feb 2017, at 22:37, Josh England  wrote:
>
> All client interactions are from python (python-driver 3.7.1) using
> default consistency (LOCAL_ONE I think).  Should I try repairing all nodes
> to make sure all data is consistent?
>
> -JE
>
>
> On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin 
> wrote:
>
>> What consistency levels are you using for reads/writes?
>>
>> Sent from my iPhone
>>
>> > On 14 Feb 2017, at 22:27, Josh England  wrote:
>> >
>> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got
>> a situation where the same query sometimes returns 2 records (correct), and
>> sometimes only returns 1 record (incorrect).  I've ruled out the
>> application and the indexing since this is reproducible directly from a
>> cqlsh shell with a simple select statement.  What is the best way to debug
>> what is happening here?
>> >
>> > -JE
>> >
>>
>
>


Re: inconsistent results

2017-02-14 Thread Josh England
I'm sorry, yes.  The primary key is (foo_prefix, foo), with foo_prefix
being the partition key.  The query is:
select * from table WHERE foo_prefix='blah';

-JE


Re: inconsistent results

2017-02-14 Thread Josh England
I suspect this is true, but it has proven significantly harder to track down.
Either Cassandra is tickling some bug that nothing else does, or something
strange is going on internally.  On an otherwise quiet system, I'd see instant
results most of the time, intermixed with queries (reads) that would time out
and fail.  I agree this needs to be addressed, but I'd first like to
understand what is currently going on with my queries.  If it is thought to be
a consistency problem, how can that be verified?

-JE


On Tue, Feb 14, 2017 at 1:46 PM, Jonathan Haddad  wrote:

> If you're getting a lot of timeouts you will almost certainly end up with
> consistency issues. You're going to need to fix the root cause, your
> cluster instability, or this sort of issue will be commonplace.
>
>
> On Tue, Feb 14, 2017 at 1:43 PM Josh England  wrote:
>
>> I'll try it the repair.  Using quorum tends to lead to too many timeout
>> problems though.  :(
>>
>> -JE
>>
>>
>> On Tue, Feb 14, 2017 at 1:39 PM, Oskar Kjellin 
>> wrote:
>>
>> Repair might help. But you will end up in this situation again unless you
>> read/write using quorum (may be local)
>>
>> Sent from my iPhone
>>
>> On 14 Feb 2017, at 22:37, Josh England  wrote:
>>
>> All client interactions are from python (python-driver 3.7.1) using
>> default consistency (LOCAL_ONE I think).  Should I try repairing all nodes
>> to make sure all data is consistent?
>>
>> -JE
>>
>>
>> On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin 
>> wrote:
>>
>> What consistency levels are you using for reads/writes?
>>
>> Sent from my iPhone
>>
>> > On 14 Feb 2017, at 22:27, Josh England  wrote:
>> >
>> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got
>> a situation where the same query sometimes returns 2 records (correct), and
>> sometimes only returns 1 record (incorrect).  I've ruled out the
>> application and the indexing since this is reproducible directly from a
>> cqlsh shell with a simple select statement.  What is the best way to debug
>> what is happening here?
>> >
>> > -JE
>> >
>>
>>
>>
>>


Re: Cassandra 2.2.1 stuck at 100% on Windows

2015-10-16 Thread Josh McKenzie
One option: use Process Explorer to find the TIDs of the java process,
screen-cap that, then also run jstack against the running Cassandra process
out to a file a few times.

We should be able to at least link up the TID to the hex thread # in the
jstack output to figure out who/what is spinning on there.
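Matching the two outputs is mostly a base conversion: Process Explorer reports thread IDs in decimal, while jstack prints each thread's native thread id as a hex "nid". For example:

```python
def tid_to_nid(tid):
    """Convert a decimal OS thread id (as Process Explorer shows it) to the
    hexadecimal 'nid=0x...' form that appears in jstack output."""
    return "0x%x" % tid

print(tid_to_nid(4316))  # -> 0x10dc
```

A jstack thread line then looks like `... tid=0x... nid=0x10dc runnable ...`, so the converted value can be grepped for directly.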

On Fri, Oct 16, 2015 at 1:28 PM, Michael Shuler 
wrote:

> On 10/16/2015 12:02 PM, Alaa Zubaidi (PDF) wrote:
>
>> No OOM in any of the log files, and NO long GC at that time.
>> I attached the last 2 minutes before it hangs until we restart cassandra
>> after hour an half.
>>
>
> Your logs show gossip issues with some seed nodes. `nodetool gossipinfo`
> on all nodes might be an interesting place to start.
>
> --
> Michael
>


RE: handling down node cassandra 2.0.15

2015-11-16 Thread Josh Smith
Did you set JVM_OPTS to replace the address? That is usually the error I get
when I forget to set replace_address in cassandra-env.sh.

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node"


From: Anishek Agarwal [mailto:anis...@gmail.com]
Sent: Monday, November 16, 2015 9:25 AM
To: user@cassandra.apache.org
Subject: Re: handling down node cassandra 2.0.15

nope its not

On Mon, Nov 16, 2015 at 5:48 PM, sai krishnam raju potturi 
mailto:pskraj...@gmail.com>> wrote:

Is that a seed node?

On Mon, Nov 16, 2015, 05:21 Anishek Agarwal 
mailto:anis...@gmail.com>> wrote:
Hello,

We have a 3-node cluster, and one of the nodes went down due to what looks
like a hardware memory failure. We followed the steps below after the node had
been down for more than the default value of max_hint_window_in_ms.

I tried to restart Cassandra by following the steps at:

  1.  http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html
  2.  http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html

except the "clear data" part, as it was not specified in the second blog above.

I was trying to restart the same node that went down; however, I did not get
the "StorageService" messages in the log files as stated in [2]. Instead it
just tried to replay and then stopped with the error message below:

ERROR [main] 2015-11-16 15:27:22,944 CassandraDaemon.java (line 584) Exception 
encountered during startup
java.lang.RuntimeException: Cannot replace address with a node that is already 
bootstrapped

Can someone please help me if there is something i am doing wrong here.

Thanks for the help in advance.

Regards,
Anishek



Re: [RELEASE] Apache Cassandra 3.1 released

2015-12-10 Thread Josh McKenzie
Kai,


> The most stable version will be 3.1 because it includes the critical fixes
> in 3.0.1 and some additional bug fixes

3.0.1 and 3.1 are identical. This is a unique overlap specific to 3.0.1 and
3.1.

> To summarize, the most stable version should be x.Max(2n+1).z.

Going forward, you can expect the following:
3.2: new features
3.3: stabilization (built on top of 3.2)
3.4: new features
3.5: stabilization (built on top of 3.4)

And in parallel (for the 3.x major version / transition to tick-tock
transition period only):
3.0.2: bugfixes only
3.0.3: bugfixes only
3.0.4: bugfixes only
etc

*Any bugfix that goes into 3.0.X will be in the 3.X line, however not all
bugfixes in 3.X will be in 3.0.X* (bugfixes for new features introduced in
3.2, 3.4, etc will obviously not be back-ported to 3.0.X).

So, for the 3.x line:

   - If you absolutely must have the most stable version of C* and don't
   care at all about the new features introduced in even versions of 3.x, you
   want the 3.0.N release.
   - If you want access to the new features introduced in even release
   versions of 3.x (3.2, 3.4, 3.6), you'll want to run the latest odd version
   (3.3, 3.5, 3.7, etc) after the release containing the feature you want
   access to (so, if the feature's introduced in 3.4 and we haven't dropped
   3.5 yet, obviously you'd need to run 3.4).


This is only going to be the case during the transition phase from old
release cycles to tick-tock. We're targeting changes to CI and quality
focus going forward to greatly increase the stability of the odd releases
of major branches (3.1, 3.3, etc) so, for the 4.X releases, our
recommendation would be to run the highest # odd release for greatest
stability.

Hope that helps clarify.

On Thu, Dec 10, 2015 at 10:34 AM, Kai Wang  wrote:

> Paulo,
>
> Thank you for the examples.
>
> So if I go to download page and see 3.0.1, 3.1 and 3.2. The most stable
> version will be 3.1 because it includes the critical fixes in 3.0.1 and
> some additional bug fixes while doesn't have any new features introduced in
> 3.2. In that sense 3.0.1 becomes obsolete as soon as 3.1 comes out.
>
> To summarize, the most stable version should be x.Max(2n+1).z.
>
> Am I correct?
>
>
> On Thu, Dec 10, 2015 at 6:22 AM, Paulo Motta 
> wrote:
>
>> > Will 3.2 contain the bugfixes that are in 3.0.2 as well?
>>
>> If the bugfix affects both 3.2 and 3.0.2, yes. Otherwise it will only go
>> in the affected version.
>>
>> > Is 3.x.y just 3.0.x plus new stuff? Where most of the time y is 0,
>> unless there's a really serious issue that needs fixing?
>>
>> You can't really compare 3.0.y with 3.x(.y) because they're two different
>> versioning schemes.  To make it a bit clearer:
>>
>> Old model:
>> * x.y.z, where:
>>   * x.y represents the "major" version (eg: 2.1, 2.2)
>>   * z represents the "minor" version (eg: 2.1.1, 2.2.2)
>>
>> New model:
>> * a.b(.c), where:
>>   * a represents the "major" version (3, 4, 5)
>>   * b represents the "minor" version (3.1, 3.2, 4.1, etc), where:
>> * if b is even, it's a tick release, meaning it can contain both
>> bugfixes and new features.
>> * if b is odd, it's a tock release, meaning it can only contain
>> bugfixes.
>>   * c is a "subminor" optional version, which will only happen in
>> emergency situations, for example, if a critical/blocker bug is discovered
>> before the next release is out. So we probably won't have a 3.1.1, unless a
>> critical bug is discovered in 3.1 and needs urgent fix before 3.2.
>>
>> The 3.0.x series is an interim stabilization release using the old
>> versioning scheme, and will only receive bug fixes that affects it.
>>
>> 2015-12-09 18:21 GMT-08:00 Maciek Sakrejda :
>>
>>> I'm still confused, even after reading the blog post twice (and reading
>>> the linked Intel post). I understand what you are doing conceptually, but
>>> I'm having a hard time mapping that to actual planned release numbers.
>>>
>>> > The 3.0.2 will only contain bugfixes, while 3.2 will introduce new
>>> features.
>>>
>>>
>>>
>>
>
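The odd/even naming rule described in this thread reduces to a parity check on the minor version. A sketch of the rule itself, not of what any given release actually contains:

```python
def is_tock(version):
    """Tick-tock parity rule for Cassandra 3.x: odd minor versions
    (3.1, 3.3, ...) are bugfix-only 'tock' releases, while even minors
    (3.2, 3.4, ...) are feature 'tick' releases. This does not apply to
    the interim 3.0.x line, which follows the older versioning scheme."""
    minor = int(version.split(".")[1])
    return minor % 2 == 1

print(is_tock("3.3"))  # -> True  (stabilization release)
print(is_tock("3.4"))  # -> False (feature release)
```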


Re: CDC usability and future development

2018-01-31 Thread Josh McKenzie
>
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.


Philosophically, my first pass at the feature prioritized minimizing impact
to node performance first and usability second, punting a lot of the
de-duplication and RbW implications of having full column values, or
materializing stuff off-heap for consumption from a user and flagging as
persisted to disk etc, for future work on the feature. I don't personally
have any time to devote to moving the feature forward now but as Jeff
indicates, Jay and Simon are both active in the space and taking up the
torch.


On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa  wrote:

> Here's a deck of some proposed additions, discussed at one of the NGCC
> sessions last fall:
>
> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>
>
>
> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme  wrote:
>
> > Hi all,
> >
> > We are currently designing a system that allows our Cassandra clusters to
> > produce a stream of data updates. Naturally, we have been evaluating if
> CDC
> > can aid in this endeavor. We have found several challenges in using CDC
> for
> > this purpose.
> >
> > CDC provides only the mutation as opposed to the full column value, which
> > tends to be of limited use for us. Applications might want to know the
> full
> > column value, without having to issue a read back. We also see value in
> > being able to publish the full column value both before and after the
> > update. This is especially true when deleting a column since this stream
> > may be joined with others, or consumers may require other fields to
> > properly process the delete.
> >
> > Additionally, there is some difficulty with processing CDC itself such
> as:
> > - Updates not being immediately available (addressed by CASSANDRA-12148)
> > - Each node providing an independent streams of updates that must be
> > unified and deduplicated
> >
> > Our question is, what is the vision for CDC development? The current
> > implementation could work for some use cases, but is a ways from a
> general
> > streaming solution. I understand that the nature of Cassandra makes this
> > quite complicated, but are there any thoughts or desires on the future
> > direction of CDC?
> >
> > Thanks
> >
> >
>
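
[Editor's note] The per-node deduplication problem Andrew raises can be sketched client-side. This is a hedged illustration, not the CDC API: the mutation tuple shape and its (partition key, write timestamp) identity are assumptions made for the example.

```python
from itertools import chain

def dedup_cdc_streams(*node_streams):
    """Merge per-node CDC mutation streams and drop replica duplicates.

    Hypothetical mutation shape: (partition_key, write_timestamp, payload).
    With RF=3 the same mutation surfaces in up to three node streams;
    keying on (partition_key, write_timestamp) keeps a single copy.
    """
    seen = set()
    merged = []
    for mutation in chain(*node_streams):
        key = (mutation[0], mutation[1])
        if key not in seen:
            seen.add(key)
            merged.append(mutation)
    return merged

node_a = [("k1", 100, "v1"), ("k2", 105, "v2")]
node_b = [("k1", 100, "v1"), ("k3", 110, "v3")]  # k1 is a replica duplicate
print(dedup_cdc_streams(node_a, node_b))
```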


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Josh McKenzie
There's a disheartening amount of "here's where Cassandra is bad, and
here's what it needs to do for me for free" happening in this thread.

This is open-source software. Everyone is *strongly encouraged* to submit a
patch to move the needle on *any* of these things being complained about in
this thread.

For the Apache Way  to work,
people need to step up and meaningfully contribute to a project to scratch
their own itch instead of just waiting for a random corporation-subsidized
engineer to happen to have interests that align with them and contribute
that to the project.

Beating a dead horse for things everyone on the project knows are serious
pain points is not productive.

On Wed, Feb 21, 2018 at 5:45 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> >
> > >> Cluster wide management should be a big theme in any next major
> release.
> > >>
> > >Na. Stability and testing should be a big theme in the next major
> release.
> > >
> >
> > Double Na on that one Jeff.  I think you have a concern there about the
> > need to test sufficiently to ensure the stability of the next major
> > release.  That makes perfect sense.- for every release, especially the
> > major ones.  Continuous improvement is not a phase of development for
> > example.  CI should be in everything, in every phase.  Stability and
> > testing should be a part of every release, not just one.  A major release should be
> a
> > nice step from the previous major release though.
> >
>
> I guess what Jeff refers to is the tick-tock release cycle experiment,
> which has proven to be a complete disaster by popular opinion.
>
> There's also the "materialized views" feature which failed to materialize
> in the end (pun intended) and had to be declared experimental
> retroactively.
>
> Another prominent example is incremental repair which was introduced as the
> default option in 2.2 and now is not recommended to use because of so many
> corner cases where it can fail.  So again experimental as an afterthought.
>
> Not to mention that even if you are aware of the default incremental and go
> with full repair instead, you're still up for a sad surprise:
> anti-compaction will be triggered despite the "full" repair.  Because
> anti-compaction is only disabled in case of sub-range repair (don't ask
> why), so you need to use something advanced like Reaper if you want to
> avoid that.  I don't think you'll ever find this in the documentation.
>
> Honestly, for an eventually-consistent system like Cassandra anti-entropy
> repair is one of the most important pieces to get right.  And Cassandra
> fails really badly on that one: the feature is not really well designed,
> poorly implemented and under-documented.
>
> In a summary, IMO, Cassandra is a poor implementation of some good ideas.
> It is a collection of hacks, not features.  They sometimes play together
> accidentally, and rarely by design.
>
> Regards,
> --
> Alex
>
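
[Editor's note] The sub-range trick Alex mentions (the one Reaper automates) amounts to splitting the partitioner's token range and repairing each slice separately. A hedged sketch, assuming the Murmur3 token space; the printed nodetool invocations are illustrative output, not executed:

```python
def subranges(start: int, end: int, n: int):
    """Split the token range (start, end] into n contiguous sub-ranges --
    the shape of repair that skips anticompaction, as noted above."""
    step = (end - start) // n
    bounds = [start + i * step for i in range(n)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# Murmur3Partitioner's token space is [-2**63, 2**63 - 1].
for st, et in subranges(-2**63, 2**63 - 1, 4):
    # Illustrative command; -st/-et are nodetool repair's sub-range flags.
    print(f"nodetool repair -st {st} -et {et} my_keyspace")
```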


Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Josh McKenzie
Might help, organizationally, to put all these efforts under a single
ticket of "Improve web site Documentation" and add these as sub-tasks.
Should be able to do that translation post-creation (i.e. in its current
state) if that's something that makes sense to you.

On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Here are the related JIRA’s.  Please add content even if It’s not well
> formed compositionally.  Myself or someone else will take it from there.
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The
> troubleshooting section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading
> web page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web
> page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page
> in the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair
> web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data Modeling
> section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The
> Architecture:Guarantees web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14267  The Dynamo web
> page on the Apache Cassandra site is missing content
>
> https://issues.apache.org/jira/browse/CASSANDRA-14266  The Architecture
> Overview web page on the Apache Cassandra site is empty
>
>
>
> Thanks for pitching in.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> *Sent:* Monday, February 26, 2018 1:54 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> Nice!  Thanks for the help Oliver!
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Oliver Ruebenacker [mailto:cur...@gmail.com]
> *Sent:* Sunday, February 25, 2018 7:12 AM
> *To:* user@cassandra.apache.org
> *Cc:* d...@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
>
>
>  Hello,
>
>   I have some slides about Cassandra, feel free to borrow.
>
>  Best, Oliver
>
>
>
> On Fri, Feb 23, 2018 at 7:28 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> These nine web pages on the Apache Cassandra web site have blank To Do
> sections.  Most of the web pages are completely blank.  Mind you there is a
> lot of hard work already done on the documentation.  I’ll make JIRA’s for
> any of the blank sections where there is not already a JIRA.  Then it will
> be on to writing up those sections.  *If you have any text to help me get
> started for any of these sections that would be really cool. *
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/overview.html
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/dynamo.html
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/guarantees.html
>
>
>
> http://cassandra.apache.org/doc/latest/data_modeling/index.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/read_repair.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/hints.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/backups.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/bulk_loading.html
>
>
>
> http://cassandra.apache.org/doc/latest/troubleshooting/index.html
>
>
>
> Kenneth Brotman
>
>
>
>
>
>
> --
>
> Oliver Ruebenacker
>
> Senior Software Engineer, Diabetes Portal
> , Broad Institute
> 
>
>
>


RE: opscenter with community cassandra

2014-10-28 Thread Josh Smith
Yes, OpsCenter does work with the open-source version of Cassandra. I am 
currently running it in both the cloud and our private datacenter with no 
problems. I have not tried 2.1.1 yet, but I do not see why it wouldn’t work there either.

Josh

From: Tim Dunphy [mailto:bluethu...@gmail.com]
Sent: Tuesday, October 28, 2014 10:43 AM
To: user@cassandra.apache.org
Subject: opscenter with community cassandra

Hey all,

 I'd like to set up DataStax OpsCenter to monitor my Cassandra ring. However, I'm 
using the open-source version, 2.1.1. Before I expend any time and effort 
on setting this up, I'm wondering if it will work with the open-source version, 
or whether I would need to be running DataStax Cassandra to get this going.

Thanks
Tim

--
GPG me!!

gpg --keyserver pool.sks-keyservers.net<http://pool.sks-keyservers.net> 
--recv-keys F186197B


Re: Cassandra 2.1.3, Windows 7 clear snapshot

2015-02-26 Thread Josh McKenzie
This should be fixed in 3.0 by a combination of
https://issues.apache.org/jira/browse/CASSANDRA-8709 and
https://issues.apache.org/jira/browse/CASSANDRA-4050.

The changes in 8709 and 4050 are invasive enough that we didn't want to
target them for the 2.1 release and is actually a big part of why we
consider 2.1.X beta on Windows.


On Thu, Feb 26, 2015 at 1:48 AM, Fredrik Larsson Stigbäck <
fredrik.l.stigb...@sitevision.se> wrote:

> What is the current status of clearing snapshots on Windows?
> When running Cassandra 2.1.3, trying manually to run clearSnapshot I get:
> "FSWriteError… Caused by: java.nio.file.FileSystemException… File is used
> by another process"
>
> I know there’s been numerous issues in JIRA trying to fix similar problems
> e.g.
> https://issues.apache.org/jira/browse/CASSANDRA-6283
>
> Are there any outstanding issues in 2.1.3 which specifically pinpoints
> manually clearing snapshots on Windows?
>
> Regards
> Fredrik
>



-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company


Apache Cassandra 2.2.0-rc1: calling all Windows users

2015-06-09 Thread Josh McKenzie
With the upcoming release of Cassandra-2.2 Windows is finally an officially
supported operating system. While many months of JIRA tickets,
bug fixes, and contributions have gone into making Cassandra on Windows as
seamless of an experience as possible, we need your help as the community
to further kick the tires and let us know if you run into any problems.

Please let us know if
you find anything that's not working on the platform so we can keep
Cassandra on Windows running strong!

-- 
Joshua McKenzie


Schema disagreement under normal conditions, ALTER TABLE hangs

2013-11-25 Thread Josh Dzielak
Recently we had a strange thing happen. Altering schema (gc_grace_seconds) for 
a column family resulted in a schema disagreement. 3/4 of nodes got it, 1/4 
didn't. There was no partition at the time, nor were multiple schema 
updates issued. Going to the nodes with stale schema and trying to do the ALTER 
TABLE there resulted in hanging. We were eventually able to get schema 
agreement by restarting nodes, but both the initial disagreement under normal 
conditions and the hanging ALTER TABLE seem pretty weird. Any ideas here? Sound 
like a bug?  

We're on 1.2.8.

Thanks,
Josh

--
Josh Dzielak • Keen IO • @dzello (https://twitter.com/dzello)



Re: Schema disagreement under normal conditions, ALTER TABLE hangs

2013-11-28 Thread Josh Dzielak
Thanks Rob. Let me add one thing in case someone else finds this thread - 

Restarting the nodes did not in and of itself get the schema disagreement 
resolved. We had to run the ALTER TABLE command individually on each of the 
disagreeing nodes once they came back up. 

On Tuesday, November 26, 2013 at 11:24 AM, Robert Coli wrote:

> On Mon, Nov 25, 2013 at 6:42 PM, Josh Dzielak  (mailto:j...@keen.io)> wrote:
> > Recently we had a strange thing happen. Altering schema (gc_grace_seconds) 
> > for a column family resulted in a schema disagreement. 3/4 of nodes got it, 
> > 1/4 didn't. There was no partition at the time, nor was there multiple 
> > schema updates issued. Going to the nodes with stale schema and trying to 
> > do the ALTER TABLE there resulted in hanging. We were eventually able to 
> > get schema agreement by restarting nodes, but both the initial disagreement 
> > under normal conditions and the hanging ALTER TABLE seem pretty weird. Any 
> > ideas here? Sound like a bug? 
> 
> Yes, that sounds like a bug. This behavior is less common in 1.2.x than it 
> was previously, but still happens sometimes. It's interesting that restarting 
> the affected node helped, in previous versions of "hung schema" issue, it 
> would survive restart. 
>  
> > We're on 1.2.8.
> > 
> 
> 
> Unfortunately, unless you have a repro path, it is probably not worth 
> reporting a JIRA. 
> 
> =Rob
>  
> 
> 
> 
> 
> 




sstable2json hangs for authenticated keyspace?

2013-11-29 Thread Josh Dzielak
Having an issue with sstable2json. It appears to hang when I run it against an 
SSTable that's part of a keyspace with authentication turned on. Running it 
against any other keyspace works, and as far as I can tell the only difference 
between the keyspaces is authentication. Has anyone run into this?

Thanks,
Josh


Re: sstable2json hangs for authenticated keyspace?

2013-12-04 Thread Josh Dzielak
Thanks Rob. Bug filed. 

https://issues.apache.org/jira/browse/CASSANDRA-6450 


On Monday, December 2, 2013 at 1:06 PM, Robert Coli wrote:

> On Fri, Nov 29, 2013 at 4:11 PM, Josh Dzielak  (mailto:j...@keen.io)> wrote:
> > Having an issue with sstable2json. It appears to hang when I run it against 
> > an SSTable that's part of a keyspace with authentication turned on. Running 
> > it against any other keyspace works, and as far as I can tell the only 
> > difference between the keyspaces is authentication. Has anyone run into 
> > this? 
> 
> This is probably a bug. I would file a JIRA. It's unclear whether this 
> "should" work, but I can't see any reason why not. At very least, it should 
> do something other than hang forever. :) 
> 
> =Rob
>  
> 
> 
> 
> 
> 




Notes and questions from performing a large delete

2013-12-04 Thread Josh Dzielak
We recently had a little Cassandra party I wanted to share and see if anyone 
has notes to compare. Or can tell us what we did wrong or what we could do 
better. :) Apologies in advance for the length of the narrative here.

Task at hand: Delete about 50% of the rows in a large column family (~8TB) to 
reclaim some disk. These rows are used only for intermediate storage.

Sequence of events:

- Issue the actual deletes. This, obviously, was super-fast.
- Nothing happens yet, which makes sense. New tombstones are not immediately 
compacted b/c of gc_grace_seconds.
- Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.

- Every node started working very hard. We saw disk space start to free up. It 
was exciting.
- Eventually the compactions finished and we had gotten a ton of disk back. 
- However, our SSTables were now 5Mb, not 256Mb as they had always been :(
- We inspected the schema in CQL/Opscenter etc and sure enough 
sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were set 
at 256Mb, and all other CF's still were.

- At 5Mb we had a huge number of SSTables. Our next goal was to get these 
tables back to 256Mb.
- First step was to update the schema back to 256Mb.
- Figuring out how to do this in CQL was tricky, because CQL has gone through a 
lot of changes recently and getting the docs for your version is hard. 
Eventually we figured it out - ALTER TABLE events WITH 
compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
- Out of our 12 nodes, 9 acknowledged the update. The others showed the old 
schema still.
- The remaining 3 would not. There was no extra load on the systems, 
operational status was very clean. All nodes could see each other.
- For each of the remaining 3 we tried to update the schema through a local 
cqlsh session. The same ALTER TABLE would just hang forever.
- We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE 
again. It worked this time. We finally had schema agreement.

- Starting with just 1 node, we kicked off upgradesstables, hoping it would 
rebuild the 5Mb tables to 256Mb tables.
- Nothing happened. This was (afaik) because the sstable size change doesn't 
represent a new version of schema for the sstables. So existing tables are 
ignored.
- We discovered the "-a" option for upgradesstables, which tells it to skip the 
schema check and just do all the tables anyway.
- We ran upgradesstables -a and things started happening. After a few hours the 
pending compactions finished.
- Sadly, this node was now using 3x the disk it previously had. Some sstables 
were now 256Mb, but not all. There were tens of thousands of ~20Mb tables.
- A direct comparison to other nodes owning the same % of the ring showed both 
the same number of sstables and the same ratio of 256Mb+ tables to small 
tables. However, on a 'normal' node the small tables were all 5-6Mb and on the 
fat, upgraded node, all the tables were 20Mb+. This was why the fat node was 
taking up 3x disk overall.
- I tried to see what was in those 20Mb files relative to the 5Mb ones but 
sstable2json failed against our authenticated keyspace. I filed a bug 
(https://issues.apache.org/jira/browse/CASSANDRA-6450). 
- Had little choice here. We shut down the fat node, did a manual delete of 
sstables, brought it back up and did a repair. It came back to the right size.

TL;DR / Our big questions are:
How could the schema have spontaneously changed from 256Mb sstable_size_in_mb 
to 5Mb?
How could schema propagation have failed such that only 9 of 12 nodes got the change 
even when the cluster was healthy? Why did updating the schema locally hang until 
restart?
What could have happened inside of upgradesstables that created the node with 
the same ring % but 3x disk load?

We're on Cassandra 1.2.8, Java 6, Ubuntu 12. Running on SSD's, 12 node cluster 
across 2 DCs. No compression, leveled compaction. Happy to provide more 
details. Thanks in advance for any insights into what happened or any best 
practices we missed during this episode.

Best,
Josh
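
[Editor's note] The sstable_size_in_mb reset described above is what happens when the WITH compaction map is re-issued without all of its previous options: the new map replaces the old one wholesale. A hedged sketch of the defensive pattern, merging the existing options before emitting the statement (the table name and extra option are illustrative):

```python
def alter_compaction(current_options: dict, changes: dict) -> str:
    """Build an ALTER TABLE statement that re-states *every* compaction
    option, since the WITH clause replaces the whole map -- omitting
    sstable_size_in_mb silently resets it to the default."""
    merged = {**current_options, **changes}
    opts = ", ".join(f"'{k}': '{v}'" for k, v in sorted(merged.items()))
    return f"ALTER TABLE events WITH compaction = {{{opts}}};"

current = {"class": "LeveledCompactionStrategy", "sstable_size_in_mb": "256"}
# Adding one option still carries the size setting along:
print(alter_compaction(current, {"tombstone_threshold": "0.1"}))
```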


Re: Notes and questions from performing a large delete

2013-12-07 Thread Josh Dzielak
Thanks Nate. I hadn't noticed that and it definitely explains it.

It'd be nice to see that called out much more clearly. As we found out, the 
implications can be severe!

-Josh 


On Thursday, December 5, 2013 at 11:30 AM, Nate McCall wrote:

> Per the 256mb to 5mb change, check the very last section of this page:
> http://www.datastax.com/documentation/cql/3.0/webhelp/cql/cql_reference/alter_table_r.html
> 
> "Changing any compaction or compression option erases all previous compaction 
> or compression settings."
> 
> In other words, you have to include the whole 'WITH' clause each time - in 
> the future just grab the output from 'show schema' and add/modify as needed. 
> 
> I did not know this either until it happened to me as well - could probably 
> stand to be a little bit more front-and-center, IMO. 
> 
> 
> On Wed, Dec 4, 2013 at 2:59 PM, Josh Dzielak  (mailto:j...@keen.io)> wrote:
> > We recently had a little Cassandra party I wanted to share and see if 
> > anyone has notes to compare. Or can tell us what we did wrong or what we 
> > could do better. :) Apologies in advance for the length of the narrative 
> > here. 
> > 
> > Task at hand: Delete about 50% of the rows in a large column family (~8TB) 
> > to reclaim some disk. These are rows are used only for intermediate storage.
> > 
> > Sequence of events: 
> > 
> > - Issue the actual deletes. This, obviously, was super-fast.
> > - Nothing happens yet, which makes sense. New tombstones are not 
> > immediately compacted b/c of gc_grace_seconds.
> > - Adjust gc_grace_seconds down to 60 from 86400 using ALTER TABLE in CQL.
> > 
> > - Every node started working very hard. We saw disk space start to free up. 
> > It was exciting.
> > - Eventually the compactions finished and we had gotten a ton of disk back. 
> > - However, our SSTables were now 5Mb, not 256Mb as they had always been :(
> > - We inspected the schema in CQL/Opscenter etc and sure enough 
> > sstable_size_in_mb had changed to 5Mb for this CF. Previously all CFs were 
> > set at 256Mb, and all other CF's still were.
> > 
> > - At 5Mb we had a huge number of SSTables. Our next goal was to get these 
> > tables back to 256Mb. 
> > - First step was to update the schema back to 256Mb.
> > - Figuring out how to do this in CQL was tricky, because CQL has gone 
> > through a lot of changes recently and getting the docs for your version is 
> > hard. Eventually we figured it out - ALTER TABLE events WITH 
> > compaction={'class':'LeveledCompactionStrategy','sstable_size_in_mb':256};
> > - Out of our 12 nodes, 9 acknowledged the update. The others showed the old 
> > schema still.
> > - The remaining 3 would not. There was no extra load was on the systems, 
> > operational status was very clean. All nodes could see each other.
> > - For each of the remaining 3 we tried to update the schema through a local 
> > cqlsh session. The same ALTER TABLE would just hang forever.
> > - We restarted Cassandra on each of the 3 nodes, then did the ALTER TABLE 
> > again. It worked this time. We finally had schema agreement.
> > 
> > - Starting with just 1 node, we kicked off upgradesstables, hoping it would 
> > rebuild the 5Mb tables to 256Mb tables.
> > - Nothing happened. This was (afaik) because the sstable size change 
> > doesn't represent a new version of schema for the sstables. So existing 
> > tables are ignored.
> > - We discovered the "-a" option for upgradesstables, which tells it to skip 
> > the schema check just and just do all the tables anyway.
> > - We ran upgradesstables -a and things started happening. After a few hours 
> > the pending compactions finished.
> > - Sadly, this node was now using 3x the disk it previously had. Some 
> > sstables were now 256Mb, but not all. There were tens of thousands of ~20Mb 
> > tables.
> > - A direct comparison to other nodes owning the same % of the ring showed 
> > both the same number of sstables and the same ratio of 256Mb+ tables to 
> > small tables. However, on a 'normal' node the small tables were all 5-6Mb 
> > and on the fat, upgraded node, all the tables were 20Mb+. This was why the 
> > fat node was taking up 3x disk overall.
> > - I tried to see what was in those 20Mb files relative to the 5Mb ones but 
> > sstable2json failed against our authenticated keyspace. I filed a bug 
> > (https://issues.apache.org/jira/browse/CASSANDRA-6450). 
> > - Had little choice here. We shut down the fat node, did a manual delete of 
> > sstab

Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-20 Thread Josh Dzielak
Is there a way to include *multiple* column names in a slice query where only 
one component of the composite column name key needs to match?  

For example, if this was a single row -

username:0 | username:1 | city:0 | city:1 | other:0 | other:1
-------------------------------------------------------------
bob        | ted        | sf     | nyc    | foo     | bar

I can do a slice with "username:0" and "city:1" or any fully identified column 
names. I also can do a range query w/ first component equal to "username", and 
set the bounds for the second component of the key to +/- infinity (or \u0000 
to \uffff for utf8), and get all columns back that start with "username".

But what if I want to get all usernames and all cities? Without composite keys 
this would be easy - just slice on a collection of column names - ["username", 
"city"]. With composite column names it would have to look something like 
["username:*", "city:*"], where * represents a wildcard or a range.

My questions –

1) Is this supported in the Thrift interface or CQL?
2) If not, is there clever data modeling or indexing that could accomplish this 
use case? 1 single-row round-trip to get these columns?
3) Are there plans to support this in the future? Generally, what is the future 
of composite columns in a CQL world?

Thanks!
Josh
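
[Editor's note] Absent server-side support, the closest workaround is one bounded slice per prefix (or one wide read) plus a client-side filter on the first composite component. A minimal sketch of that filter over the example row above; the dict-of-columns shape is an assumption made for illustration:

```python
def slice_by_first_component(row: dict, wanted: set) -> dict:
    """Emulate ["username:*", "city:*"] client-side: fetch the columns
    (one bounded slice per prefix, or the whole row) and keep those
    whose first composite component is in `wanted`."""
    return {name: value
            for name, value in row.items()
            if name.split(":", 1)[0] in wanted}

row = {"username:0": "bob", "username:1": "ted",
       "city:0": "sf", "city:1": "nyc",
       "other:0": "foo", "other:1": "bar"}
print(slice_by_first_component(row, {"username", "city"}))
```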



Re: Multi-Column Slice Query w/ Partial Component of Composite Key

2013-12-20 Thread Josh Dzielak
Thanks Nate.  

I will take a look at extending thrift, seems like this could be useful for 
some folks.  


On Friday, December 20, 2013 at 12:29 PM, Nate McCall wrote:

> >  
> > My questions –
> >  
> > 1) Is this supported in the Thrift interface or CQL?
>  
> Not directly, no.  
>   
> > 2) If not, is there clever data modeling or indexing that could accomplish 
> > this use case? 1 single-row round-trip to get these columns?
> >  
>  
>  
> If this is a query done frequently you could prefix both columns with a 
> > static value, eg. ["foo:username...", "foo:city...", "bar:other_column:..."] 
> so in this specific case you look for 'foo:*'  
>   
> > 3) Is there plans to support this in the future? Generally, what is the 
> > future of composite columns in a CQL world?
> >  
>  
> You can always extend cassandra.thrift and add a custom method (not as hard 
> as it sounds - Thrift is designed for this). Side note: DataStax Enterprise 
> works this way for reading the CassandraFileSystem blocks. An early 
> prototype:  
> https://github.com/riptano/brisk/blob/master/interface/brisk.thrift#L68-L80  
>  
>  
>  
> --  
> -
> Nate McCall
> Austin, TX
> @zznate
>  
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com  



Can't write to row key, even at ALL. Tombstones?

2013-12-27 Thread Josh Dzielak
We have a few row keys that aren’t taking any writes, even using both the ALL 
consistency level to read and write. We can’t insert anything into any column, 
previously existing or new, using the simplest possible heuristic in cqlsh or 
cassandra-cli.  

Our suspicion is that we somehow have a row level tombstone that is 
future-dated and has not gone away (we’ve lowered gc_grace_seconds in hope that 
it’d get compacted, but no luck so far, even though the sstables that hold the 
row key have all cycled since).

How can we make this row key writeable again? Or even better, what can we do to 
debug this? Is there a way to get Cassandra to log all of the read candidates, 
including timestamp and host and sstable, before it chooses the one to use?  

Thanks as always,
-Josh
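
[Editor's note] For anyone hitting a similar case: under last-write-wins reconciliation, a tombstone carrying a future timestamp beats every later write until the wall clock catches up, which matches the symptoms above. A hedged, pure-Python sketch of the rule (cell shape and timestamps are illustrative):

```python
def reconcile(cells):
    """Last-write-wins sketch: keep the cell with the highest write
    timestamp; on a timestamp tie, a tombstone wins. A tombstone
    stamped in the future therefore shadows all normal writes."""
    return max(cells, key=lambda c: (c["ts"], c["kind"] == "tombstone"))

future_ts = 9_999_999_999_000_000          # microseconds, far in the future
cells = [
    {"kind": "tombstone", "ts": future_ts, "value": None},
    {"kind": "data", "ts": 1_388_000_000_000_000, "value": "new write"},
]
winner = reconcile(cells)
print(winner["kind"])  # the future-dated tombstone still wins
```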



ccm support for Windows

2014-03-14 Thread Josh McKenzie
As of today, ccm supports Windows.  It
should work in both cygwin and the general command-prompt though there are
some known issues right now which are documented in the README.

If any Windows users are so inclined to test or tinker I'd be happy to
field questions / concerns / fix bugs; feel free to email me directly about
it.

-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company


Re: ccm support for Windows

2014-03-14 Thread Josh McKenzie
The windows dtests take another pull request - this one incredibly minor -
to fix some of the odd pathing in Windows.  I'll get that in today.


On Fri, Mar 14, 2014 at 1:29 PM, Robert Coli  wrote:

> On Fri, Mar 14, 2014 at 11:23 AM, Josh McKenzie <
> josh.mcken...@datastax.com> wrote:
>
>> As of today, ccm <https://github.com/pcmanus/ccm>supports Windows.  It
>> should work in both cygwin and the general command-prompt though there are
>> some known issues right now which are documented in the README.
>>
>> If any Windows users are so inclined to test or tinker I'd be happy to
>> field questions / concerns / fix bugs; feel free to email me directly about
>> it.
>>
>
> As this enables Windows dtests, I find this quite exciting!
> Congratulations and thanks!
>
> =Rob
>
>


-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company


Re: ccm support for Windows

2014-03-14 Thread Josh McKenzie
dtest changes are merged in.  Have fun with that Rob.  ;)

The few I poked at weren't looking clean on Windows - there may be some
timing / wait issues in ccm that aren't playing nice but it's a step in the
right direction.  I'm hoping to overhaul the Windows launching process
before moving on to tightening up ccm and then fixing unit tests and dtests.


On Fri, Mar 14, 2014 at 2:03 PM, Josh McKenzie
wrote:

> The windows dtests take another pull request - this one incredibly minor -
> to fix some of the odd pathing in Windows.  I'll get that in today.
>
>
> On Fri, Mar 14, 2014 at 1:29 PM, Robert Coli  wrote:
>
>> On Fri, Mar 14, 2014 at 11:23 AM, Josh McKenzie <
>> josh.mcken...@datastax.com> wrote:
>>
>>> As of today, ccm <https://github.com/pcmanus/ccm>supports Windows.  It
>>> should work in both cygwin and the general command-prompt though there are
>>> some known issues right now which are documented in the README.
>>>
>>> If any Windows users are so inclined to test or tinker I'd be happy to
>>> field questions / concerns / fix bugs; feel free to email me directly about
>>> it.
>>>
>>
>> As this enables Windows dtests, I find this quite exciting!
>> Congratulations and thanks!
>>
>> =Rob
>>
>>
>
>
> --
> Joshua McKenzie
> DataStax -- The Apache Cassandra Company
>



-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company


Re: Windows uname -o not supported.

2014-07-01 Thread Josh McKenzie
That uname call is in there specifically to check if you're running from
cygwin and force a pidfile to be written for stop-server.bat if so; ctrl+c
functionality in mintty is inconsistent with its signal trapping across
both cygwin and Windows releases.

I wrote the PowerShell start-up scripts with the expectation that they
would be called from the .bat files so please give it a shot running the
batch files from a command-prompt or mintty terminal (or whatever your
terminal of choice is).  I'll look into what's broken when running directly
from PowerShell as this may be a more common use-case than I realized.

Along with that - are you running a 32-bit JVM or 64-bit here?

-Xms3072M -Xmx3072M -Xmn768M

would lead to that kind of failure on a 32-bit JVM, though CASSANDRA-7353
 should have resolved
that potential problem.  Might be related to the environment in PowerShell
rather than a command-prompt as well.


On Mon, Jun 30, 2014 at 5:11 AM, Lars Schouw  wrote:

> How do I start cassandra on Windows? And what does my environment have to
> look like?
>
> I am getting an error when starting cassandra on Windows... uname -o not
> supported.
> I am using uname (GNU sh-utils) 2.0
> I am not running in cygwin but just a pure powershell Window.
>
> My Cassandra version comes directly from git
>
> Here is what I tried:
>
> D:\dev\3rdparty\cassandra\cassandra [trunk +1 ~0 -0 !]>
> .\bin\cassandra.bat -v -f
> Detected powershell execution permissions.  Running with enhanced startup
> scripts.
> Sourcing cassandra config file:
> D:/dev/3rdparty/cassandra/cassandra/conf/cassandra-env.ps1
> Setting up Cassandra environment
> Starting cassandra server
> Running cassandra with: [java.exe
>  -javaagent:"D:\dev\3rdparty\cassandra\cassandra\lib\jamm-0.2.6.jar" -ea
> -Dlog4j.defa
> ultInitOverride=true -XX:+CMSClassUnloadingEnabled
> -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3072M -Xmx3
> 072M -Xmn768M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
> -XX:StringTableSize=103 -XX:+UseParNewGC -XX:+UseConcMarkSwe
> epGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
>  -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
> -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=71
> 99 -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false  -Dlog4j.configuration=lo
> g4j-server.properties -Dcassandra -Dlogback.configurationFile=logback.xml
> -Dcassandra.logdir="D:\dev\3rdparty\cassandra
> \cassandra/logs"
> -Dcassandra.storagedir="D:\dev\3rdparty\cassandra\cassandra/data" -cp
> "D:\dev\3rdparty\cassandra\cassa
>
> ndra\conf";"D:/dev/3rdparty/cassandra/cassandra/lib/airline-0.6.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/antlr-run
>
> time-3.5.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/commons-cli-1.1.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/
>
> commons-codec-1.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/commons-lang3-3.1.jar";"D:/dev/3rdparty/cassandra/cassa
>
> ndra/lib/commons-math3-3.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/compress-lzf-0.8.4.jar";"D:/dev/3rdparty/cassa
>
> ndra/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/disruptor-3.0.1.jar";"
>
> D:/dev/3rdparty/cassandra/cassandra/lib/guava-16.0.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/high-scale-lib-1.0.6.j
>
> ar";"D:/dev/3rdparty/cassandra/cassandra/lib/jackson-core-asl-1.9.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/jacks
>
> on-mapper-asl-1.9.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/jamm-0.2.6.jar";"D:/dev/3rdparty/cassandra/cassandra/
>
> lib/javax.inject.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/jbcrypt-0.3m.jar";"D:/dev/3rdparty/cassandra/cassandra/l
>
> ib/jline-1.0.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/jna-4.0.0.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/json
>
> -simple-1.1.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/libthrift-0.9.1.jar";"D:/dev/3rdparty/cassandra/cassandra/lib
>
> /logback-classic-1.1.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/logback-core-1.1.2.jar";"D:/dev/3rdparty/cassandra
>
> /cassandra/lib/lz4-1.2.0.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/metrics-core-2.2.0.jar";"D:/dev/3rdparty/cassand
>
> ra/cassandra/lib/netty-all-4.0.20.Final.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/reporter-config-2.1.0.jar";"D:/de
>
> v/3rdparty/cassandra/cassandra/lib/slf4j-api-1.7.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/snakeyaml-1.11.jar";"D
>
> :/dev/3rdparty/cassandra/cassandra/lib/snappy-java-1.0.5.1.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/stream-2.5.2.j
>
> ar";"D:/dev/3rdparty/cassandra/cassandra/lib/stringtemplate-4.0.2.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/super-c
>
> sv-2.1.0.jar";"D:/dev/3rdparty/cassandra/cassandra/lib/thrift-server-0.3.5.jar";"D:\dev\3rdparty\cassandra\cassandra\bu
> ild\classes\main";"D:\dev\3rdparty\cassandra\cassa

Windows-aware Cassandra

2014-07-11 Thread Josh McKenzie
With the release of Cassandra 2.1.0-rc3, the Cassandra team would like to
open the doors to widespread testing of Cassandra on Windows.  As of this
release we have most of the platform-specific kinks ironed out and would
like to get this into the hands of more developers and users and collect
your feedback.

Please take Cassandra for a spin on the platform, kick the tires, and let
us know of any problems or frustrations you run into.  Some highlights of
what we've been working on as well as some known issues with the 2.X line
of releases on Windows:

*2.1.0:*

   - Launch-script platform parity - re-written in PowerShell
  - Heap size determination based on system memory
  - Runtime override of cassandra-env.ps1 available
  - True background daemon start-up option
  - Graceful shutdown sending ctrl+c to running process via
  stop-server.bat
  - Heap dump and error file output redirection
  - Variety of tuning parameters available in conf/cassandra-env.ps1
  - Based on PowerShell 2.0 - works on both Win7/2008 and Win8/2012
   - Tightened up and constrained some features that were causing
   inconsistent file-system behavior on Windows
   - Cleaned up platform-specific issues uncovered by unit tests

*Known issues:*

   - Snapshot-based repair is disabled until version 3.0 due to java file
   creation flags pre-nio.2 (java bug reference) (CASSANDRA-6907)
   - Attempts to delete snapshots will throw IOExceptions if SSTableReaders
   have segments of the original SSTables open

*Features deferred to 3.0:*

   - nio.2-based File I/O for FILE_SHARE_DELETE flag on files
   (CASSANDRA-4050)
   - Snapshot-based repair (CASSANDRA-6907)
   - Pre-emptive opening of compaction results (CASSANDRA-6916 /
   CASSANDRA-7365)


*JIRA Reference*:
Open Windows tickets

Completed Windows JIRA tickets up to release 2.1.0-rc3

Completed Windows JIRA tickets release 3.0


-- 
Joshua McKenzie
DataStax -- The Apache Cassandra Company


Using sstableloader

2012-02-02 Thread Josh Behrends
I'm new to administering Cassandra so please be kind!

I've been tasked with upgrading a .6 cluster to 1.0.7.  In doing this I
need a rollback plan in case things go sideways since my window for the
upgrade is fairly small.  So we've decided to stand up a brand new cluster
running 1.0.7 and then stop the old cluster, snapshot the data, and then
move it over to the new cluster.

So I need to know, can I use sstableloader to take the snapshot data from
my .6 cluster and stream it into my new 1.0.7 cluster?  I should also note
that the new cluster MAY NOT have the same number of nodes either.

Josh


Guaranteeing globally unique TimeUUID's in a high throughput distributed system

2013-03-16 Thread Josh Dzielak
I have a system where a client sends me arbitrary JSON events containing a 
timestamp at millisecond resolution. The timestamp is used to generate column 
names of type TimeUUIDType.

The problem I run into is this - if a client sends me 2 events with the same 
timestamp, the TimeUUID that gets generated for each is the same, and we get 1 
insert and 1 update instead of 2 inserts. I might be running many processes (in 
my case Storm supervisors) on the same node, so the machine-specific part of 
the UUID doesn't help.

I have noticed how the Cassandra UUIDGen class lets you work around this. It 
has a 'createTimeSafe' method that adds extra precision to the timestamp such 
that you can actually get up to 10k unique UUID's for the same millisecond. 
That works pretty well for a single process (although it's still possible to go 
over 10k, it's unlikely in our actual production scenario). It does make 
searches at boundary conditions a little unpredictable – 'equal' may or may not 
work depending on whether extra ns intervals were added – but I can live with 
that.

However, this still leaves vulnerability across a distributed system. If 2 
events arrive in 2 processes at the exact same millisecond, one will overwrite 
the other. If events keep flowing to each process evenly over the course of the 
millisecond, we'll be left with roughly half the events we should have. To work 
around this, I add a distinct 'component id' to my row keys that roughly 
equates to a Storm worker or a JVM process I can cheaply synchronize.

The real problem is that this trick of adding ns intervals only works when you 
are generating timestamps from the current time (or any time that's always 
increasing). As I mentioned before, my client might be providing a past or 
future timestamp, and I have to find a way to make sure each one is unique.

For example, a client might send me 10k events with the same millisecond 
timestamp today, and 10k again tomorrow. Using the standard Java library stuff 
to generate UUID's, I'd end up with only 1 event stored, not 20,000. The 
warning in UUIDGen.getTimeUUIDBytes is clear about this.  

Adapting the ns-adding 'trick' to this problem requires synchronized external 
state (i.e. storing that the current ns interval for millisecond 12330982383 is 
1234, etc) - definitely a non-starter.

So, my dear, and far more seasoned Cassandra users, do you have any suggestions 
for me?  

Should I drop TimeUUID altogether and just make column names a combination of 
millisecond and a big enough random part to be safe? e.g. 
'1363467790212-a6c334fefda'. Would I be able to run proper slice queries if I 
did this? What other problems might crop up? (It seems too easy :)  

Or should I just create a normal random UUID for every event as the column key 
and create the non-unique index by time in some other way?  

Would appreciate any thoughts, suggestions, and off-the-wall ideas!  

PS- I assume this could be a problem in any system (not just Cassandra) where 
you want to use 'time' as a unique index yet might have multiple records for 
the same time. So any solutions from other realms could be useful too.   

--
Josh Dzielak 
VP Engineering • Keen IO
Twitter • @dzello (https://twitter.com/dzello)
Mobile • 773-540-5264
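
A rough sketch of the createTimeSafe idea described above (single-process only; the real UUIDGen also encodes node and clock-sequence bits and offsets the epoch, none of which is shown here):

```python
# Illustrative only: mint up to 10k distinct UUID-style timestamps per
# millisecond by spending the 100 ns ticks that make up one millisecond.
_last = {"ms": -1, "ticks_used": 0}

def unique_uuid_timestamp(ms):
    """Return a 100ns-resolution timestamp, unique per call within one ms."""
    if ms == _last["ms"]:
        _last["ticks_used"] += 1          # same ms: spend another 100 ns tick
        if _last["ticks_used"] >= 10_000:
            raise RuntimeError("more than 10k ids in one millisecond")
    else:
        # NOTE: a *past* millisecond resets the counter, reproducing the
        # collision problem described above for client-supplied timestamps.
        _last["ms"] = ms
        _last["ticks_used"] = 0
    return ms * 10_000 + _last["ticks_used"]

first = unique_uuid_timestamp(1363467790212)
second = unique_uuid_timestamp(1363467790212)
```

Note that this only stays unique while the input milliseconds never go backwards, which is exactly the limitation raised in this thread.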



Re: Guaranteeing globally unique TimeUUID's in a high throughput distributed system

2013-03-16 Thread Josh Dzielak
Thanks Ariel. That works in the case where the timestamp is always increasing, 
i.e. the monotonically increasing clock.  

The problem for me is that the timestamps can be provided by the client, and 
they may be in the past or the future (I only generate the timestamp using the 
current time if no other timestamp was provided). So the counter method isn't 
guaranteed to work unless I store a separate counter for every separate 
millisecond I see. So, k entries in a map where k is a unique millisecond and v 
is where the counter is at for that millisecond. I feel like that could get 
unwieldy. And it still wouldn't get me out of the 10,000 unique events per 1 
ms cap (maybe there's another way to handle that).

Will definitely check out those links though, appreciate it.  

On Saturday, March 16, 2013 at 2:31 PM, Ariel Weisberg wrote:

> Hi,
>
> This has been solved a couple of times, and always pretty much the same way. 
> Encode the id of the worker generating the id into the timestamp, and as you 
> mentioned, maintain a counter for each millisecond.
>
> https://github.com/twitter/snowflake
> https://github.com/VoltDB/voltdb/blob/master/src/frontend/org/voltdb/iv2/UniqueIdGenerator.java
> http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/
>
> Regards,
> Ariel  
>   
>   
> On Sat, Mar 16, 2013, at 05:24 PM, Josh Dzielak wrote:
> > I have a system where a client sends me arbitrary JSON events containing a 
> > timestamp at millisecond resolution. The timestamp is used to generate 
> > column names of type TimeUUIDType.
> >
> > The problem I run into is this - if I client sends me 2 events with the 
> > same timestamp, the TimeUUID that gets generated for each is the same, and 
> > we get 1 insert and 1 update instead of 2 inserts. I might be running many 
> > processes (in my case Storm supervisors) on the same node, so the 
> > machine-specific part of the UUID doesn't help.
> >
> > I have noticed how the Cassandra UUIDGen class lets you work around this. 
> > It has a 'createTimeSafe' method that adds extra precision to the timestamp 
> > such that you can actually get up to 10k unique UUID's for the same 
> > millisecond. That works pretty good for a single process (although it's 
> > still possible to go over 10k, it's unlikely in our actual production 
> > scenario). It does make searches at boundary conditions a little 
> > unpredictable – 'equal' may or may not work depending on whether extra ns 
> > intervals were added – but I can live with that.)  
> >
> > However, this still leaves vulnerability across a distributed system. If 2 
> > events arrive in 2 processes at the exact same millisecond, one will 
> > overwrite the other. If events keep flowing to each process evenly over the 
> > course of the millisecond, we'll be left with roughly half the events we 
> > should have. To work around this, I add a distinct 'component id' to my row 
> > keys that roughly equates to a Storm worker or a JVM process I can cheaply 
> > synchronize.
> >
> > The real problem is that this trick of adding ns intervals only works when 
> > you are generating timestamps from the current time (or any time that's 
> > always increasing). As I mentioned before, my client might be providing a 
> > past or future timestamp, and I have to find a way to make sure each one is 
> > unique.
> >
> > For example, a client might send me 10k events with the same millisecond 
> > timestamp today, and 10k again tomorrow. Using the standard Java library 
> > stuff to generate UUID's, I'd end up with only 1 event stored, not 20,000. 
> > The warning in UUIDGen.getTimeUUIDBytes is clear about this.  
> >
> > Adapting the ns-adding 'trick' to this problem requires synchronized 
> > external state (i.e. storing that the current ns interval for millisecond 
> > 12330982383 is 1234, etc) - definitely a non-starter.
> >
> > So, my dear, and far more seasoned Cassandra users, do you have any 
> > suggestions for me?  
> >
> > Should I drop TimeUUID altogether and just make column names a combination 
> > of millisecond and a big enough random part to be safe? e.g. 
> > '1363467790212-a6c334fefda'. Would I be able to run proper slice queries if 
> > I did this? What other problems might crop up? (It seems too easy :)  
> >
> > Or should I just create a normal random UUID for every event as the column 
> > key and create the non-unique index by time in some other way?  
> >
> > Would appreciate any thoughts, suggestions, and off-the-wall ideas!  
> >
> > PS- I assume this could be a problem in any system (not just Cassandra) 
> > where you want to use 'time' as a unique index yet might have multiple 
> > records for the same time. So any solutions from other realms could be 
> > useful too.  
> >
> > --
> > Josh Dzielak 
> > VP Engineering •Keen IO
> > Twitter • @dzello (https://twitter.com/dzello)
> > Mobile • 773-540-5264
> >
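
The worker-id-plus-counter scheme behind those links can be sketched roughly as below (the field widths, worker id, and frozen clock are illustrative; a real generator reads the system clock and must also handle it moving backwards):

```python
class SnowflakeGenerator:
    """Illustrative snowflake-style ids: 41 bits of epoch milliseconds,
    10 bits of worker id, 12 bits of per-millisecond sequence
    (4096 ids per millisecond per worker)."""

    def __init__(self, worker_id, clock):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.clock = clock            # callable returning epoch milliseconds
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now == self.last_ms:
            self.sequence += 1        # same millisecond: bump the counter
            if self.sequence >= 4096:
                raise RuntimeError("sequence exhausted for this millisecond")
        else:
            self.last_ms = now
            self.sequence = 0
        return (now << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(worker_id=7, clock=lambda: 1363467790212)
a, b = gen.next_id(), gen.next_id()
```

Like the UUIDGen trick, this assumes a monotonic clock, so by itself it does not solve the client-supplied past/future timestamp case discussed above.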



Re: Guaranteeing globally unique TimeUUID's in a high throughput distributed system

2013-03-16 Thread Josh Dzielak
Thanks Philip. I see where you are coming from; that'd be much simpler and 
avoid these bumps.

The only downside is that I'd have to separately maintain an index of event 
timestamps that reflected when they happened according to the client. That way 
when the client asks for 'events last Wednesday' I give them the right answer 
even if the events were recorded in Cassandra today. I think it's at least 
worth weighing against the other solution.  

On Saturday, March 16, 2013 at 2:40 PM, Philip O'Toole wrote:

> On Sat, Mar 16, 2013 at 2:24 PM, Josh Dzielak (j...@keen.io) wrote:
> > I have a system where a client sends me arbitrary JSON events containing a
> > timestamp at millisecond resolution. The timestamp is used to generate
> > column names of type TimeUUIDType.
> >  
> > The problem I run into is this - if I client sends me 2 events with the same
> > timestamp, the TimeUUID that gets generated for each is the same, and we get
> > 1 insert and 1 update instead of 2 inserts. I might be running many
> > processes (in my case Storm supervisors) on the same node, so the
> > machine-specific part of the UUID doesn't help.
> >  
> > I have noticed how the Cassandra UUIDGen class lets you work around this. It
> > has a 'createTimeSafe' method that adds extra precision to the timestamp
> > such that you can actually get up to 10k unique UUID's for the same
> > millisecond. That works pretty good for a single process (although it's
> > still possible to go over 10k, it's unlikely in our actual production
> > scenario). It does make searches at boundary conditions a little
> > unpredictable – 'equal' may or may not work depending on whether extra ns
> > intervals were added – but I can live with that.)
> >  
> > However, this still leaves vulnerability across a distributed system. If 2
> > events arrive in 2 processes at the exact same millisecond, one will
> > overwrite the other. If events keep flowing to each process evenly over the
> > course of the millisecond, we'll be left with roughly half the events we
> > should have. To work around this, I add a distinct 'component id' to my row
> > keys that roughly equates to a Storm worker or a JVM process I can cheaply
> > synchronize.
> >  
> > The real problem is that this trick of adding ns intervals only works when
> > you are generating timestamps from the current time (or any time that's
> > always increasing). As I mentioned before, my client might be providing a
> > past or future timestamp, and I have to find a way to make sure each one is
> > unique.
> >  
> > For example, a client might send me 10k events with the same millisecond
> > timestamp today, and 10k again tomorrow. Using the standard Java library
> > stuff to generate UUID's, I'd end up with only 1 event stored, not 20,000.
> > The warning in UUIDGen.getTimeUUIDBytes is clear about this.
> >  
>  
>  
> It is a mistake, IMHO, to use the timestamp contained within the event
> to generate the time-based UUID. While it will work, it suffers from
> exactly the problem you describe. Instead, use the clock of the host
> system to generate the timestamp. In other words, the event timestamp
> may be different from the timestamp in the UUID. In fact, it *will* be
> different, if the rate gets fast enough (since the 100ns period clock
> used to generate time-based UUIDs may not be fine-grained enough, and
> the UUID timestamp will increase as explained by RFC4122).
>  
> >  
> > Adapting the ns-adding 'trick' to this problem requires synchronized
> > external state (i.e. storing that the current ns interval for millisecond
> > 12330982383 is 1234, etc) - definitely a non-starter.
> >  
> > So, my dear, and far more seasoned Cassandra users, do you have any
> > suggestions for me?
> >  
> > Should I drop TimeUUID altogether and just make column names a combination
> > of millisecond and a big enough random part to be safe? e.g.
> > '1363467790212-a6c334fefda'. Would I be able to run proper slice queries if
> > I did this? What other problems might crop up? (It seems too easy :)
> >  
> > Or should I just create a normal random UUID for every event as the column
> > key and create the non-unique index by time in some other way?
> >  
> > Would appreciate any thoughts, suggestions, and off-the-wall ideas!
> >  
> > PS- I assume this could be a problem in any system (not just Cassandra)
> > where you want to use 'time' as a unique index yet might have multiple
> > records for the same time. So any solutions from other realms could be
> > useful too.
> >  
> > --
> > Josh Dzielak
> > VP Engineering • Keen IO
> > Twitter • @dzello
> > Mobile • 773-540-5264
> >  
>  
>  
>  




Re: Guaranteeing globally unique TimeUUID's in a high throughput distributed system

2013-03-16 Thread Josh Dzielak
Ahh right on. I'm already using wide rows with a similar row key heuristic 
(basically MMDDHH, pulled from the event_time). So I think I'm good there 
but hadn't thought about using a mod instead - any in-practice advantages to 
that?

Excited to try composite columns for this - sounds ideal. I had a similar idea 
of concatenating a UUID onto the event time manually, but this looks like the 
right, non-janky way to do that.

Would you just use a type 4 UUID then, since the range slicing/querying will be 
on the event_time part? Or are there advantages to still using a time UUID with 
the thread/process uniqueness tricks you mentioned?

Thanks Philip!  

On Saturday, March 16, 2013 at 2:56 PM, Philip O'Toole wrote:

> On Sat, Mar 16, 2013 at 2:50 PM, Josh Dzielak (j...@keen.io) wrote:
> > Thanks Philip. I see where you are coming from; that'd be much simpler and
> > avoid these bumps.
> >  
> > The only downside is that I'd have to separately maintain an index of event
> > timestamps that reflected when they happened according to the client. That
> > way when the client asks for 'events last Wednesday' I give them the right
> > answer even if the events were recorded in Cassandra today. I think it's at
> > least worth weighing against the other solution.
> >  
>  
>  
> Way ahead of you. Use wide-rows, and use the UUID to create a
> composite column key. like so:
>  
> event_time:UUID
>  
> This guarantees a unique ID for *every* event.
>  
> And use the "event_time % (some interval you choose)" as your row key
> (many events will then have this as their row key). This makes it easy
> to find the events within a given range by performing the modulo math
> on the requested time range (you must choose the interval as part of
> your design, and stick with it). You do not need a secondary index.
>  
> >  
> > On Saturday, March 16, 2013 at 2:40 PM, Philip O'Toole wrote:
> >  
> > On Sat, Mar 16, 2013 at 2:24 PM, Josh Dzielak (j...@keen.io) wrote:
> >  
> > I have a system where a client sends me arbitrary JSON events containing a
> > timestamp at millisecond resolution. The timestamp is used to generate
> > column names of type TimeUUIDType.
> >  
> > The problem I run into is this - if I client sends me 2 events with the same
> > timestamp, the TimeUUID that gets generated for each is the same, and we get
> > 1 insert and 1 update instead of 2 inserts. I might be running many
> > processes (in my case Storm supervisors) on the same node, so the
> > machine-specific part of the UUID doesn't help.
> >  
> > I have noticed how the Cassandra UUIDGen class lets you work around this. It
> > has a 'createTimeSafe' method that adds extra precision to the timestamp
> > such that you can actually get up to 10k unique UUID's for the same
> > millisecond. That works pretty good for a single process (although it's
> > still possible to go over 10k, it's unlikely in our actual production
> > scenario). It does make searches at boundary conditions a little
> > unpredictable – 'equal' may or may not work depending on whether extra ns
> > intervals were added – but I can live with that.)
> >  
> > However, this still leaves vulnerability across a distributed system. If 2
> > events arrive in 2 processes at the exact same millisecond, one will
> > overwrite the other. If events keep flowing to each process evenly over the
> > course of the millisecond, we'll be left with roughly half the events we
> > should have. To work around this, I add a distinct 'component id' to my row
> > keys that roughly equates to a Storm worker or a JVM process I can cheaply
> > synchronize.
> >  
> > The real problem is that this trick of adding ns intervals only works when
> > you are generating timestamps from the current time (or any time that's
> > always increasing). As I mentioned before, my client might be providing a
> > past or future timestamp, and I have to find a way to make sure each one is
> > unique.
> >  
> > For example, a client might send me 10k events with the same millisecond
> > timestamp today, and 10k again tomorrow. Using the standard Java library
> > stuff to generate UUID's, I'd end up with only 1 event stored, not 20,000.
> > The warning in UUIDGen.getTimeUUIDBytes is clear about this.
> >  
> >  
> > It is a mistake, IMHO, to use the timestamp contained within the event
> > to generate the time-based UUID. While it will work, it suffers from
> > exactly the problem you describe. Instead,
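
Read literally, the wide-row scheme sketched in this thread might look like the following (the one-hour bucket, the helper names, and the type-4 UUID second component are illustrative choices, not anything Cassandra prescribes):

```python
import uuid

BUCKET_MS = 3_600_000  # one-hour row buckets; the interval is a design choice

def row_key(event_ms):
    # All events within the same hour share one (wide) row.
    return event_ms - (event_ms % BUCKET_MS)

def column_key(event_ms):
    # Composite column name (event_time, random UUID): the UUID part keeps
    # every column unique even when two events share a millisecond, while
    # slicing on the first component still allows time-range queries.
    return (event_ms, uuid.uuid4())

ts = 1363467790212
a, b = column_key(ts), column_key(ts)
```
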

Counter value becomes incorrect after several dozen reads & writes

2013-06-24 Thread Josh Dzielak
I have a loop that reads a counter, increments it by some integer, then goes 
off and does about 500ms of other work. After about 10 iterations of this loop, 
the counter value *sometimes* appears to be corrupted.

Looking at the logs, a sequence that just happened is:

Read counter - 15000
Increase counter by - 353
Read counter - 15353
Increase counter by - 1067
Read counter - 286079 (the new counter value is *very* different than what the 
increase should have produced, but usually, suspiciously, around 280k)
Increase counter by - 875
Read counter - 286079  (the counter stops changing at a certain point)


There is only 1 thread running this sequence, and consistency levels are set to 
ALL. The behavior is fairly repeatable - the unexpected mutation will happen 
at least 10% of the time I run this program, but at different points. When it 
does not go awry, I can run this loop many thousands of times and keep the 
counter exact. But if it starts happening to a specific counter, the counter 
will never "recover" and will continue to maintain its incorrect value even 
after successful subsequent writes.

I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test 
cluster. It's also happened in development. Has anyone seen something like 
this? It feels almost too strange to be an actual bug but I'm stumped and have 
been looking at it too long :)

Thanks,
Josh

--
Josh Dzielak 
VP Engineering • Keen IO
Twitter • @dzello (https://twitter.com/dzello)
Mobile • 773-540-5264



Re: Counter value becomes incorrect after several dozen reads & writes

2013-06-24 Thread Josh Dzielak
Hi Arthur,  

This is actually for a column in a counter column family, i.e. 
CounterColumnType. Will check out that thread though, thanks.

Best,
Josh

--
Josh Dzielak 
VP Engineering • Keen IO
Twitter • @dzello (https://twitter.com/dzello)
Mobile • 773-540-5264


On Monday, June 24, 2013 at 8:01 PM, Arthur Zubarev wrote:

> Hi Josh,
>   
> are you looking at the read counter produced by cfstats?
>   
> If so it is not for a CF, but the entire KS and not tied to a specific 
> operation, but rather per the entire lifetime of JVM.
>   
> Just in case, some supporting info: 
> http://stackoverflow.com/questions/9431590/cassandra-cfstats-and-meaning-of-read-write-latency
>   
> /Arthur
>   
> From: Josh Dzielak (mailto:j...@keen.io)  
> Sent: Monday, June 24, 2013 9:42 PM
> To: user@cassandra.apache.org (mailto:user@cassandra.apache.org)  
> Subject: Counter value becomes incorrect after several dozen reads & writes
>  
>  
>   
>  
> I have a loop that reads a counter, increments it by some integer, then goes 
> off and does about 500ms of other work. After about 10 iterations of this 
> loop, the counter value *sometimes* appears to be corrupted.
>   
> Looking at the logs, a sequence that just happened is:
>   
> Read counter - 15000
> Increase counter by - 353
> Read counter - 15353
> Increase counter by - 1067
> Read counter - 286079 (the new counter value is *very* different than what 
> the increase should have produced, but usually, suspiciously, around 280k)
> Increase counter by - 875
> Read counter - 286079  (the counter stops changing at a certain point)
>  
>   
> There is only 1 thread running this sequence, and consistency levels are set 
> to ALL. The behavior is fairly repeatable - the unexpectation mutation will 
> happen at least 10% of the time I run this program, but at different points. 
> When it does not go awry, I can run this loop many thousands of times and 
> keep the counter exact. But if it starts happening to a specific counter, the 
> counter will never "recover" and will continue to maintain it's incorrect 
> value even after successful subsequent writes.
>   
> I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test 
> cluster. It's also happened in development. Has anyone seem something like 
> this? It feels almost too strange to be an actual bug but I'm stumped and 
> have been looking at it too long :)
>   
> Thanks,
> Josh
>   
> --
> Josh Dzielak 
> VP Engineering • Keen IO
> Twitter • @dzello (https://twitter.com/dzello)
> Mobile • 773-540-5264
>   
>  
>  
>  
>  
>  
>  




Re: Upgrade strategy for high number of nodes

2019-11-29 Thread Josh Snyder
Hello Shishir,

It shouldn't be necessary to take downtime to perform upgrades of a
Cassandra cluster. It sounds like the biggest issue you're facing is the
upgradesstables step. upgradesstables is not strictly necessary before a
Cassandra node re-enters the cluster to serve traffic; in my experience it
is purely for optimizing the performance of the database once the software
upgrade is complete. I recommend trying out an upgrade in a test
environment without using upgradesstables, which should bring the 5 hours
per node down to just a few minutes.

If you're running NetworkTopologyStrategy and you want to optimize further,
you could consider performing the upgrade on multiple nodes within the same
rack in parallel. When correctly configured, NetworkTopologyStrategy can
protect your database from an outage of an entire rack. So performing an
upgrade on a few nodes at a time within a rack is the same as a partial
rack outage, from the database's perspective.

Have a nice upgrade!

Josh

On Fri, Nov 29, 2019 at 7:22 AM Shishir Kumar 
wrote:

> Hi,
>
> Need input on cassandra upgrade strategy for below:
> 1. We have Datacenter across 4 geography (multiple isolated deployments in
> each DC).
> 2. Number of Cassandra nodes in each deployment is between 6 to 24
> 3. Data volume on each nodes between 150 to 400 GB
> 4. All production environment has DR set up
> 5. During upgrade we do not want downtime
>
> We are planning to go for stack upgrade but upgradesstables is taking
> approx. 5 hours per node (if data volume is approx 200 GB).
> Options-
> No downtime - As per the recommendation (DataStax documentation), if we plan
> to upgrade one node at a time, i.e. in sequence, the upgrade cycle for one
> environment will take weeks, which is a DevOps concern.
> Read Only (No downtime) - Route read-only load to the DR system. We have
> resilience built in to take care of mutation scenarios. But if it takes
> more than, say, 3-4 hours, there will be a long catch-up exercise. The
> maintenance cost seems too high due to unknowns.
> Downtime - Can upgrade all nodes in parallel as there are no live customers.
> This has a direct customer impact, so we need to justify the maintenance
> cost against the customer impact.
> Please suggest how other organisations (those with 100+ nodes) are solving
> this scenario.
>
> Regards
> Shishir
>
>
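
The rack-at-a-time idea above can be sketched as a simple batch planner (rack and host names are made up; this only schedules groups, it does not drive nodetool or restarts):

```python
def upgrade_plan(hosts_by_rack, parallelism):
    """Plan upgrade batches: one rack at a time, and within a rack up to
    `parallelism` nodes together - a partial rack outage the cluster is
    already designed to survive under NetworkTopologyStrategy."""
    plan = []
    for rack in sorted(hosts_by_rack):
        hosts = hosts_by_rack[rack]
        for i in range(0, len(hosts), parallelism):
            plan.append((rack, hosts[i:i + parallelism]))
    return plan

topology = {"rack1": ["c1", "c2", "c3"], "rack2": ["c4", "c5"]}
plan = upgrade_plan(topology, parallelism=2)
```
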


Re: How beta is 4.0-beta3?

2020-11-24 Thread Josh Snyder
If you are able to build Cassandra yourself, one other option is to
backport the ZstdCompressor patch. That's the route we opted to take.
I've put our Zstd patch against Cassandra 3.0, as backported by Joey
Lynch, up on Github [1].

Our experience with Zstd has been that it works wonders on many kinds
of real world data. Last I checked, a typical dataset that compresses
down to 35% with LZ4 goes down to 18% with Zstd. We pay the cost in
CPU time on compaction, but it's well worth it for our use-cases (and
Zstd is so tunable that we can always adjust, if necessary). The
Cassandra 4.0 betas have an additional patch by Joey [2] that uses LZ4
for flush, so that the additional CPU requirement of Zstd compression
doesn't affect flush speed.

[1] 
https://github.com/hashbrowncipher/cassandra/commit/f79a280b1af85a03bd4f0379fb52ad06dcd62b6e
[2] 
https://github.com/apache/cassandra/commit/9c1bbf3ac913f9bdf7a0e0922106804af42d2c1e
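
For reference, switching a table to Zstd on 4.0 looks roughly like this (the keyspace/table name and the level are illustrative; check the 4.0 compression docs for the exact tunables):

```sql
-- A sketch, not copied from any release notes: enable Zstd-compressed
-- SSTables on an existing table; compression_level trades CPU for ratio.
ALTER TABLE ks.events
  WITH compression = {'class': 'ZstdCompressor', 'compression_level': 3};
```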


On Tue, Nov 24, 2020 at 9:26 AM David Tinker  wrote:
>
> I could really use zstd compression! So if it's not too buggy I will take a 
> chance :) Tx
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Apache TAC: assistance for travel to Berlin Buzzwords

2023-03-24 Thread Josh McKenzie
Cassandra Community!

The Travel Assistance Committee with the Apache Foundation is supporting travel 
to Berlin Buzzwords 2023 (https://2023.berlinbuzzwords.de, 18-20 June 2023) for 
up to 6 people. This conference has lined up pretty well with our project in 
the past and would probably be a great opportunity for folks from our community 
to attend: *"Germany’s most exciting conference on storing, processing, 
streaming and searching large amounts of digital data, with a focus on open 
source software projects"*.

Please see the below message from Gavin McDonald w/the TAC:



Hi All,

The ASF Travel Assistance Committee is supporting taking up to six (6)
people to attend Berlin Buzzwords In June this year.

This includes Conference passes, and travel & accommodation as needed.

Please see our website at https://tac.apache.org for more information and
how to apply.

Applications close on 15th April.

Good luck to those that apply.

Gavin McDonald (VP TAC)

----

~Josh

Re: Unsubscribe

2023-06-20 Thread Josh McKenzie
Email user-unsubscr...@cassandra.apache.org to unsub.

https://cassandra.apache.org/_/community.html

See:
User Mailing List

For broad, opinion-based questions, general discussions, ask how to get help, 
or receive announcements, please subscribe to the user mailing list. Security 
issues need to be reported to the Apache Security Team.

Before submitting a new question, please search the forums above or the mailing 
list archive to see if it has already been answered.

New to the Mailing List? Read the Archives.

On Mon, Jun 19, 2023, at 11:49 PM, Bharat Kul Ratan wrote:
> 


Re: Repair errors

2023-08-06 Thread Josh McKenzie
Quick drive-by observation:
> Did not get replies from all endpoints.. Check the 
> logs on the repair participants for further details

> dropping message of type HINT_REQ due to error
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The
> channel this output stream was writing to has been closed

> Caused by: io.netty.channel.unix.Errors$NativeIoException:
> writeAddress(..) failed: Connection timed out

> java.lang.RuntimeException: Did not get replies from all endpoints.
These all point to the same shaped problem: for whatever reason, the 
coordinator of this repair didn't receive replies from the replicas executing 
it. Could be that they're dead, could be they took too long, could be they 
never got the start message, etc. Distributed operations are tricky like that.

Logs on the replicas doing the actual repairs should give you more insight; 
this is a pretty low level generic set of errors that basically amounts to "we 
didn't hear back from the other participants in time so we timed out."

On Fri, Aug 4, 2023, at 12:02 PM, Surbhi Gupta wrote:
> Can you please try to do nodetool describecluster from every node of the 
> cluster?
> 
> One time I noticed issue when nodetool status shows all nodes UN but 
> describecluster was not.
> 
> Thanks
> Surbhi
> 
> On Fri, Aug 4, 2023 at 8:59 AM Joe Obernberger  
> wrote:
>> Hi All - been using reaper to do repairs, but it has hung.  I tried to run:
>> nodetool repair -pr
>> on each of the nodes, but they all fail with some form of this error:
>> 
>> error: Repair job has failed with the error message: Repair command #521 
>> failed with error Did not get replies from all endpoints.. Check the 
>> logs on the repair participants for further details
>> -- StackTrace --
>> java.lang.RuntimeException: Repair job has failed with the error 
>> message: Repair command #521 failed with error Did not get replies from 
>> all endpoints.. Check the logs on the repair participants for further 
>> details
>>  at 
>> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:137)
>>  at 
>> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>>  at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:633)
>>  at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:555)
>>  at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:474)
>>  at 
>> java.management/com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor.lambda$execute$0(ClientNotifForwarder.java:108)
>>  at java.base/java.lang.Thread.run(Thread.java:829)
>> 
>> Using version 4.1.2-1
>> nodetool status
>> Datacenter: datacenter1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns  Host ID                               Rack
>> UN  172.16.100.45   505.66 GiB  250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
>> UN  172.16.100.251  380.75 GiB  200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
>> UN  172.16.100.35   479.2 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
>> UN  172.16.100.252  248.69 GiB  200     ?     8f0d392f-0750-44e2-91a5-b30708ade8e4  rack1
>> UN  172.16.100.249  411.53 GiB  200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
>> UN  172.16.100.38   333.26 GiB  200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
>> UN  172.16.100.36   405.33 GiB  200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
>> UN  172.16.100.39   437.74 GiB  200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
>> UN  172.16.100.248  344.4 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
>> UN  172.16.100.44   409.36 GiB  200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
>> UN  172.16.100.37   236.08 GiB  120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
>> UN  172.16.20.16    975 GiB     500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
>> UN  172.16.100.34   340.77 GiB  200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
>> UN  172.16.100.42   974.86 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>> 
>> Note: Non-system keyspaces don't have the same replication settings, 
>> effective ownership information is meaningless
>> 
>> Debug log has:
>> 
>> 
>> DEBUG [ScheduledTasks:1] 2023-08-04 11:56:04,955 
>> MigrationCoordinator.java:264 - Pulling unreceived schema versions...
>> INFO  [HintsDispatcher:11344] 2023-08-04 11:56:21,369 
>> HintsDispatchExecutor.java:318 - Finished hinted handoff of file 
>> 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297-1690426370160-2.hints to endpoint 
>> /172.16.20.16:7000: 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297, partially
>> WARN 
>> [Messaging-O

Re: Materialized View inconsistency issue

2023-08-14 Thread Josh McKenzie
When it comes to denormalization in Cassandra today your options are to either 
do it yourself in your application layer or rely on Materialized Views to do it 
for you at the server layer. Neither are production-ready approaches out of the 
box (which is one of the biggest flaws in the "provide it server side as a 
feature" approach); both implementations will need you as a user to:
 1. Deal with failure cases (data loss in base table, consistency violations 
between base and view due to failures during write / anti-entropy vs. gc_grace, 
etc) and
 2. Manage the storage implications of a given base write and the denormalized 
writes that it spawns. This is arguably worse with MV's as you have less 
visibility into the fanout and they're easier to create; it was common to see 
folks create 5-10 views on a base table when they were first released and lock 
up tables and exhaust storage disks, not realizing the implications.
The current inability to clearly see and rely on the state of consistency 
between a base and a view is a significant limitation that's shared by both the 
MV implementation and a user-hand-rolled version. @regis I'd be super 
interested to hear more about:
> we made a spark script downloading the master table and the MV, and comparing 
> them and fixing data (as said previously we have very few errors and we run 
> it maybe once a year
Given the inclusion of the spark bulk reader and writer in the project 
ecosystem, this could prove to be something really useful for a lot of users.

In a post-Accord world with atomic durable multi-partition transactions, we 
should be able to create a more robust, consistent implementation of MV's. This 
doesn't solve the problem of "complete data loss on a base table leaves you 
with data in a view that's orphaned; you need to rebuild the view." That said, 
a Materialized Views feature that only has that one caveat of "if you lose data 
in the base you need to recreate the views" would be a significant improvement. 
It should also be pretty trivial to augment the upcoming size commands to 
support future MV's as well (CASSANDRA-12367 
)

So yeah. Denormalization is a Hard Problem. MV's were an attempt to take a 
burden off the user but we just didn't have sufficiently robust primitives to 
build on at that time to get it where it needed to go.

I'm personally still on the fence between whether a skilled user should go with 
hand-rolled vs. MV's today, but for the general populace of C* users (i.e. 
people that don't have time to get into the weeds), they're probably best 
avoided still for now.

On Thu, Aug 10, 2023, at 8:19 PM, MyWorld wrote:
> Hi surbhi ,
> There are 2 drawbacks associated with MV.
> 1. Inconsistent view
> 2. The lock it takes on the base table. This gets worse when you have huge 
> number of clustering keys in a specific partition.
> 
> It's better you re-design a seperate table and let your API do a parallel 
> write on both.
> 
> Regards,
> Ashish
> 
> On Fri, 11 Aug, 2023, 02:03 Surbhi Gupta,  wrote:
>> Thanks everyone.
>> 
>> 
>> On Wed, 9 Aug 2023 at 01:00, Regis Le Bretonnic
>>  wrote:
>> >
>> > Hi Surbhi
>> >
>> > We do use cassandra materialized views even if not recommended.
>> > There are known issues you have to make with. Despite of them, we still 
>> > use VM.
>> > What we observe is :
>> > * there are  inconsistency issues but few. Most of them are rows that 
>> > should not exist in the MV...
>> > * we made a spark script downloading the master table and the MV, and 
>> > comparing them and fixing data (as said previously we have very few errors 
>> > and we run it maybe once a year)
>> >
>> > * Things go very very very bad when you add or remove a node ! Limit this 
>> > operation if possible and do it knowing what can happen (we isolate the 
>> > ring/datacenter and fix data before putting it back to production. We did 
>> > this only once in the last 4 years).
>> >
>> > PS : all proposals avoiding MV failed for our project. Basically managing 
>> > a table like a MV (by deleting and inserting rows from code) is worse and 
>> > more corrupted than what MV does...
>> > The worse issue is adding and removing nodes. Maybe cassandra 4 improves 
>> > this point (not tested yet).
>> >
>> > Have fun...
>> >
>> > Le mar. 8 août 2023 à 22:36, Surbhi Gupta  a 
>> > écrit :
>> >>
>> >> Hi,
>> >>
>> >> We get complaints about Materialized View inconsistency issues.
>> >> We are on 3.11.5 and on 3.11.5 Materialized Views were not production 
>> >> ready.
>> >> We are ok to upgrade.
>> >>
>> >> On which version of cassandra MVs doesnt have inconsistency issues?
>> >>
>> >> Thanks
>> >> Surbhi


Re: Cassandra p95 latencies

2023-08-14 Thread Josh McKenzie
> The queries are rightly designed
Data modeling in Cassandra is 100% gray space; there unfortunately is no right 
or wrong design. You'll need to share basic shapes / contours of your data 
model for other folks to help you; seemingly innocuous things in a data model 
can cause unexpected issues w/C*'s storage engine paradigm thanks to the 
partitioning and data storage happening under the hood.

If you were seeing single digit ms on 3.0.X or 3.11.X and 40ms p95 on 4.0 I'd 
immediately look to the DB as being the culprit. For all other cases, you 
should be seeing single digit ms as queries in C* generally boil down to 
key/value lookups (partition key) to a list of rows you either point query 
(key/value #2) or range scan via clustering keys and pull back out.

There's also paging to take into consideration (whether you're using it or not, 
what your page size is) and the data itself (do you have thousands of columns? 
Multi-MB blobs you're pulling back out? etc). All can play into this.
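Concretely, the two read shapes described above look like this (placeholder schema with partition key `pk` and clustering key `ck`):

```sql
-- point query: partition key plus full clustering key (single-row lookup)
SELECT * FROM my_ks.my_table WHERE pk = ? AND ck = ?;

-- range scan: partition key plus a clustering-key slice within one partition
SELECT * FROM my_ks.my_table WHERE pk = ? AND ck >= ? AND ck < ?;
```

Both should be single-digit-ms on healthy hardware for small rows; running them under `TRACING ON` in cqlsh is the quickest way to see where the 40 ms is going.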

On Fri, Aug 11, 2023, at 3:40 PM, Jeff Jirsa wrote:
> You’re going to have to help us help you 
> 
> 4.0 is pretty widely deployed. I’m not aware of a perf regression 
> 
> Can you give us a schema (anonymized) and queries and show us a trace ? 
> 
> 
>> On Aug 10, 2023, at 10:18 PM, Shaurya Gupta  wrote:
>> 
>> The queries are rightly designed as I already explained. 40 ms is way too 
>> high as compared to what I seen with other DBs and many a times with 
>> Cassandra 3.x versions.
>> CPU consumed as I mentioned is not high, it is around 20%.
>> 
>> On Thu, Aug 10, 2023 at 5:14 PM MyWorld  wrote:
>>> Hi,
>>> P95 should not be a problem if rightly designed. Levelled compaction 
>>> strategy further reduces this, however it consume some resources. For read, 
>>> caching is also helpful. 
>>> Can you check your cpu iowait as it could be the reason for delay 
>>> 
>>> Regards,
>>> Ashish
>>> 
>>> On Fri, 11 Aug, 2023, 04:58 Shaurya Gupta,  wrote:
 Hi community
 
 What is the expected P95 latency for Cassandra Read and Write queries 
 executed with Local_Quorum over a table with 3 replicas ? The queries are 
 done using the partition + clustering key and row size in bytes is not too 
 much, maybe 1-2 KB maximum.
 Assuming CPU is not a crunch ?
 
 We observe those to be 40 ms P95 Reads and same for Writes. This looks 
 very high as compared to what we expected. We are using Cassandra 4.0.
 
 Any documentation / numbers will be helpful.
 
 Thanks
 --
 Shaurya Gupta
 
>> 
>> 
>> --
>> Shaurya Gupta
>> 


Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Josh McKenzie
I think it's all part of the same issue and you're not derailing IMO Abe. For 
the user Pabbireddy here, the unexpected behavior was not closing internode 
connections on that keystore refresh. So ISTM, from a "featureset that would be 
nice to have here" perspective, we could theoretically provide:
 1. An option to disconnect all connections on cert update, disabled by default
 2. An option to drain and recycle connections on a time period, also disabled 
by default
Leave the current behavior in place but allow for these kind of strong 
cert-guarantees if a user needs it in their env.

On Mon, Apr 15, 2024, at 9:51 PM, Abe Ratnofsky wrote:
> Not to derail from the original conversation too far, but wanted to agree 
> that maximum connection establishment time on native transport would be 
> useful. That would provide a maximum duration before an updated client 
> keystore is used for connections, which can be used to safely roll out client 
> keystore updates.
> 
> For example, if the maximum connection establishment time is 12 hours, then 
> you can update the keystore on a canary client, wait 24 hours, confirm that 
> connectivity is maintained, then upgrade keystores across the rest of the 
> fleet.
> 
> With unbounded connection establishment, reconnection isn't tested as often 
> and issues can hide behind long-lived connections.
> 
>> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa  wrote:
>> 
>> It seems like if folks really want the life of a connection to be finite 
>> (either client/server or server/server), adding in an option to quietly 
>> drain and recycle a connection on some period isn’t that difficult.
>> 
>> That type of requirement shows up in a number of environments, usually on 
>> interactive logins (cqlsh, login, walk away, the connection needs to become 
>> invalid in a short and finite period of time), but adding it to internode 
>> could also be done, and may help in some weird situations (if you changed 
>> certs because you believe a key/cert is compromised, having the connection 
>> remain active is decidedly inconvenient, so maybe it does make sense to add 
>> an expiration timer/condition on each connection).
>> 
>> 
>> 
>>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
>>> 
>>> In addition to what Andy mentioned, I want to point out that for the vast 
>>> majority of use-cases, we would like to _avoid_ interruptions when a 
>>> certificate is updated so it is by design. If you're dealing with a 
>>> situation where you want to ensure that the connections are cycled, you can 
>>> follow Andy's advice. It will require automation outside the database that 
>>> you might already have. If there is demand, we can consider adding a 
>>> feature to slowly cycle the connections so the old SSL context is not used 
>>> anymore.
>>> 
>>> One more thing you should bear in mind is that Cassandra will not load the 
>>> new SSL context if it cannot successfully initialize it. This is again by 
>>> design to prevent an outage when the updated truststore is corrupted or 
>>> could not be read in some way.
>>> 
>>> thanks,
>>> Dinesh
>>> 
>>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  
>>> wrote:
 I should mention, when toggling disablebinary/enablebinary between
 instances, you will probably want to give some time between doing this
 so connections can reestablish, and you will want to verify that the
 connections can actually reestablish.  You also need to be mindful of
 this being disruptive to inflight queries (if your client is
 configured for retries it will probably be fine).  Semantically to
 your applications it should look a lot like a rolling cluster bounce.
 
 Thanks,
 Andy
 
 On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
  wrote:
 >
 > Thanks Andy for your reply . We will test the scenario you mentioned.
 >
 > Regards
 > Avinash
 >
 > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
 > wrote:
 >>
 >> Hi Avinash,
 >>
 >> As far as I understand it, if the underlying keystore/trustore(s)
 >> Cassandra is configured for is updated, this *will not* provoke
 >> Cassandra to interrupt existing connections, it's just that the new
 >> stores will be used for future TLS initialization.
 >>
 >> Via: 
 >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
 >>
 >> > When the files are updated, Cassandra will reload them and use them 
 >> > for subsequent connections
 >>
 >> I suppose one could do a rolling disablebinary/enablebinary (if it's
 >> only client connections) after you roll out a keystore/truststore
 >> change as a way of enforcing the existing connections to reestablish.
 >>
 >> Thanks,
 >> Andy
 >>
 >>
 >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 >>  wrote:
 >> >
 >> > Dear Community,
 >> >
 >> > I hope this email find

Re: CDC and schema disagreement

2024-09-23 Thread Josh McKenzie
Yeah; think that would need to live in `AlterTableStatement.java` in probably 
`#AlterOptions`. There's guardrail validation in there for gc_grace, mv's, 
replication strategies, and compression on 4.1 but I'm not seeing any guardrail 
for CDC.

Could you open a JIRA for that Bowen? Might even be other params that we can 
enable/disable on the node level we're not checking in the guardrails on alter; 
CDC predated guardrails so this might be an isolated oversight but /shrug.
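For context, the repro pairs a node-local yaml flag with a cluster-wide schema change; roughly (keyspace/table names are placeholders):

```sql
-- With cdc_enabled: false in cassandra.yaml on at least one node,
-- running this cluster-wide DDL against that node triggers the
-- behaviour Bowen describes below:
ALTER TABLE my_ks.my_table WITH cdc = true;
```

A guardrail would reject the ALTER up front on a node where the feature is disabled, instead of letting the schema change partially apply.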

On Mon, Sep 23, 2024, at 11:06 AM, Josh McKenzie wrote:
> I wouldn't be surprised if we don't have logic in place to handle that 
> disjoint (DDL for disabled .yaml property) at least in the case of CDC. It's 
> been the better part of a decade since that first impl but I don't have any 
> recollection of that logic being in there.
> 
> Let me take a quick look at the code and I'll get back to you; might need a 
> JIRA.
> 
> On Fri, Sep 20, 2024, at 11:50 AM, Štefan Miklošovič wrote:
>> Thank you for reporting this. I may check next week more closely and let you 
>> know.
>> 
>> On Fri, Sep 20, 2024 at 5:43 PM Bowen Song via user 
>>  wrote:
>>> Hi all,
>>> 
>>> I suspect that I've ran into a bug (or two).
>>> 
>>> On Cassandra 4.1.1, when `cdc_enabled` in the cassandra.yaml file is set
>>> to `false` on at least one node in the cluster, and then the `ALTER
>>> TABLE ... WITH cdc=...` statement was run against that node, the cluster
>>> will end up in the schema disagreement state. At this stage, a rolling
>>> restart will bring the schema back in sync, but the changes made to the
>>> `cdc` table property will be lost.
>>> 
>>> On Cassandra 4.1.6, the same procedure doesn't cause visible schema
>>> disagreement in the `nodetool describecluster` command's output, but the
>>> `ALTER TABLE` statement only has cosmetic effect on the node it is run.
>>> The node with `cdc_enabled` set to `false` will show the `cdc` table
>>> property has changed, but this does not affect its behaviour in any way.
>>> At the same time, other nodes do not see that table property change at
>>> all. This is perhaps even worse than on 4.1.1, because the alter table
>>> statement is silently failing.
>>> 
>>> A shell script for reproducing the above described behaviours, and the
>>> output on both 4.1.1 and 4.1.6 are attached.
>>> 
>>> (as a good security practice, please always read and understand the full
>>> script you downloaded from untrusted sources before attempting to run it)
>>> 
>>> So, are these bugs? Or is this some kind of behaviour that's documented
>>> but I failed to find that documentation for?
>>> 
>>> Cheers,
>>> Bowen
> 


Re: CDC and schema disagreement

2024-09-23 Thread Josh McKenzie
I wouldn't be surprised if we don't have logic in place to handle that disjoint 
(DDL for disabled .yaml property) at least in the case of CDC. It's been the 
better part of a decade since that first impl but I don't have any recollection 
of that logic being in there.

Let me take a quick look at the code and I'll get back to you; might need a 
JIRA.

On Fri, Sep 20, 2024, at 11:50 AM, Štefan Miklošovič wrote:
> Thank you for reporting this. I may check next week more closely and let you 
> know.
> 
> On Fri, Sep 20, 2024 at 5:43 PM Bowen Song via user 
>  wrote:
>> Hi all,
>> 
>> I suspect that I've ran into a bug (or two).
>> 
>> On Cassandra 4.1.1, when `cdc_enabled` in the cassandra.yaml file is set 
>> to `false` on at least one node in the cluster, and then the `ALTER 
>> TABLE ... WITH cdc=...` statement was run against that node, the cluster 
>> will end up in the schema disagreement state. At this stage, a rolling 
>> restart will bring the schema back in sync, but the changes made to the 
>> `cdc` table property will be lost.
>> 
>> On Cassandra 4.1.6, the same procedure doesn't cause visible schema 
>> disagreement in the `nodetool describecluster` command's output, but the 
>> `ALTER TABLE` statement only has cosmetic effect on the node it is run. 
>> The node with `cdc_enabled` set to `false` will show the `cdc` table 
>> property has changed, but this does not affect its behaviour in any way. 
>> At the same time, other nodes do not see that table property change at 
>> all. This is perhaps even worse than on 4.1.1, because the alter table 
>> statement is silently failing.
>> 
>> A shell script for reproducing the above described behaviours, and the 
>> output on both 4.1.1 and 4.1.6 are attached.
>> 
>> (as a good security practice, please always read and understand the full 
>> script you downloaded from untrusted sources before attempting to run it)
>> 
>> So, are these bugs? Or is this some kind of behaviour that's documented 
>> but I failed to find that documentation for?
>> 
>> Cheers,
>> Bowen


Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Josh McKenzie
It's kind of a shame we don't have rolling restart functionality built in to 
the database / sidecar. I know we've discussed that in the past.

+1 to Jon's question - clients (i.e. java driver, etc) should be able to handle 
disconnects gracefully and route to other coordinators leaving the 
application-facing symptom being a blip on latency. Are you seeing something 
else more painful, or is it more just not having the built-in tooling / 
instrumentation to make it a clean reproducible operation?
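For reference, the setting in question lives in cassandra.yaml on the 5.0 nodes; a sketch of the single-step variant Paul describes (verify against the 5.0 upgrade docs for your workload, especially around large TTLs, before skipping the transitional modes):

```yaml
# cassandra.yaml (Cassandra 5.0)
# Default during upgrades is CASSANDRA_4; UPGRADING is the intermediate
# step; NONE enables the 5.0 formats immediately.
storage_compatibility_mode: NONE
```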

On Tue, Dec 17, 2024, at 2:24 PM, Jon Haddad wrote:
> Just curious, why is a rolling restart difficult?  Is it a tooling issue, 
> stability, just overall fear of messing with things?
> 
> You *should* be able to do a rolling restart without it being an issue.  I 
> look at this as a fundamental workflow that every C* operator should have 
> available, and you should be able to do them without there being any concern. 
> 
> Jon
> 
> 
> On 2024/12/17 16:01:06 Paul Chandler wrote:
> > All,
> > 
> > We are getting a lot of push back on the 3 stage process of going through 
> > the three compatibility modes to upgrade to Cassandra 5. This basically 
> > means 3 rolling restarts of a cluster, which will be difficult for some of 
> > our large multi DC clusters.
> > 
> > Having researched this, it looks like, if you are not going to create large 
> > TTL’s, it would be possible to go straight from C*4 to C*5 with SCM NONE. 
> > This seems to be the same as it would have been going from 4.0 -> 4.1
> > 
> > Is there any reason why this should not be done? Has anyone had experience 
> > of upgrading in this way?
> > 
> > Thanks 
> > 
> > Paul Chandler
> > 
> >  
> 


Re: [External]Cassandra 5.0: Any Official Tests Supporting 'Free Performance Gains'

2025-03-20 Thread Josh McKenzie
You may find the charts on the following JIRAs interesting:
https://issues.apache.org/jira/browse/CASSANDRA-17240

That covers the memtables. The combination of UCS (new compaction strategy), 
memtables, and trie indexes is covered a bit in this youtube video here: 
https://youtu.be/eKxj6s4vzmI?list=PLqcm6qE9lgKKls90MlpejceYUU_0qVnWa&t=2075

All told, Branimir's work here is a Big Deal. We really should invest the time 
in a blog post with more clarity on how impactful these changes are for data 
density and performance; thanks for raising this question as it helps clarify 
that.

~Josh
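For anyone wanting to try this, the trie memtable is opt-in per table in 5.0. A hedged sketch, with configuration names as I understand them from the 5.0 docs (verify against your version):

```yaml
# cassandra.yaml: declare a named memtable configuration...
memtable:
  configurations:
    trie:
      class_name: TrieMemtable

# ...then select it per table via CQL:
#   ALTER TABLE my_ks.my_table WITH memtable = 'trie';
```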

On Wed, Mar 19, 2025, at 10:35 AM, Jiri Steuer (EIT) wrote:
> Hi FMH,
>  
> I haven't seen these official tests and that was the reason I did these tests 
> with the official tools. Regards
>  
>J. Steuer
>  
> **
> This item's classification is Internal. It was created by and is in property 
> of EmbedIT. Do not distribute outside of the organization.
> 
> From: FMH  
> *Sent:* Wednesday, March 19, 2025 3:14 PM
> *To:* Cassandra Support-user 
> *Subject:* [External]Cassandra 5.0: Any Official Tests Supporting 'Free 
> Performance Gains'
> 
> 
>  
> This message is from an EXTERNAL SENDER - be CAUTIOUS, particularly with 
> links and attachments.
> Please report all suspicious e-mails to helpd...@embedit.com
> 
>  
> As I'm evaluating to upgrade to C* 4 or 5, one statement caught my attention 
> for the 5 release 
> (https://cassandra.apache.org/_/blog/Apache-Cassandra-5.0-Announcement.html):
> "Trie Memtables and Trie SSTables These low-level optimizations yield 
> impressive gains in memory usage and storage efficiency, providing a "free" 
> performance"
>  
> I have only found a single doc show-casing empirical evidence for such 
> performance gains. As per this document, compared to version 4.1, C* 5 had ...
> - 38% better performance and 26% better response time for write operations
> - 12% better performance and 9% better response time for read operations
>  
> I'm just wondering if there has been any official test results supporting the 
> claim for 'free performance'.
>  
> I'm trying to corroborate the test results described above. 
>  
> https://www.linkedin.com/pulse/performance-comparison-between-cassandra-version-41-5-jiri-steuer-pxbtf/
>  
> 
> Thank you
> 


Re: [Question]When does Cassandra support OpenJDK 20 and above?

2025-03-02 Thread Josh McKenzie
My hope is to have JDK21 support merged in before our next major which we'll 
likely push to release this calendar year. Work is tracked here: 
https://issues.apache.org/jira/browse/CASSANDRA-18831

At this point there's a handful of test failures to burn down but otherwise 
JDK21 support is largely complete.

On Sat, Mar 1, 2025, at 11:39 PM, xiongbei wrote:
> Hello, I am a developer from ZTE company. Our project is using your Cassandra 
> 4.1 series version, but the openjdk version still only supports jdk8 and 
> jdk11. I would like to ask when the new version of Cassandra will support 
> jdk20 and above. Thank you


Re: Recycled-Commitlogs

2025-07-02 Thread Josh McKenzie
I was perhaps a bit snarky with you there Marc; my apologies. There's a history 
of some tension between Cassandra and Scylla and a user coming from an 
unrelated community knowingly asking for free support / time / energy from an 
unrelated project while holding back that context rubbed me the wrong way.

In the future I'd advise leading with your context rather than holding back on 
"having a confession to make".

Also, for posterity: a quick google of "Scylla Recycled commit logs" turns up 
this as the first response: https://github.com/scylladb/scylladb/issues/11184

On Wed, Jul 2, 2025, at 4:26 AM, Dmitry Konstantinov wrote:
> Hi Marc,
> 
> The recycled commit logs functionality was used in very old versions of 
> Cassandra, but it was removed in version 2.2.0 back in 2015 (so this logic 
> was removed 10 years ago, and understandably, no one wants to do 
> archaeological research). The currently supported versions start from 4.0.x.
> 
> The idea behind recycled logs was to reuse commit log files to reduce file 
> system overhead, but in practice, it didn’t work well. It caused issues and 
> made the implementation unnecessarily complex, so the feature was eventually 
> removed.
> 
> 
> Here’s a good summary article about the Cassandra commit log design, which 
> also includes a section on recycled logic ("Segment Recycling"):
> https://cassandra.apache.org/_/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.html
> 
> However, this knowledge about Cassandra likely won’t help with your current 
> issue — there’s no guarantee that ScyllaDB shares a similar design in this 
> area, and any bugs will be specific to ScyllaDB. So, I think the best next 
> step for you is to contact the ScyllaDB community directly.
> 
> 
> You can also check their existing GitHub issues — for example, here’s one 
> that describes similar symptoms:
> https://github.com/scylladb/scylladb/issues/11184
> 
> 
> Regards,
> Dmitry
> 
> 
> 
> On Wed, 2 Jul 2025 at 07:57, Marc Hoppins  wrote:
>> SO CASSANDRA NEVER HAD RECYCLED COMMIT LOGS? If not then I apologise for 
>> wasting everyone’s time. However if, for previous versions, recycled 
>> commitlogs were a part of Cassandra then thank you for not sharing the 
>> knowledge of a Cassandra item.
>>
>> Time to unsubscribe
>>
>> *From:* Josh McKenzie  
>> *Sent:* Tuesday, July 1, 2025 1:47 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Recycled-Commitlogs
>>
>> EXTERNAL
>>> Can anyone explain WHAT they are?
>> Yes Marc. The Scylla community. That's what we're trying to tell you.
>>
>> As far as anyone on this list seems to think, this isn't related to Apache 
>> Cassandra.
>> __ __
>>> On Tue, Jul 1, 2025, at 1:54 AM, Marc Hoppins wrote:
>>> That aside. Can anyone explain WHAT they are? If they were done away with 
>>> in Cassandra then they must have been surplus to requirements. In which 
>>> case, why have them originally? They must have been there for some 
>>> reason.
>>> 
>>>  
>>> 
>>> *From:* Josh McKenzie  
>>> *Sent:* Monday, June 30, 2025 4:16 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Recycled-Commitlogs
>>> 
>>>  
>>> 
>>> EXTERNAL
>>> 
>>>> As it is (generally) Cassandra compatible I naturally assumed that these 
>>>> items were in both applications.
>>>> 
>>> For future reference - it's "generally CQL API compatible". That's the 
>>> extent of it.
>>> 
>>>  
>>> 
>>> It's analogous to driving a make/model of one car and taking it to a 
>>> completely different unrelated dealer for maintenance because they're both 
>>> cars. Just because the interface to use them (steering wheel, doors, 
>>> engine, 4 tires) is broadly the same, the internals are wildly 
>>> different.
>>> 
>>>  
>>> 
>>> On Sat, Jun 28, 2025, at 11:56 PM, Jeff Jirsa wrote:
>>> 
>>>> Yea we’re not gonna be able to help you
>>>> 
>>>>  
>>>> 
>>>> This sounds like a software defect but it’s not cassandra so we really can’t
>>>> do much.
>>>> 
>>>>  
>>>> 
>>>> On 2025/06/27 13:13:17 Marc Hoppins wrote:
>>>> 
>>>> > Well, I do have a confession to make. It is actually scyllaDB

Re: Recycled-Commitlogs

2025-06-30 Thread Josh McKenzie
> As it is (generally) Cassandra compatible I naturally assumed that these 
> items were in both applications.
For future reference - it's "generally CQL API compatible". That's the extent 
of it.

It's analogous to driving a make/model of one car and taking it to a completely 
different unrelated dealer for maintenance because they're both cars. Just 
because the interface to use them (steering wheel, doors, engine, 4 tires) is 
broadly the same, the internals are wildly different.

On Sat, Jun 28, 2025, at 11:56 PM, Jeff Jirsa wrote:
> Yea we’re not gonna be able to help you
> 
> This sounds like a software defect but it’s not cassandra so we really can’t
> do much.
> 
> On 2025/06/27 13:13:17 Marc Hoppins wrote:
> > Well, I do have a confession to make. It is actually scyllaDB and the 
> > latest version. As it is (generally) Cassandra compatible I naturally 
> > assumed that these items were in both applications.
> > 
> > Marc
> > 
> > From: Jeff Jirsa 
> > Sent: Thursday, June 26, 2025 7:35 PM
> > To: user@cassandra.apache.org
> > Cc: user@cassandra.apache.org
> > Subject: Re: Recycled-Commitlogs
> > 
> > What version of cassandra is this?
> > 
> > Recycling segments was a thing from like 1.1 to 2.2 but really very 
> > different in modern versions (and cdc / point in time backup mirrors some 
> > of the concepts around hanging onto segments)
> > 
> > Knowing the version would be super helpful though
> > 
> > Is this … 1.2? 2.0?
> > 
> > 
> > 
> > On Jun 26, 2025, at 1:22 AM, guo Maxwell <cclive1...@gmail.com> wrote:
> > 
> > I guess it comes from the archive of commitlogs, just a guess~~~
> >
> > But I think we need the Cassandra version and the commitlog configuration in
> > cassandra.yaml, and commitlog_archiving.properties, to determine this.
> > 
> > Marc Hoppins <marc.hopp...@eset.com> wrote on Thu, Jun 26, 2025 at 16:08:
> > Hi,
> > 
> > I am not a data person but a Linux admin.  One of our nodes has thousands of
> > 
> > -rw-r--r-- 1 root root 33554432 Jun 24 15:11 
> > Recycled-CommitLog-2-67041997483.log
> > 
> > hanging around. Eventually they fill the filesystem. I have searched around 
> > and can find no mention of these recycled commits.
> > 
> > Can anyone explain what they are for?   Can I purge these in some graceful 
> > fashion with a service restart, a simple deletion, or a complete 
> > drain/restart of the node?
> > 
> > Thanks
> > 
> > Marc
> > 
> 


Re: Recycled-Commitlogs

2025-06-26 Thread Josh McKenzie
> cdc / point in time backup mirrors some of the concepts around hanging onto 
> segments
That was my first thought but we never prepended "Recycled-" on the front of 
that that I know of.

On Thu, Jun 26, 2025, at 1:35 PM, Jeff Jirsa wrote:
> 
> What version of cassandra is this? 
> 
> Recycling segments was a thing from like 1.1 to 2.2 but really very different 
> in modern versions (and cdc / point in time backup mirrors some of the 
> concepts around hanging onto segments)
> 
> Knowing the version would be super helpful though 
> 
> Is this … 1.2? 2.0?
> 
> 
>> On Jun 26, 2025, at 1:22 AM, guo Maxwell  wrote:
>> 
>> I guess it comes from the archive of commitlogs, just a guess~~~
>>
>> But I think we need the Cassandra version and the commitlog configuration in
>> cassandra.yaml, and commitlog_archiving.properties, to determine this.
>> 
>> Marc Hoppins wrote on Thu, Jun 26, 2025 at 16:08:
>>> Hi,
>>>
>>> I am not a data person but a Linux admin. One of our nodes has thousands of
>>>
>>> -rw-r--r-- 1 root root 33554432 Jun 24 15:11
>>> Recycled-CommitLog-2-67041997483.log
>>>
>>> hanging around. Eventually they fill the filesystem. I have searched around
>>> and can find no mention of these recycled commits.
>>>
>>> Can anyone explain what they are for? Can I purge these in some graceful
>>> fashion with a service restart, a simple deletion, or a complete
>>> drain/restart of the node?
>>>
>>> Thanks
>>>
>>> Marc
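
Since the core symptom in Marc's report is disk pressure from these segments, a useful first step (before deciding how to purge) is simply to measure them. Below is a minimal, vendor-neutral sketch using only GNU find and awk; it works the same whether the node runs Cassandra or ScyllaDB. The commitlog path in the example is an assumption and should be replaced with the node's configured commitlog_directory:

```shell
# du_recycled: count the Recycled-CommitLog segments in a directory and
# report the total space they hold. Matches only the "Recycled-" files,
# never the active CommitLog-* segments.
du_recycled() {
  find "$1" -maxdepth 1 -name 'Recycled-CommitLog-*.log' -printf '%s\n' \
    | awk '{ total += $1; n++ } END { printf "%d files, %.2f GiB\n", n, total / 1024 / 1024 / 1024 }'
}

# Example (path is an assumption; use your configured commitlog_directory):
# du_recycled /var/lib/cassandra/commitlog
```

Run periodically (e.g. from cron), this also shows how fast the segments accumulate, which is useful context when reporting the issue upstream.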


Re: Recycled-Commitlogs

2025-07-01 Thread Josh McKenzie
> Can anyone explain WHAT they are?
Yes Marc. The Scylla community. That's what we're trying to tell you.

As far as anyone on this list seems to think, this isn't related to Apache 
Cassandra.

On Tue, Jul 1, 2025, at 1:54 AM, Marc Hoppins wrote:
> That aside. Can anyone explain WHAT they are? If they were done away with in 
> Cassandra then they must have been surplus to requirements. In which case, 
> why have them originally? They must have been there for some reason.
>  
> *From:* Josh McKenzie  
> *Sent:* Monday, June 30, 2025 4:16 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Recycled-Commitlogs
>  
>> As it is (generally) Cassandra compatible I naturally assumed that these 
>> items were in both applications.
> For future reference - it's "generally CQL API compatible". That's the extent 
> of it.
>  
> It's analogous to driving a make/model of one car and taking it to a 
> completely different unrelated dealer for maintenance because they're both 
> cars. Just because the interface to use them (steering wheel, doors, engine, 
> 4 tires) is broadly the same, the internals are wildly different.
>  
> On Sat, Jun 28, 2025, at 11:56 PM, Jeff Jirsa wrote:
>> Yea we’re not gonna be able to help you
>>  
>> This sounds like a software defect but it’s not cassandra so we really can’t
>> do much.
>>  
>> On 2025/06/27 13:13:17 Marc Hoppins wrote:
>> > Well, I do have a confession to make. It is actually scyllaDB and the 
>> > latest version. As it is (generally) Cassandra compatible I naturally 
>> > assumed that these items were in both applications.
>> > 
>> > Marc
>> > 
>> > From: Jeff Jirsa 
>> > Sent: Thursday, June 26, 2025 7:35 PM
>> > To: user@cassandra.apache.org
>> > Cc: user@cassandra.apache.org
>> > Subject: Re: Recycled-Commitlogs
>> > 
>> > What version of cassandra is this?
>> > 
>> > Recycling segments was a thing from like 1.1 to 2.2 but really very 
>> > different in modern versions (and cdc / point in time backup mirrors some 
>> > of the concepts around hanging onto segments)
>> > 
>> > Knowing the version would be super helpful though
>> > 
>> > Is this … 1.2? 2.0?
>> > 
>> > 
>> > 
>> > On Jun 26, 2025, at 1:22 AM, guo Maxwell <cclive1...@gmail.com> wrote:
>> > 
>> > I guess it comes from the archive of commitlogs, just a guess~~~
>> >
>> > But I think we need the Cassandra version and the commitlog configuration
>> > in cassandra.yaml, and commitlog_archiving.properties, to determine this.
>> > 
>> > Marc Hoppins <marc.hopp...@eset.com> wrote on Thu, Jun 26, 2025 at 16:08:
>> > Hi,
>> > 
>> > I am not a data person but a Linux admin.  One of our nodes has thousands 
>> > of
>> > 
>> > -rw-r--r-- 1 root root 33554432 Jun 24 15:11 
>> > Recycled-CommitLog-2-67041997483.log
>> > 
>> > hanging around. Eventually they fill the filesystem. I have searched 
>> > around and can find no mention of these recycled commits.
>> > 
>> > Can anyone explain what they are for?   Can I purge these in some graceful 
>> > fashion with a service restart, a simple deletion, or a complete 
>> > drain/restart of the node?
>> > 
>> > Thanks
>> > 
>> > Marc
>> > 
>>  
>  
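
To Marc's original question about purging gracefully: on Apache Cassandra the conventional sequence is drain, stop, delete, restart. The sketch below assumes standard Cassandra tooling and paths; for ScyllaDB it is only a hypothetical starting point (substitute the scylla-server unit and Scylla's commitlog directory, and confirm with the Scylla community first):

```shell
# purge_recycled: delete only the Recycled-CommitLog-* segments in a
# directory, never the active CommitLog-* files. Safe to try on a scratch
# directory before pointing it at a real node.
purge_recycled() {
  find "$1" -maxdepth 1 -name 'Recycled-CommitLog-*.log' -delete
}

# Graceful sequence (commands and paths are assumptions; adapt to your install):
# nodetool drain                                # 1. flush memtables, stop accepting writes
# sudo systemctl stop cassandra                 # 2. stop the service cleanly
# purge_recycled /var/lib/cassandra/commitlog   # 3. remove only the recycled segments
# sudo systemctl start cassandra                # 4. bring the node back
```

Draining first matters: it quiesces the commitlog so nothing is writing to the directory while files are removed.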


Re: Cassandra Meetup – Hosted by Uber and the Apache Cassandra Community (Aug 12, Sunnyvale, CA)

2025-07-23 Thread Josh McKenzie
> I'll be there, and I hope if you are in the area, you don't miss this 
> opportunity for us to connect. 
Do not arm wrestle Patrick. Unless you like doing PT afterward.

I'll be there; have some things to present and chat about. :D

Looking forward to it!

On Tue, Jul 22, 2025, at 6:48 PM, Jaydeep Chovatia wrote:
> Sure, Alex. 
> Let me know if somebody else living in the Bay Area wants to own it; 
> otherwise, I am happy to own it :)
> 
> Jaydeep
> 
> On Tue, Jul 22, 2025 at 1:38 PM Alex Petrov  wrote:
>>
>> Hi Jaydeep,
>> 
>> It so happened that I (despite not being physically in US) "inherited" Bay 
>> Area Cassandra Meetup page. If you are interested, please let me know and I 
>> can either transfer ownership to you or just mirror this meetup's data for 
>> the existing Cassandra audience.
>> 
>> --Alex
>> 
>> On Tue, Jul 22, 2025, at 9:42 PM, Jaydeep Chovatia wrote:
>>> Dear Apache Cassandra dev@ and user@,
>>> 
>>> There is an in-person-only Apache Cassandra meetup hosted by Uber in 
>>> collaboration with the broader Cassandra community! Encourage everyone in 
>>> the Bay Area to take advantage of this opportunity to connect with experts 
>>> and learn from the topics covered.
>>> 
>>> 📍 Location: Sunnyvale, CA 
>>> 📅 Date: August 12, 2025 
>>> 🕠 Time: 5:30 PM – 8:30 PM PDT 
>>> 🔗 RSVP & Details: https://www.meetup.com/uberevents/events/310016446/
>>> 🔗 LinkedIn Event Post: 
>>> https://www.linkedin.com/feed/update/urn:li:activity:7353102746624970752
>>> 
>>> Jaydeep
>>