cql-maven-plugin

2016-10-07 Thread Brice Dutheil
Hi there,

I’d like to share a very simple project for handling CQL files with
Maven. We were using the cassandra-maven-plugin before, but ran into
limitations around authentication and its use of the Thrift protocol. I was
tempted to write a replacement focused only on the execution of CQL
statements, in the same way that sql-maven-plugin does for SQL.

I didn’t port the Cassandra lifecycle tasks, as they can be handled with
other tools, e.g. Docker.

It’s available from Maven at the coordinates
com.github.bric3.maven:cql-maven-plugin:0.4, and the code is on
GitHub: https://github.com/bric3/cql-maven-plugin
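
Wiring it into a build looks roughly like this (the goal name below is an
assumption by analogy with sql-maven-plugin; check the project README for the
actual goals and configuration):

    <plugin>
      <groupId>com.github.bric3.maven</groupId>
      <artifactId>cql-maven-plugin</artifactId>
      <version>0.4</version>
      <executions>
        <execution>
          <phase>pre-integration-test</phase>
          <goals>
            <!-- goal name assumed by analogy with sql-maven-plugin's
                 "execute"; see the README for the real one -->
            <goal>execute</goal>
          </goals>
        </execution>
      </executions>
    </plugin>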

I would definitely like feedback on this. It’s probably not bug-free, but
our team uses this plugin across several projects, each built several times
a day.
Currently the code lacks integration tests; that’s probably the area with
the most room for improvement.

Cheers,
— Brice


Re: Rationale for using Hazelcast in front of Cassandra?

2016-10-07 Thread Dorian Hoxha
Primary-key selects are pretty fast in RDBMSs too, and they also have caches.
By "close to", do you mean in latency?
Have you thought about why people don't use Cassandra as a cache? While it
doesn't have LRU eviction, it does have TTLs, replication, and sharding.
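
(For context, cache-style expiry in Cassandra is a per-write CQL option; the
keyspace and table below are made up for illustration:)

    -- hypothetical cache table; the row expires one hour after the write
    INSERT INTO cache_ks.entries (k, v)
    VALUES ('user:42', 'serialized-value')
    USING TTL 3600;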

On Fri, Oct 7, 2016 at 12:00 AM, KARR, DAVID  wrote:

> Clearly, with “traditional” RDBMSs, you tend to put a cache “close to” the
> client.  However, I was under the impression that Cassandra nodes could be
> positioned “close to” their clients, and Cassandra has its own cache (I
> believe), so how effective would it be to put a cache in front of a cache?
>
>
>
> *From:* Dorian Hoxha [mailto:dorian.ho...@gmail.com]
> *Sent:* Thursday, October 06, 2016 2:52 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Rationale for using Hazelcast in front of Cassandra?
>
>
>
> Maybe when you have very hot keys that can give trouble to your
> 3 (replication factor) Cassandra nodes?
>
> Example: why does Facebook use memcached? They certainly have things
> distributed across thousands of servers.
>
>
>
> On Thu, Oct 6, 2016 at 11:40 PM, KARR, DAVID  wrote:
>
> I've seen use cases that briefly describe using Hazelcast as a "front-end"
> for Cassandra, perhaps as a cache.  This seems counterintuitive to me.  Can
> someone describe to me when this kind of architecture might make sense?
>
>
>


Re: cql-maven-plugin

2016-10-07 Thread Ali Akhtar
Is there a way to call this programmatically, such as from unit tests, to
create a keyspace / table schema from a CQL file?



Re: cql-maven-plugin

2016-10-07 Thread Brice Dutheil
At the moment, no, as this is a Maven plugin. But extracting such code would
be relatively trivial.
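
For illustration, standalone code doing the same job might look like the
sketch below, using the DataStax Java driver 3.x (this is not code from the
plugin, and the naive semicolon split will break on semicolons inside string
literals):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class CqlFileRunner {
        public static void main(String[] args) throws Exception {
            String cql = new String(Files.readAllBytes(Paths.get("schema.cql")),
                                    StandardCharsets.UTF_8);
            try (Cluster cluster = Cluster.builder()
                                          .addContactPoint("127.0.0.1")
                                          .build();
                 Session session = cluster.connect()) {
                // Naive statement split; real CQL parsing is more involved.
                for (String stmt : cql.split(";")) {
                    if (!stmt.trim().isEmpty()) {
                        session.execute(stmt.trim());
                    }
                }
            }
        }
    }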

-- Brice



Re: Running Cassandra in Integration Tests

2016-10-07 Thread Eric Stevens
If you happen to be using Scala, we recently released some tooling we wrote
around CCM for integration testing:
https://github.com/protectwise/cassandra-util

You define clusters and nodes in configuration, then ask the service to go:
https://github.com/protectwise/cassandra-util/blob/master/ccm-testing-helper/src/main/scala/com/protectwise/testing/ccm/CassandraSetup.scala#L147

It'll create your clusters and tear them down automatically when execution
completes.

On Thu, Oct 6, 2016 at 11:50 PM Edward Capriolo 
wrote:

> Check out https://github.com/edwardcapriolo/farsandra. It's almost 100%
> pure Java (apart from the fact that it uses some shell to launch
> Cassandra). A usage sketch follows at the end of this thread.
>
> On Thu, Oct 6, 2016 at 7:08 PM, Ali Akhtar  wrote:
>
> Is it possible to create an isolated Cassandra instance which runs during
> integration tests and disappears after the tests have finished running?
> It would then be recreated the next time the tests run (perhaps populated
> with test data).
>
>  I'm using Java.
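
For completeness, a minimal Farsandra-based setup might look like the sketch
below. Package and method names are from memory of the project's README, so
treat them as assumptions to verify:

    import io.teknek.farsandra.Farsandra;
    import java.util.Arrays;

    public class CassandraIT {
        public static void main(String[] args) throws Exception {
            // Fluent API as recalled from the Farsandra README.
            Farsandra fs = new Farsandra();
            fs.withVersion("2.1.9");            // Cassandra version to launch
            fs.withCleanInstanceOnStart(true);  // fresh data directories per run
            fs.withInstanceName("it-1");
            fs.withCreateConfigurationFiles(true);
            fs.withHost("127.0.0.1");
            fs.withSeeds(Arrays.asList("127.0.0.1"));
            fs.start();
            // ... run tests against 127.0.0.1:9042, then stop the instance.
        }
    }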


RE: Rationale for using Hazelcast in front of Cassandra?

2016-10-07 Thread KARR, DAVID
No, I haven’t “thought why people don’t use Cassandra as a cache”; that’s why
I’m asking here.  I’m asking the community for their POV on when it might
make sense to front Cassandra with Hazelcast.  This is even mentioned as a use
case in the Hazelcast documentation (“As a front layer for a Cassandra
back-end”), and I’m aware of at least one large private enterprise that does
this.





Re: Rationale for using Hazelcast in front of Cassandra?

2016-10-07 Thread Peter Lin
Cassandra is a database, not an in-memory cache. Please don't abuse
Cassandra like that when there are plenty of existing distributed cache
products designed for that purpose.

That's like asking "why can't I drag race with a school bus?"

You could and it might be fun, but that's not what it was designed for.



Re: Rationale for using Hazelcast in front of Cassandra?

2016-10-07 Thread Benjamin Roth
@Peter: Thanks for that comment! That's pretty much what I thought when
reading the question of why not to use Cassandra as a cache.

Thoughts on putting something in front of something else:
If your real-world use case requires more performance, one option is always to
add a cache in front of it. How much overall gain you get from it
completely depends on your model, your infrastructure, your services
(Cassandra, Memcached, Hazelcast, whatever), your demands on availability,
consistency, latency, and so on.

There is no universally right or wrong answer here.

Maybe you get 50% better performance with Memcached in front of Cassandra, or
you just use ScyllaDB and throw the cache away. Or Memcached is not fail-safe
enough, or your cache needs replication; then you may need something like
Hazelcast.
Can your application model deal with caches and their invalidation? Or will
stale caches be a problem in your app?

These are the questions that should drive the decision. But in the end, every
single case is different and has to be benchmarked and analyzed separately.



-- 
Benjamin Roth
Prokurist (authorized signatory)

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


JVM safepoints, mmap, and slow disks

2016-10-07 Thread Josh Snyder
Hello cassandra-users,

I'm investigating an issue with JVMs taking a while to reach a safepoint.  I'd
like the list's input on confirming my hypothesis and finding mitigations.

My hypothesis is that slow block devices are causing Cassandra's JVM to pause
completely while attempting to reach a safepoint.

Background:

Hotspot occasionally performs maintenance tasks that necessitate stopping all
of its threads. Threads running JITed code occasionally read from a given
safepoint page. If Hotspot has initiated a safepoint, reading from that page
essentially catapults the thread into purgatory until the safepoint completes
(the mechanism behind this is pretty cool). Threads performing syscalls or
executing native code do this check upon their return into the JVM.

In this way, during the safepoint Hotspot can be sure that all of its threads
are either patiently waiting for safepoint completion or in a system call.

Cassandra makes heavy use of mmapped reads in normal operation. When doing
mmapped reads, the JVM executes userspace code to effect a read from a file. On
the fast path (when the page needed is already mapped into the process), this
instruction is very fast. When the page is not cached, the CPU triggers a page
fault and asks the OS to go fetch the page. The JVM doesn't even realize that
anything interesting is happening: to it, the thread is just executing a mov
instruction that happens to take a while.
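
To make this concrete, here is a minimal Java sketch of the kind of read in
question (illustrative only, not Cassandra's actual read path; assumes
/tmp/data.bin exists and is non-empty):

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapRead {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("/tmp/data.bin"),
                                                   StandardOpenOption.READ)) {
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // To the JVM this get() is just a memory access. If the page is
                // not resident, the CPU faults and this thread sits in the D
                // state until the kernel has read the page from disk.
                byte first = buf.get(0);
                System.out.println(first);
            }
        }
    }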

The OS, meanwhile, puts the thread in question in the D state (assuming Linux,
here) and goes off to find the desired page. This may take microseconds, this
may take milliseconds, or it may take seconds (or longer). When I/O occurs
while the JVM is trying to enter a safepoint, every thread has to wait for the
laggard I/O to complete.

If you log safepoints with the right options [1], you can see these occurrences
in the JVM output:

> # SafepointSynchronize::begin: Timeout detected:
> # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
> # SafepointSynchronize::begin: Threads which did not reach the safepoint:
> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x7f8785bb1f30 
> nid=0x4e14 runnable [0x]
>java.lang.Thread.State: RUNNABLE
>
> # SafepointSynchronize::begin: (End of list)
>  vmop[threads: total initially_running 
> wait_to_block][time: spin block sync cleanup vmop] page_trap_count
> 58099.941: G1IncCollectionPause [ 447  1  
> 1]  [  3304 0  3305 1   190]  1
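
For reference, the logging options alluded to in [1] are probably along these
lines on HotSpot/JDK 8 (flag names from memory; verify against your JVM
version):

    # Per-safepoint statistics, plus a low timeout so the "Timeout detected"
    # diagnostics above get printed for safepoints slower than one second.
    java -XX:+PrintSafepointStatistics \
         -XX:PrintSafepointStatisticsCount=1 \
         -XX:+SafepointTimeout \
         -XX:SafepointTimeoutDelay=1000 \
         -XX:+PrintGCApplicationStoppedTime \
         ...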

If that safepoint happens to be a garbage collection (which this one was), you
can also see it in GC logs:

> 2016-10-07T13:19:50.029+: 58103.440: Total time for which application 
> threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 
> seconds

In this way, JVM safepoints become a powerful weapon for transmuting a single
thread's slow I/O into the entire JVM's lockup.

Does all of the above sound correct?

Mitigations:

1) don't tolerate block devices that are slow

This is easy in theory, and only somewhat difficult in practice. Tools like
perf and iosnoop [2] can do pretty good jobs of letting you know when a block
device is slow.

It is sad, though, because this makes running Cassandra on mixed hardware (e.g.
fast SSD and slow disks in a JBOD) quite unappetizing.

2) have fewer safepoints

Two of the biggest sources of safepoints are garbage collection and revocation
of biased locks. Evidence points toward biased locking being unhelpful for
Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way
to eliminate one source of safepoints.

Garbage collection, on the other hand, is unavoidable. Running with increased
heap size would reduce GC frequency, at the cost of page cache. But sacrificing
page cache would increase page fault frequency, which is another thing we're
trying to avoid! I don't view this as a serious option.

3) use a different IO strategy

Looking at the Cassandra source code, there appears to be an un(der)documented
configuration parameter called disk_access_mode. It appears that changing this
to 'standard' would switch to using pread() and pwrite() for I/O, instead of
mmap. I imagine there would be a throughput penalty here for the case when
pages are in the disk cache.

Is this a serious option? It seems far too underdocumented to be thought of as
a contender.
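
For anyone who wants to experiment anyway, the setting goes in cassandra.yaml.
The accepted values below are from my reading of the source, so verify them
against your version:

    # Undocumented; accepted values appear to be auto, mmap,
    # mmap_index_only, and standard. 'standard' uses buffered
    # pread()-style I/O for data files instead of mmap.
    disk_access_mode: standard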

4) modify the JVM

This is a longer term option. For the purposes of safepoints, perhaps the JVM
could treat reads from an mmapped file in the same way it treats threads that
are running JNI code. That is, the safepoint will proceed even though the
reading thread has not "joined in". Upon finishing its mmapped read, the
reading thread would test the safepoint page (check whether a safepoint is in
progress, in other words).

Conclusion:

I don't imagine there's an easy solution here. I plan to go ahead with
mitigation #1: "don't tolerate block devices that are slow".

Re: JVM safepoints, mmap, and slow disks

2016-10-07 Thread Vladimir Yudovin
Hi Josh,

>Running with increased heap size would reduce GC frequency, at the cost of
>page cache.

Actually, it's recommended to run C* with swap disabled, so if there is not
enough memory the JVM fails instead of blocking.

Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.



