com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character

2015-06-24 Thread Serega Sheypak
Hi, I'm trying to use a bound query and I get a weird error:

Here is a query:

Bound query: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);


Here is the code:

PreparedStatement preparedStatement = session.prepare(composeQuery());
// composeQuery() returns: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);

BoundStatement boundStatement = new BoundStatement(preparedStatement); // EXCEPTION HERE

boundStatement.bind(UUID.randomUUID(), RandomStringUtils.random(10), 1);

session.execute(boundStatement);
If I use cqlsh and run

INSERT INTO packets (id, fingerprint, mark) VALUES (now(), 'xxx', 1);

it works.

Stacktrace:

Exception in thread "main" com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
    at com.datastax.driver.core.exceptions.SyntaxError.copy(SyntaxError.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
    at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:79)
    at stress.StressTest.runBound(StressTest.java:89)
    at stress.Main.main(Main.java:29)
Caused by: com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:101)
    at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:185)
    at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:160)
    at com.google.common.util.concurrent.Futures$1.apply(Futures.java:720)
    at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:859)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Re: Counters 2.1 Accuracy

2015-06-24 Thread Phil Yang
IMO, the main concern with C*'s counters is that they are not idempotent. For
example, if you increment a counter and get a timeout error, you cannot know
whether it succeeded. Non-counter writes are idempotent, so you can just
retry, but if you retry a counter update there may be a double count.
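To make the hazard concrete, here is a minimal sketch (keyspace, table and
column names are invented for illustration; this is not code from the thread):

```
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CounterRetryHazard {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks"); // hypothetical keyspace
        // A counter update is NOT idempotent: applying it twice adds 2, not 1.
        String inc = "UPDATE page_views SET views = views + 1 WHERE page = 'home'";
        try {
            session.execute(inc);
        } catch (WriteTimeoutException e) {
            // The coordinator timed out, but the increment may still have been
            // applied on some replicas. Retrying risks counting the same event
            // twice; an ordinary INSERT/UPDATE would just overwrite the cell.
            session.execute(inc); // <-- possible double count
        } finally {
            cluster.close();
        }
    }
}
```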

2015-06-23 12:23 GMT+08:00 Mike Trienis :

>
> Hi All,
>
> I'm fairly new to Cassandra and am planning on using it as a datastore for
> an Apache Spark cluster.
>
> The use case is fairly simple: read the raw data, perform aggregates, and
> push the rolled-up data back to Cassandra. The data models will use
> counters pretty heavily, so I'd like to understand what kind of accuracy
> I should expect from Cassandra 2.1 when incrementing counters.
>
> - http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>
> The blog post above states that the new counter implementations are
> "safer" although I'm not sure what that means in practice. Will the
> counters be 99.99% accurate? How often will they be over or under counted?
>
> Thanks, Mike.
>



-- 
Thanks,
Phil Yang


Re: com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character

2015-06-24 Thread Serega Sheypak
Sorry, misprint
// composeQuery() => INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
PreparedStatement preparedStatement = session.prepare(composeQuery()); // exception happens here!

2015-06-24 11:20 GMT+02:00 Serega Sheypak :

> Hi, I'm trying to use a bound query and I get a weird error:
>
> Here is a query:
>
> Bound query: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
>
> Here is the code:
>
> PreparedStatement preparedStatement = session.prepare(composeQuery());
> // composeQuery() returns: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
>
> BoundStatement boundStatement = new BoundStatement(preparedStatement); // EXCEPTION HERE
>
> boundStatement.bind(UUID.randomUUID(), RandomStringUtils.random(10), 1);
>
> session.execute(boundStatement);
>
> If I use cqlsh and run
>
> INSERT INTO packets (id, fingerprint, mark) VALUES (now(), 'xxx', 1);
>
> it works.
>
> Stacktrace:
>
> Exception in thread "main" com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
>     at com.datastax.driver.core.exceptions.SyntaxError.copy(SyntaxError.java:35)
>     at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>     at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:79)
>     at stress.StressTest.runBound(StressTest.java:89)
>     at stress.Main.main(Main.java:29)
> Caused by: com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
>     at com.datastax.driver.core.Responses$Error.asException(Responses.java:101)
>     at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:185)
>     at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:160)
>     at com.google.common.util.concurrent.Futures$1.apply(Futures.java:720)
>     at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:859)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>


Re: com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character

2015-06-24 Thread Serega Sheypak
omg!!!
It was some weird unprintable character. That is why the C* driver failed to
parse it.
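A cheap way to catch this class of problem early is to validate the generated
CQL before preparing it - a minimal sketch (the guard itself is illustrative,
not part of the original code):

```
// Fail fast if the CQL string contains a non-printable or non-ASCII character
// (e.g. a stray non-breaking space pasted in from a document or web page).
static String assertPrintable(String cql) {
    for (int i = 0; i < cql.length(); i++) {
        char c = cql.charAt(i);
        boolean ok = (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r' || c == '\t';
        if (!ok) {
            throw new IllegalArgumentException(String.format(
                "Non-printable char U+%04X at index %d in query: %s", (int) c, i, cql));
        }
    }
    return cql;
}

// Usage:
// PreparedStatement ps = session.prepare(assertPrintable(composeQuery()));
```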

2015-06-24 11:35 GMT+02:00 Serega Sheypak :

> Sorry, misprint
> // composeQuery() => INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
> PreparedStatement preparedStatement = session.prepare(composeQuery()); // exception happens here!
>
> 2015-06-24 11:20 GMT+02:00 Serega Sheypak :
>
>> Hi, I'm trying to use a bound query and I get a weird error:
>>
>> Here is a query:
>>
>> Bound query: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
>>
>> Here is the code:
>>
>> PreparedStatement preparedStatement = session.prepare(composeQuery());
>> // composeQuery() returns: INSERT INTO packets (id, fingerprint, mark) VALUES (?, ?, ?);
>>
>> BoundStatement boundStatement = new BoundStatement(preparedStatement); // EXCEPTION HERE
>>
>> boundStatement.bind(UUID.randomUUID(), RandomStringUtils.random(10), 1);
>>
>> session.execute(boundStatement);
>>
>> If I use cqlsh and run
>>
>> INSERT INTO packets (id, fingerprint, mark) VALUES (now(), 'xxx', 1);
>>
>> it works.
>>
>> Stacktrace:
>>
>> Exception in thread "main" com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
>>     at com.datastax.driver.core.exceptions.SyntaxError.copy(SyntaxError.java:35)
>>     at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>     at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:79)
>>     at stress.StressTest.runBound(StressTest.java:89)
>>     at stress.Main.main(Main.java:29)
>> Caused by: com.datastax.driver.core.exceptions.SyntaxError: line 1:37 no viable alternative at character ' '
>>     at com.datastax.driver.core.Responses$Error.asException(Responses.java:101)
>>     at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:185)
>>     at com.datastax.driver.core.SessionManager$1.apply(SessionManager.java:160)
>>     at com.google.common.util.concurrent.Futures$1.apply(Futures.java:720)
>>     at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:859)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>


Re: 10000+ CF support from Cassandra

2015-06-24 Thread Arun Chaitanya
Any ideas or advice?

On Mon, Jun 22, 2015 at 10:55 AM, Arun Chaitanya 
wrote:

> Hello All,
>
> Now we settled on the following approach. I want to know if there are any
> problems that you foresee in the production environment.
>
> Our Approach: Use Off-Heap Memory
>
> Modifications to the default cassandra.yaml and cassandra-env.sh:
>
>  * memory_allocator: JEMallocAllocator
>    (https://issues.apache.org/jira/browse/CASSANDRA-7883)
>  * memtable_allocation_type: offheap_objects
>
>    With the above two settings, slab allocation
>    (https://issues.apache.org/jira/browse/CASSANDRA-5935), which requires
>    1 MB of heap memory per table, is disabled. The memory for table
>    metadata, caches and memtables is thus allocated natively and does not
>    affect GC performance.
>
>  * tombstone_failure_threshold: 1
>
>    Without this, C* throws TombstoneOverwhelmingException during startup.
>    This setting looks problematic, so I want to know why just creating
>    tables makes so many tombstones ...
>
>  * -XX:+UseG1GC
>
>    This is good for reducing GC time. Without it, full GCs > 1s are
>    observed.
>
> We created 5000 column families with about 1000 entries per column family.
> The read/write performance seems to be stable. The problem we saw is with
> startup time:
>
>   Cassandra Start Time (s):  20 / 349
>   Average CPU Usage (%):     40 / 49.65
>   GC Activity (%):           2.6 / 0.6
>
> Thanks a lot in advance.
>
> On Tue, Jun 2, 2015 at 11:26 AM, graham sanderson  wrote:
>
>> > I strongly advise against this approach.
>>> Jon, I think so too. But do you actually foresee any problems with this
>>> approach?
>>> I can think of a few. [I want to evaluate if we can live with this
>>> problem]
>>>
>>>
>>> Just to be clear, I’m not saying this is a great approach; I AM saying
>> that it may be better than having 10000+ CFs, which was the original
>> question (it really depends on the use case, which wasn’t well defined)… map
>> size limit may be a problem, and then there is the CQL vs thrift question,
>> which could start a flame war; ideally CQL maps should give you the same
>> flexibility as arbitrary thrift columns
>>
>> On Jun 1, 2015, at 9:44 PM, Jonathan Haddad  wrote:
>>
>> > Sorry for this naive question but how important is this tuning? Can
>> this have a huge impact in production?
>>
>> Massive.  Here's a graph of when we did some JVM tuning at my previous
>> company:
>>
>>
>> http://33.media.tumblr.com/5d0efca7288dc969c1ac4fc3d36e0151/tumblr_inline_mzvj254quj1rd24f4.png
>>
>> About an order of magnitude difference in performance.
>>
>> Jon
>>
>> On Mon, Jun 1, 2015 at 7:20 PM Arun Chaitanya 
>> wrote:
>>
>>> Thanks Jon and Jack,
>>>
>>> > I strongly advise against this approach.
>>> Jon, I think so too. But do you actually foresee any problems with this
>>> approach?
>>> I can think of a few. [I want to evaluate if we can live with this
>>> problem]
>>>
>>>- No more CQL.
>>>- No data types, everything needs to be a blob.
>>>- Limited clustering Keys and default clustering order.
>>>
>>> > First off, different workloads need different tuning.
>>> Sorry for this naive question but how important is this tuning? Can
>>> this have a huge impact in production?
>>>
>>> > You might want to consider a model where you have an application layer
>>> that maps logical tenant tables into partition keys within a single large
>>> Cassandra table, or at least a relatively small number of Cassandra tables.
>>> It will depend on the typical size of your tenant tables - very small ones
>>> would make sense within a single partition, while larger ones should have
>>> separate partitions for a tenant's data. The key here is that tables are
>>> expensive, but partitions are cheap and scale very well with Cassandra.
>>> We are actually trying a similar approach, but we don't want to expose
>>> this to the application layer. We are attempting to hide this and provide
>>> an API.
>>>
>>> > Finally, you said "10 clusters", but did you mean 10 nodes? You might
>>> want to consider a model where you do indeed have multiple clusters, where
>>> each handles a fraction of the tenants, since there is no need for separate
>>> tenants to be on the same cluster.
>>> I meant 10 clusters. We want to split our tables across multiple
>>> clusters if the above approach is not possible. [But it seems to be very costly]
>>>
>>> Thanks,
>>>
>>> On Fri, May 29, 2015 at 5:49 AM, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>>
 How big is each of the tables - are they all fairly small or fairly
 large? Small as in no more than thousands of rows or large as in tens of
 millions or hundreds of millions of rows?

 Small tables are are not ideal for a Cassandra cluster since the rows
 would be spread out across the nodes, even though it might make more sense
 for each small table to be on a single node.

 You might want to conside

Re: Any use-case about a migration from SQL Server to Cassandra?

2015-06-24 Thread Carlos Alonso
This article from Spotify Labs is a really nice write-up of migrating SQL
(Postgres in this case) to Cassandra

Carlos Alonso | Software Engineer | @calonso 

On 23 June 2015 at 20:23, Alex Popescu  wrote:

>
> On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz  wrote:
>
>> 2- They used C# heavily in a Microsoft-based environment, so I need to
>> know if the .Net driver is ready to use for production
>
>
> The DataStax C# driver has been used in production for quite a while by
> numerous users. It is the most up-to-date, feature rich, and
> tunable C# driver for Apache Cassandra and DataStax Enterprise.
>
> Anyways, if there's anything missing we are always happy to improve it.
>
> (as you can see from my sig, I do work for DataStax, but the above is very
> true)
>
>
> --
> Bests,
>
> Alex Popescu | @al3xandru
> Sen. Product Manager @ DataStax
>
>


Re: 10000+ CF support from Cassandra

2015-06-24 Thread Jack Krupansky
By entries, do you mean rows or columns? Please clarify how many columns
each of your tables has, and how many rows you are populating for each
table.

In case I didn't make it clear earlier, limit yourself to "low hundreds"
(like 250) of tables and you should be fine. Thousands of tables is a clear
anti-pattern for Cassandra - not recommended. If it works for you, great,
but if not, don't say you weren't warned.

Disabling of slab allocation is an expert-only feature - its use is
generally an anti-pattern, not recommended.

-- Jack Krupansky

On Sun, Jun 21, 2015 at 10:55 PM, Arun Chaitanya 
wrote:

> Hello All,
>
> Now we settled on the following approach. I want to know if there are any
> problems that you foresee in the production environment.
>
> Our Approach: Use Off-Heap Memory
>
> Modifications to the default cassandra.yaml and cassandra-env.sh:
>
>  * memory_allocator: JEMallocAllocator
>    (https://issues.apache.org/jira/browse/CASSANDRA-7883)
>  * memtable_allocation_type: offheap_objects
>
>    With the above two settings, slab allocation
>    (https://issues.apache.org/jira/browse/CASSANDRA-5935), which requires
>    1 MB of heap memory per table, is disabled. The memory for table
>    metadata, caches and memtables is thus allocated natively and does not
>    affect GC performance.
>
>  * tombstone_failure_threshold: 1
>
>    Without this, C* throws TombstoneOverwhelmingException during startup.
>    This setting looks problematic, so I want to know why just creating
>    tables makes so many tombstones ...
>
>  * -XX:+UseG1GC
>
>    This is good for reducing GC time. Without it, full GCs > 1s are
>    observed.
>
> We created 5000 column families with about 1000 entries per column family.
> The read/write performance seems to be stable. The problem we saw is with
> startup time:
>
>   Cassandra Start Time (s):  20 / 349
>   Average CPU Usage (%):     40 / 49.65
>   GC Activity (%):           2.6 / 0.6
>
> Thanks a lot in advance.
>
> On Tue, Jun 2, 2015 at 11:26 AM, graham sanderson  wrote:
>
>> > I strongly advise against this approach.
>>> Jon, I think so too. But do you actually foresee any problems with this
>>> approach?
>>> I can think of a few. [I want to evaluate if we can live with this
>>> problem]
>>>
>>>
>>> Just to be clear, I’m not saying this is a great approach; I AM saying
>> that it may be better than having 10000+ CFs, which was the original
>> question (it really depends on the use case, which wasn’t well defined)… map
>> size limit may be a problem, and then there is the CQL vs thrift question,
>> which could start a flame war; ideally CQL maps should give you the same
>> flexibility as arbitrary thrift columns
>>
>> On Jun 1, 2015, at 9:44 PM, Jonathan Haddad  wrote:
>>
>> > Sorry for this naive question but how important is this tuning? Can
>> this have a huge impact in production?
>>
>> Massive.  Here's a graph of when we did some JVM tuning at my previous
>> company:
>>
>>
>> http://33.media.tumblr.com/5d0efca7288dc969c1ac4fc3d36e0151/tumblr_inline_mzvj254quj1rd24f4.png
>>
>> About an order of magnitude difference in performance.
>>
>> Jon
>>
>> On Mon, Jun 1, 2015 at 7:20 PM Arun Chaitanya 
>> wrote:
>>
>>> Thanks Jon and Jack,
>>>
>>> > I strongly advise against this approach.
>>> Jon, I think so too. But do you actually foresee any problems with this
>>> approach?
>>> I can think of a few. [I want to evaluate if we can live with this
>>> problem]
>>>
>>>- No more CQL.
>>>- No data types, everything needs to be a blob.
>>>- Limited clustering Keys and default clustering order.
>>>
>>> > First off, different workloads need different tuning.
>>> Sorry for this naive question but how important is this tuning? Can
>>> this have a huge impact in production?
>>>
>>> > You might want to consider a model where you have an application layer
>>> that maps logical tenant tables into partition keys within a single large
>>> Cassandra table, or at least a relatively small number of Cassandra tables.
>>> It will depend on the typical size of your tenant tables - very small ones
>>> would make sense within a single partition, while larger ones should have
>>> separate partitions for a tenant's data. The key here is that tables are
>>> expensive, but partitions are cheap and scale very well with Cassandra.
>>> We are actually trying a similar approach, but we don't want to expose
>>> this to the application layer. We are attempting to hide this and provide
>>> an API.
>>>
>>> > Finally, you said "10 clusters", but did you mean 10 nodes? You might
>>> want to consider a model where you do indeed have multiple clusters, where
>>> each handles a fraction of the tenants, since there is no need for separate
>>> tenants to be on the same cluster.
>>> I meant 10 clusters. We want to split our tables across multiple
>>> clusters if the above approach is not possible. [But it seems to be very costly]
>>>
>>> Thanks,
>>>
>>> On Fri, May 

InvalidQueryException: Invalid amount of bind variables

2015-06-24 Thread Eax Melanhovich
Hello.

I'm having some problems with the Cassandra driver for Java.

Here is a simple Scala project:

https://github.com/afiskon/scala-cassandra-example

When I run it I get the following output:

http://paste.ubuntu.com/11767987/

As I understand it, this piece of code:

```
private val id = "id"
private val description = "description"

QB.insertInto(table)
  .value(id, dto.id)
  .value(description, dto.descr)
  .getQueryString
```

... generates the query string:

INSERT INTO todo_list(id,description) VALUES (1,?)

But I can't figure out why the second value is missing.

What am I doing wrong?

-- 
Best regards,
Eax Melanhovich
http://eax.me/


Re: InvalidQueryException: Invalid amount of bind variables

2015-06-24 Thread Eax Melanhovich
OK, I discovered that passing a Statement instead of a string to the
executeAsync method solves the problem:

https://github.com/afiskon/scala-cassandra-example/commit/4f3f30597a4df340f739e4ec53ec9ee3d87da495

Still, according to the documentation for the getQueryString method, the
described problem should be considered a bug, right?
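For reference, a minimal sketch of the working pattern (sample values are
invented, and an open Session named session is assumed): the Insert built by
QueryBuilder is itself a Statement that carries its values, so executing it
directly sends the values that getQueryString() rendered as '?':

```
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Build the insert once; the attached values travel with the Statement.
Insert insert = QueryBuilder.insertInto("todo_list")
        .value("id", 1)
        .value("description", "write the docs"); // sample values

// Executing the Statement keeps the values paired with their markers.
ResultSetFuture future = session.executeAsync(insert);

// By contrast, session.executeAsync(insert.getQueryString()) throws the
// attached values away, leaving an unbound '?' behind, which is exactly
// the "Invalid amount of bind variables" error.
```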

On Wed, 24 Jun 2015 17:35:22 +0300
Eax Melanhovich  wrote:

> Hello.
> 
> I'm having some problems with the Cassandra driver for Java.
> 
> Here is a simple Scala project:
> 
> https://github.com/afiskon/scala-cassandra-example
> 
> When I run it I get the following output:
> 
> http://paste.ubuntu.com/11767987/
> 
> As I understand it, this piece of code:
> 
> ```
> private val id = "id"
> private val description = "description"
> 
> QB.insertInto(table)
>   .value(id, dto.id)
>   .value(description, dto.descr)
>   .getQueryString
> ```
> 
> ... generates the query string:
> 
> INSERT INTO todo_list(id,description) VALUES (1,?)
> 
> But I can't figure out why the second value is missing.
> 
> What am I doing wrong?
> 



-- 
Best regards,
Eax Melanhovich
http://eax.me/


Re: 10000+ CF support from Cassandra

2015-06-24 Thread Arun Chaitanya
Hi Jack,

By entries, I meant rows. Each column family has about 200 columns.

> Disabling of slab allocation is an expert-only feature - its use is
generally an anti-pattern, not recommended.
I understand this and have seen this recommendation in several places, but I
want to understand the consequences. Is it performance, maintenance, or
scalability that is at stake?

In our use case we have about 3000 column families (of course, modelled in an
RDBMS). If we were to limit ourselves to 250 column families, would you advise
us to use multiple clusters (the problem being cost-effectiveness)?

If we were to use a single cluster and support 3000 column families, the
only idea is to group a few column families and store them in one column
family. In this case, grouping is a difficult task, imo. And if we want an
abstraction of grouping for developers, we need a special connector for
Hadoop/Spark systems. So I do not want to enter this territory.

Sorry for such questions, but I am still wondering if I am the only one
facing this problem.

Thanks a lot,
Arun



On Wed, Jun 24, 2015 at 10:28 PM, Jack Krupansky 
wrote:

> By entries, do you mean rows or columns? Please clarify how many columns
> each of your tables has, and how many rows you are populating for each
> table.
>
> In case I didn't make it clear earlier, limit yourself to "low hundreds"
> (like 250) of tables and you should be fine. Thousands of tables is a clear
> anti-pattern for Cassandra - not recommended. If it works for you, great,
> but if not, don't say you weren't warned.
>
> Disabling of slab allocation is an expert-only feature - its use is
> generally an anti-pattern, not recommended.
>
> -- Jack Krupansky
>
> On Sun, Jun 21, 2015 at 10:55 PM, Arun Chaitanya 
> wrote:
>
>> Hello All,
>>
>> Now we settled on the following approach. I want to know if there are any
>> problems that you foresee in the production environment.
>>
>> Our Approach: Use Off-Heap Memory
>>
>> Modifications to the default cassandra.yaml and cassandra-env.sh:
>>
>>  * memory_allocator: JEMallocAllocator
>>    (https://issues.apache.org/jira/browse/CASSANDRA-7883)
>>  * memtable_allocation_type: offheap_objects
>>
>>    With the above two settings, slab allocation
>>    (https://issues.apache.org/jira/browse/CASSANDRA-5935), which requires
>>    1 MB of heap memory per table, is disabled. The memory for table
>>    metadata, caches and memtables is thus allocated natively and does not
>>    affect GC performance.
>>
>>  * tombstone_failure_threshold: 1
>>
>>    Without this, C* throws TombstoneOverwhelmingException during startup.
>>    This setting looks problematic, so I want to know why just creating
>>    tables makes so many tombstones ...
>>
>>  * -XX:+UseG1GC
>>
>>    This is good for reducing GC time. Without it, full GCs > 1s are
>>    observed.
>>
>> We created 5000 column families with about 1000 entries per column family.
>> The read/write performance seems to be stable. The problem we saw is with
>> startup time:
>>
>>   Cassandra Start Time (s):  20 / 349
>>   Average CPU Usage (%):     40 / 49.65
>>   GC Activity (%):           2.6 / 0.6
>>
>> Thanks a lot in advance.
>>
>> On Tue, Jun 2, 2015 at 11:26 AM, graham sanderson 
>> wrote:
>>
>>> > I strongly advise against this approach.
 Jon, I think so too. But do you actually foresee any problems with this
 approach?
 I can think of a few. [I want to evaluate if we can live with this
 problem]


 Just to be clear, I’m not saying this is a great approach; I AM saying
>>> that it may be better than having 10000+ CFs, which was the original
>>> question (it really depends on the use case, which wasn’t well defined)… map
>>> size limit may be a problem, and then there is the CQL vs thrift question,
>>> which could start a flame war; ideally CQL maps should give you the same
>>> flexibility as arbitrary thrift columns
>>>
>>> On Jun 1, 2015, at 9:44 PM, Jonathan Haddad  wrote:
>>>
>>> > Sorry for this naive question but how important is this tuning? Can
>>> this have a huge impact in production?
>>>
>>> Massive.  Here's a graph of when we did some JVM tuning at my previous
>>> company:
>>>
>>>
>>> http://33.media.tumblr.com/5d0efca7288dc969c1ac4fc3d36e0151/tumblr_inline_mzvj254quj1rd24f4.png
>>>
>>> About an order of magnitude difference in performance.
>>>
>>> Jon
>>>
>>> On Mon, Jun 1, 2015 at 7:20 PM Arun Chaitanya 
>>> wrote:
>>>
 Thanks Jon and Jack,

 > I strongly advise against this approach.
 Jon, I think so too. But do you actually foresee any problems with this
 approach?
 I can think of a few. [I want to evaluate if we can live with this
 problem]

- No more CQL.
- No data types, everything needs to be a blob.
- Limited clustering Keys and default clustering order.

 > First off, different workloads need different tuning.
 Sorry for this naive question but how important is this tuning? Can

Adding Nodes With Inconsistent Data

2015-06-24 Thread Anuj Wadehra
Hi,


We faced a scenario where we lost a small amount of data after adding 2 nodes
to the cluster. There were intermittent dropped mutations in the cluster. I
need to verify my understanding of how this may have happened in order to do
Root Cause Analysis:


Scenario: 3 nodes, RF=3, Read / Write CL= Quorum


1. Due to an overloaded cluster, some writes happened on only 2 nodes: node 1 &
node 2, while the asynchronous mutations were dropped on node 3.

So say key K with token T was not written to node 3.


2. I added node 4, and suppose as per the newly calculated ranges, token T is
now supposed to have replicas on node 1, node 3, and node 4. Unfortunately node 4
started bootstrapping from node 3, where key K was missing.


3. After the recommended 2 min gap, I added node 5, and as per the new token
distribution suppose token T is now supposed to have replicas on node 3, node 4
and node 5. Again node 5 bootstrapped from node 3, where the data was missing.


So now key K is lost, and that's how we lost a few rows.


Moreover, in step 1 the situation could be worse: we can also have a scenario
where some writes happened on only one of three replicas, and Cassandra chose
replicas where this data is missing for streaming ranges to the 2 new nodes.


Am I making sense?


We are using C* 2.0.3.


Thanks

Anuj




Sent from Yahoo Mail on Android



Re: 10000+ CF support from Cassandra

2015-06-24 Thread Jack Krupansky
I would say that it's mostly a performance issue, tied to memory
management, but the main problem is that a large number of tables invites a
whole host of cluster management difficulties that require... expert
attention, which then means you need an expert to maintain and enhance it.

Cassandra scales in two ways: number of rows and number of nodes, but not
number of tables. Both number of tables and number of columns per row need
to be kept moderate for your cluster to be manageable and perform well.

Adding a tenant ID to your table partition key is the optimal approach to
multi-tenancy at this stage with Cassandra. That, and maybe also assigning
subsets of the tenants to different tables, as well as having separate
clusters if your number of tenants and rows gets too large.
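As a concrete sketch of that shape (keyspace, table and column names are all
invented for illustration, and an open Session named session is assumed), the
tenant ID leads the partition key so each tenant's data lands in its own
partitions inside one shared table:

```
// One shared physical table instead of one table per tenant. The composite
// partition key (tenant_id, table_name) keeps each logical per-tenant table
// in its own partitions; row_key orders rows within a partition.
session.execute(
    "CREATE TABLE IF NOT EXISTS shared_ks.tenant_data ("
  + "  tenant_id  text,"
  + "  table_name text,"
  + "  row_key    text,"
  + "  payload    blob,"
  + "  PRIMARY KEY ((tenant_id, table_name), row_key))");

// Every read and write is then scoped by tenant + logical table:
session.execute(
    "SELECT row_key, payload FROM shared_ks.tenant_data"
  + " WHERE tenant_id = ? AND table_name = ?",
    "acme", "orders");
```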

-- Jack Krupansky

On Wed, Jun 24, 2015 at 11:55 AM, Arun Chaitanya 
wrote:

> Hi Jack,
>
> By entries, I meant rows. Each column family has about 200
> columns.
>
> > Disabling of slab allocation is an expert-only feature - its use is
> generally an anti-pattern, not recommended.
> I understand this and have seen this recommendation in several places, but I
> want to understand the consequences. Is it performance, maintenance, or
> scalability that is at stake?
>
> In our use case we have about 3000 column families (of course, modelled in
> an RDBMS). If we were to limit ourselves to 250 column families, would you
> advise us to use multiple clusters (the problem being cost-effectiveness)?
>
> If we were to use a single cluster and support 3000 column families, the
> only idea is to group a few column families and store them in one column
> family. In this case, grouping is a difficult task, imo. And if we want an
> abstraction of grouping for developers, we need a special connector for
> Hadoop/Spark systems. So I do not want to enter this territory.
>
> Sorry for such questions, but I am still wondering if I am the only one
> facing this problem.
>
> Thanks a lot,
> Arun
>
>
>
> On Wed, Jun 24, 2015 at 10:28 PM, Jack Krupansky  > wrote:
>
>> By entries, do you mean rows or columns? Please clarify how many columns
>> each of your tables has, and how many rows you are populating for each
>> table.
>>
>> In case I didn't make it clear earlier, limit yourself to "low hundreds"
>> (like 250) of tables and you should be fine. Thousands of tables is a clear
>> anti-pattern for Cassandra - not recommended. If it works for you, great,
>> but if not, don't say you weren't warned.
>>
>> Disabling of slab allocation is an expert-only feature - its use is
>> generally an anti-pattern, not recommended.
>>
>> -- Jack Krupansky
>>
>> On Sun, Jun 21, 2015 at 10:55 PM, Arun Chaitanya > > wrote:
>>
>>> Hello All,
>>>
>>> Now we settled on the following approach. I want to know if there are
>>> any problems that you foresee in the production environment.
>>>
>>> Our Approach: Use Off-Heap Memory
>>>
>>> Modifications to the default cassandra.yaml and cassandra-env.sh:
>>>
>>>  * memory_allocator: JEMallocAllocator
>>>    (https://issues.apache.org/jira/browse/CASSANDRA-7883)
>>>  * memtable_allocation_type: offheap_objects
>>>
>>>    With the above two settings, slab allocation
>>>    (https://issues.apache.org/jira/browse/CASSANDRA-5935), which requires
>>>    1 MB of heap memory per table, is disabled. The memory for table
>>>    metadata, caches and memtables is thus allocated natively and does not
>>>    affect GC performance.
>>>
>>>  * tombstone_failure_threshold: 1
>>>
>>>    Without this, C* throws TombstoneOverwhelmingException during startup.
>>>    This setting looks problematic, so I want to know why just creating
>>>    tables makes so many tombstones ...
>>>
>>>  * -XX:+UseG1GC
>>>
>>>    This is good for reducing GC time. Without it, full GCs > 1s are
>>>    observed.
>>>
>>> We created 5000 column families with about 1000 entries per column
>>> family. The read/write performance seems to be stable. The problem we
>>> saw is with startup time:
>>>
>>>   Cassandra Start Time (s):  20 / 349
>>>   Average CPU Usage (%):     40 / 49.65
>>>   GC Activity (%):           2.6 / 0.6
>>>
>>> Thanks a lot in advance.
>>>
>>> On Tue, Jun 2, 2015 at 11:26 AM, graham sanderson 
>>> wrote:
>>>
 > I strongly advise against this approach.
> Jon, I think so too. But do you actually foresee any problems with
> this approach?
> I can think of a few. [I want to evaluate if we can live with this
> problem]
>
>
> Just to be clear, I’m not saying this is a great approach; I AM saying
 that it may be better than having 10000+ CFs, which was the original
 question (it really depends on the use case, which wasn’t well defined)… map
 size limit may be a problem, and then there is the CQL vs thrift question,
 which could start a flame war; ideally CQL maps should give you the same
 flexibility as arbitrary thrift columns

 On Jun 1, 2015, at 9:44 PM, Jonathan Haddad  wrote:


Re: [MASSMAIL]Re: Any use-case about a migration from SQL Server to Cassandra?

2015-06-24 Thread Marcos Ortiz

Where is the link, Carlos?


On 24/06/15 07:18, Carlos Alonso wrote:
This article from Spotify Labs is a really nice write up of migrating 
SQL (Postgres in this case) to Cassandra


Carlos Alonso | Software Engineer | @calonso 

On 23 June 2015 at 20:23, Alex Popescu wrote:



On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz <mlor...@uci.cu> wrote:

2- They used C# heavily in a Microsoft-based environment, so I
need to know if the .Net driver is ready to use for production


The DataStax C# driver has been used in production for quite a
while by numerous users. It is the most up-to-date, feature rich, and
tunable C# driver for Apache Cassandra and DataStax Enterprise.

Anyways, if there's anything missing we are always happy to
improve it.

(as you can see from my sig, I do work for DataStax, but the above
is very true)


-- 
Bests,


Alex Popescu | @al3xandru
Sen. Product Manager @ DataStax




--
Marcos Ortiz, Sr. Product Manager (Data Infrastructure) at UCI
@marcosluis2186


Re: Adding Nodes With Inconsistent Data

2015-06-24 Thread Alain RODRIGUEZ
It looks to me like that can indeed happen, theoretically (I might be wrong).

However,

- Hinted Handoff tends to remove this issue; if this is a big worry, you
might want to make sure HH is enabled and well tuned
- Read Repairs (synchronous or not) might have mitigated things also, if you
read fresh data. You can set this to higher values.
- After an outage, you should always run a nodetool repair on the node that
went down - following the best practices, or because you understand the
reasons - or just trust HH if it is enough for you (see the snippet below).
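For reference, a sketch of the relevant knobs (the values shown are the usual
2.0 defaults; check your own cassandra.yaml):

```
# cassandra.yaml - hints only cover outages shorter than the hint window:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # 3 hours; longer outages need a repair

# After a node has been down past the hint window (and before growing the
# cluster), repair it:
#   nodetool repair -pr <keyspace>
```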

So I would say that you can always "shoot yourself in the foot", whatever
you do; following best practices or understanding the internals is the
key imho.

I would say it is a good question though.

Alain.



2015-06-24 19:43 GMT+02:00 Anuj Wadehra :

> Hi,
>
> We faced a scenario where we lost a small amount of data after adding 2 nodes
> to the cluster. There were intermittent dropped mutations in the cluster. I
> need to verify my understanding of how this may have happened in order to do
> Root Cause Analysis:
>
> Scenario: 3 nodes, RF=3, Read / Write CL= Quorum
>
> 1. Due to an overloaded cluster, some writes happened on only 2 nodes: node 1
> & node 2, while the asynchronous mutations were dropped on node 3.
> So say key K with token T was not written to node 3.
>
> 2. I added node 4 and suppose as per newly calculated ranges, now token T
> is supposed to have replicas on node 1, node 3, and node 4. Unfortunately
> node 4 started bootstrapping from node 3 where key K was missing.
>
> 3. After the recommended 2 min gap, I added node 5, and as per the new token
> distribution suppose token T is now supposed to have replicas on node 3,
> node 4 and node 5. Again node 5 bootstrapped from node 3, where the data was
> missing.
>
> So now key K is lost, and that's how we lost a few rows.
>
> Moreover, in step 1 the situation could be worse: we can also have a scenario
> where some writes happened on only one of three replicas, and Cassandra chose
> replicas where this data is missing for streaming ranges to the 2 new
> nodes.
>
> Am I making sense?
>
> We are using C* 2.0.3.
>
> Thanks
> Anuj
>
>
>
> Sent from Yahoo Mail on Android
> 
>


Re: [MASSMAIL]Re: Any use-case about a migration from SQL Server to Cassandra?

2015-06-24 Thread Paulo Ricardo Motta Gomes
https://labs.spotify.com/2015/06/23/user-database-switch/

On Wed, Jun 24, 2015 at 5:57 PM, Marcos Ortiz  wrote:

>  Where is the link, Carlos?
>
>
> On 24/06/15 07:18, Carlos Alonso wrote:
>
> This article from Spotify Labs is a really nice write up of migrating SQL
> (Postgres in this case) to Cassandra
>
>  Carlos Alonso | Software Engineer | @calonso
> 
>
> On 23 June 2015 at 20:23, Alex Popescu  wrote:
>
>>
>> On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz  wrote:
>>
>>> 2- They used C# heavily in a Microsoft-based environment, so I need to
>>> know if the .Net driver is ready to use for production
>>
>>
>> The DataStax C# driver has been used in production for quite a while by
>> numerous users. It is the most up-to-date, feature rich, and
>> tunable C# driver for Apache Cassandra and DataStax Enterprise.
>>
>>  Anyways, if there's anything missing we are always happy to improve it.
>>
>>  (as you can see from my sig, I do work for DataStax, but the above is
>> very true)
>>
>>
>>  --
>>Bests,
>>
>> Alex Popescu | @al3xandru
>> Sen. Product Manager @ DataStax
>>
>>
>
> --
> Marcos Ortiz , Sr. Product Manager (Data
> Infrastructure) at UCI
> @marcosluis2186 
>
>
>


-- 
Paulo Motta
Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200


Re: [MASSMAIL]Re: Any use-case about a migration from SQL Server to Cassandra?

2015-06-24 Thread Alain RODRIGUEZ
I guess it is this one, enjoy it:
https://labs.spotify.com/2015/06/23/user-database-switch/ :-)

2015-06-24 22:57 GMT+02:00 Marcos Ortiz :

>  Where is the link, Carlos?
>
>
> On 24/06/15 07:18, Carlos Alonso wrote:
>
> This article from Spotify Labs is a really nice write up of migrating SQL
> (Postgres in this case) to Cassandra
>
>  Carlos Alonso | Software Engineer | @calonso
> 
>
> On 23 June 2015 at 20:23, Alex Popescu  wrote:
>
>>
>> On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz  wrote:
>>
>>> 2- They used C# heavily in a Microsoft-based environment, so I need to
>>> know if the .Net driver is ready to use for production
>>
>>
>> The DataStax C# driver has been used in production for quite a while by
>> numerous users. It is the most up-to-date, feature rich, and
>> tunable C# driver for Apache Cassandra and DataStax Enterprise.
>>
>>  Anyways, if there's anything missing we are always happy to improve it.
>>
>>  (as you can see from my sig, I do work for DataStax, but the above is
>> very true)
>>
>>
>>  --
>>Bests,
>>
>> Alex Popescu | @al3xandru
>> Sen. Product Manager @ DataStax
>>
>>
>
> --
> Marcos Ortiz , Sr. Product Manager (Data
> Infrastructure) at UCI
> @marcosluis2186 
>
>
>


Re: Adding Nodes With Inconsistent Data

2015-06-24 Thread Jake Luciani
This is no longer an issue in 2.1.
https://issues.apache.org/jira/browse/CASSANDRA-2434

We now make sure the replica we bootstrap from is the one that will no
longer own that range

On Wed, Jun 24, 2015 at 4:58 PM, Alain RODRIGUEZ  wrote:

> It looks to me like that can indeed happen, theoretically (I might be wrong).
>
> However,
>
> - Hinted Handoff tends to remove this issue; if this is a big worry, you
> might want to make sure HH is enabled and well tuned
> - Read Repairs (synchronous or not) might have mitigated things also, if
> you read fresh data. You can set this to higher values.
> - After an outage, you should always run a nodetool repair on the node
> that went down - following the best practices, or because you understand
> the reasons - or just trust HH if it is enough for you.
>
> So I would say that you can always "shoot yourself in the foot", whatever
> you do; following best practices or understanding the internals is the
> key imho.
>
> I would say it is a good question though.
>
> Alain.
>
>
>
> 2015-06-24 19:43 GMT+02:00 Anuj Wadehra :
>
>> Hi,
>>
>> We faced a scenario where we lost a small amount of data after adding 2
>> nodes to the cluster. There were intermittent dropped mutations in the
>> cluster. I need to verify my understanding of how this may have happened in
>> order to do Root Cause Analysis:
>>
>> Scenario: 3 nodes, RF=3, Read / Write CL= Quorum
>>
>> 1. Due to an overloaded cluster, some writes happened on only 2 nodes: node
>> 1 & node 2, while the asynchronous mutations were dropped on node 3.
>> So say key K with token T was not written to node 3.
>>
>> 2. I added node 4 and suppose as per newly calculated ranges, now token T
>> is supposed to have replicas on node 1, node 3, and node 4. Unfortunately
>> node 4 started bootstrapping from node 3 where key K was missing.
>>
>> 3. After the recommended 2 min gap, I added node 5, and as per the new token
>> distribution suppose token T is now supposed to have replicas on node 3,
>> node 4 and node 5. Again node 5 bootstrapped from node 3, where the data was
>> missing.
>>
>> So now key K is lost, and that's how we lost a few rows.
>>
>> Moreover, in step 1 the situation could be worse: we can also have a scenario
>> where some writes happened on only one of three replicas, and Cassandra
>> chose replicas where this data is missing for streaming ranges to the 2 new
>> nodes.
>>
>> Am I making sense?
>>
>> We are using C* 2.0.3.
>>
>> Thanks
>> Anuj
>>
>>
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>
>


-- 
http://twitter.com/tjake


Read is slower in 2.1.6 than 2.0.14?

2015-06-24 Thread Zhiyan Shao
Hi,

We recently tested read performance on both versions and found reads are
slower in 2.1.6. Here is our setup:

1. Machines: 3 physical hosts. Each node has 24 cores CPU, 256G memory and
8x600GB SAS disks with raid 1.
2. The replication factor is 3, and a billion rows of data were inserted.
3. Key cache capacity is increased to 50G on each node.
4. Keep querying the same set of a million partition keys in a loop.

Result:
For 2.0.14 we can get an average of 6 ms, while for 2.1.6 we can only get
18 ms.

It seems the key cache hit rate of 0.011 is pretty low even though the same
set of keys was used. Has anybody done similar read performance testing?
Could you share your results?

Thanks,
Zhiyan


Range not found after nodetool decommission

2015-06-24 Thread 曹志富
ERROR [OptionalTasks:1] 2015-06-25 08:56:19,156 CassandraDaemon.java:223 -
Exception in thread Thread[OptionalTasks:1,5,main]
java.lang.AssertionError: -110036444293069784 not found in
--
Ranger Tsao


Re: Read is slower in 2.1.6 than 2.0.14?

2015-06-24 Thread Alain RODRIGUEZ
I am amazed to see that you don't have OOM with this setup...

1 - For performance, and given Cassandra's replication properties and I/O
usage, you might want to try RAID 0. But I imagine this is a tradeoff.

2 - A billion rows is quite a few, and any of your nodes takes the full load.
You might want to try RF 2 and CL ONE if performance is what you are
looking for.

3 - Using 50 GB of key cache is something I never saw and can't be good,
since afaik the key cache is on heap and you don't really want a heap bigger
than 8 GB (or 10/12 GB for some cases). Try with the default heap size and
key cache.

4 - Are you querying the whole set at once? You might want to query rows one
by one, maybe in a synchronous way, to get back pressure (see the sketch
below).
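A minimal sketch of point 4 (table and column names are invented, and the
session and the key set are assumed to exist): issuing the reads
synchronously, one at a time, gives natural back pressure because each query
waits for the previous one to finish:

```
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;

// Prepare once, then loop synchronously: at most one request in flight.
PreparedStatement ps = session.prepare(
    "SELECT * FROM my_ks.my_table WHERE pk = ?");
for (String key : keys) {               // the million-key set
    Row row = session.execute(ps.bind(key)).one();
    // process row ...
}
```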

Another question: did you use the native protocol, or thrift?
(http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)

BTW, interesting benchmark, but having the right conf is important. Also
you might want to go to 2.1.7, which mainly fixes a memory leak afaik.

C*heers,

Alain
On 25 June 2015 at 01:23, "Zhiyan Shao"  wrote:

> Hi,
>
> We recently tested read performance on both versions and found reads are
> slower in 2.1.6. Here is our setup:
>
> 1. Machines: 3 physical hosts. Each node has 24 cores CPU, 256G memory and
> 8x600GB SAS disks with raid 1.
> 2. The replication factor is 3, and a billion rows of data were inserted.
> 3. Key cache capacity is increased to 50G on each node.
> 4. Keep querying the same set of a million partition keys in a loop.
>
> Result:
> For 2.0.14 we can get an average of 6 ms, while for 2.1.6 we can only get
> 18 ms.
>
> It seems the key cache hit rate of 0.011 is pretty low even though the same
> set of keys was used. Has anybody done similar read performance testing?
> Could you share your results?
>
> Thanks,
> Zhiyan
>