Reply:

2014-12-24 Thread 鄢来琼
Yeah, I have the same question.
My workaround is not to delete the row, but to insert the correct row into a new table.

Thanks & Regards,
Peter YAN

From: Sávio S. Teles de Oliveira [mailto:savio.te...@cuia.com.br]
Sent: August 26, 2014 4:25
To: user@cassandra.apache.org
Subject:


We're using Cassandra 2.0.9 with the DataStax Java driver 2.0.0 in a
cluster of eight nodes.

We do an insert followed by a delete like:

delete from column_family_name where id = value

We then immediately run a select to check whether the DELETE was
successful. Sometimes the value is still there!



Any suggestions?

--
Regards,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Master's student in Computer Science - UFG
Software Architect
CUIA Internet Brasil


Re: [Merging data from memtables and 1 sstables] takes too much time.

2014-12-24 Thread nitin padalia
Is merging a costly operation with wide rows?
On Dec 10, 2014 5:53 PM, "nitin padalia"  wrote:

> I am using a schema like below:
>
> CREATE TABLE user_location_map (
> store_id uuid,
> location_id uuid,
> user_serial_number text,
> userobjectid uuid,
> PRIMARY KEY ((store_id, location_id), user_serial_number)
> ) WITH CLUSTERING ORDER BY (user_serial_number ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'min_threshold': '4', 'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32'}
> AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99.0PERCENTILE';
>
> Where I run a query like:
> select * from  user_location_map where store_id =
> 17b73358-79e6-11e4-bfd4-0050568aa211 and location_id =
> 2c269ea4-dbfd-32dd-9bd7-a5c22677d18b and user_serial_number =
> 'uI2201';
>
> Sometimes queries like the one above complete in 3-4 milliseconds, but
> occasionally they take around 80-90 milliseconds. The data is around 1
> million, distributed across 5 nodes with RF 3.
>
> Tracing shows that every time, most of the time is consumed by:
> Merging data from memtables and 1 sstables
>
> What could be the reason that this sometimes takes so long, while the rest
> of the time it's fast?
>


[Cassandra] [Generation of SStableLoader slow]

2014-12-24 Thread 严超
Hi, Everyone:

I'm importing a CSV file into Cassandra using sstableloader, following
the example here:
https://github.com/yukim/cassandra-bulkload-example/

Although the streaming of the SSTables is very fast, I find that
generating the SSTables is quite slow for very large files (CSV, 4 GB+). I am
using a dual-core computer with 2 GB of RAM. Could it be because of the
system spec, or is some other factor involved?

Thank you for any advice.

Best Regards!

Chao Yan
My twitter: Andy Yan @yanchao727
My Weibo: http://weibo.com/herewearenow


Re: [Cassandra] [Generation of SStableLoader slow]

2014-12-24 Thread Ryan Svihla
I think even copying files that large with plain cp would be slow on that
machine. Cassandra isn't doing anything amazingly strange here; you don't
have a lot of RAM or CPU, and I'm assuming the underlying disk is slow as
well. Without more parameters and details it's hard to tell whether there is
an issue.

On Dec 24, 2014 7:36 AM, "严超"  wrote:

> Hi, Everyone:
>
> I'm importing a CSV file into Cassandra using SStableLoader. And I'm
> following the example here:
> https://github.com/yukim/cassandra-bulkload-example/
>
> But, Even though the streaming of SSTables is very fast , I find that
> generation of SStables is quite slow for very large files (CSV, 4GB+). I am
> using a Dual Core computer with 2 GB ram. Could it be because of the system
> spec or any other factor?
>
> Thank you for any advice.
>
> *Best Regards!*
>
>
> *Chao Yan--**My twitter:Andy Yan @yanchao727
> *
>
>
> *My Weibo:http://weibo.com/herewearenow
> --*
>


Re: [Merging data from memtables and 1 sstables] takes too much time.

2014-12-24 Thread Ryan Svihla
Is the underlying disk a spinning disk? If so, 80-90 milliseconds would be
about right for a cold (non-cached) read; the fast reads are likely served
from the buffer cache or purely from the memtable.

On Wed, Dec 24, 2014 at 5:32 AM, nitin padalia 
wrote:

> Is merging costly operation with wide rows?
> On Dec 10, 2014 5:53 PM, "nitin padalia"  wrote:
>
>> I am using a schema like below:
>>
>> CREATE TABLE user_location_map (
>> store_id uuid,
>> location_id uuid,
>> user_serial_number text,
>> userobjectid uuid,
>> PRIMARY KEY ((store_id, location_id), user_serial_number)
>> ) WITH CLUSTERING ORDER BY (user_serial_number ASC)
>> AND bloom_filter_fp_chance = 0.01
>> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>> AND comment = ''
>> AND compaction = {'min_threshold': '4', 'class':
>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>> 'max_threshold': '32'}
>> AND compression = {'sstable_compression':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>> AND dclocal_read_repair_chance = 0.1
>> AND default_time_to_live = 0
>> AND gc_grace_seconds = 864000
>> AND max_index_interval = 2048
>> AND memtable_flush_period_in_ms = 0
>> AND min_index_interval = 128
>> AND read_repair_chance = 0.0
>> AND speculative_retry = '99.0PERCENTILE';
>>
>> Where I run a query like:
>> select * from  user_location_map where store_id =
>> 17b73358-79e6-11e4-bfd4-0050568aa211 and location_id =
>> 2c269ea4-dbfd-32dd-9bd7-a5c22677d18b and user_serial_number =
>> 'uI2201';
>>
>> some times queries like above complete in 3-4 milliseconds, however
>> few times they take around 80-90 milliseconds. The data is around 1
>> million distributed in 5 nodes with RF 3.
>>
>> Tacing shows every time most time is consumed by:
>> Merging data from memtables and 1 sstables
>>
>> What could the reason that some times this take too long, however rest
>> of the time its fast.
>>
>


-- 

Ryan Svihla

Solution Architect


DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: Reply:

2014-12-24 Thread Ryan Svihla
Every time but one that I've heard of this, the cause has been clock skew (the
one exception was swallowed exceptions). However, it may simply be that your
test is prone to race conditions (a delete followed by an immediate select at
a low consistency level). Without more detail it's hard to say.

I'd check the nodes for time skew by running ntpdate on each node, and make
sure ntpd is pointing to the same servers.
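
To illustrate the clock-skew case: Cassandra resolves conflicting writes to the same cell by write timestamp, so a delete stamped earlier than the insert it follows in real time will lose. The Python sketch below is a simplified model, not Cassandra code. The `Cell` class, the `stamp` and `reconcile` helpers, and the 500 ms skew are all assumptions made for illustration, and it assumes each write is stamped by the clock of the node that coordinates it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None models a tombstone (a delete marker)
    timestamp_us: int     # write timestamp in microseconds


def stamp(value: Optional[str], real_time_us: int, clock_skew_us: int) -> Cell:
    """Stamp a write with the (possibly skewed) clock of its coordinator."""
    return Cell(value, real_time_us + clock_skew_us)


def reconcile(a: Cell, b: Cell) -> Cell:
    """Last write wins: keep the cell with the higher timestamp.

    On an exact tie this model lets the tombstone win.
    """
    if a.timestamp_us != b.timestamp_us:
        return a if a.timestamp_us > b.timestamp_us else b
    return a if a.value is None else b


# The insert is coordinated by a node whose clock runs 500 ms fast.
insert = stamp("row", real_time_us=1_000_000, clock_skew_us=500_000)
# The delete happens 10 ms later in real time, on a node with an accurate clock.
delete = stamp(None, real_time_us=1_010_000, clock_skew_us=0)

survivor = reconcile(insert, delete)
print("row still visible" if survivor.value is not None else "row deleted")
```

With synchronized clocks the same two writes reconcile to the tombstone, which is why checking ntpd on every node is a sensible first step.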

On Wed, Dec 24, 2014 at 2:53 AM, 鄢来琼  wrote:

>  Yeah, I also have the question.
>
> My solution is not delete the row, but insert the right row to a new table.
>
>
>
> Thanks & Regards,
>
> *Peter YAN*
>
>
>
> *From:* Sávio S. Teles de Oliveira [mailto:savio.te...@cuia.com.br]
> *Sent:* August 26, 2014 4:25
> *To:* user@cassandra.apache.org
> *Subject:*
>
>
>
> We're using cassandra 2.0.9 with datastax java cassandra driver 2.0.0 in a
> cluster of eight nodes.
>
> We're doing an insert and after a delete like:
>
> delete from *column_family_name* where *id* = value
>
> Immediatly select to check whether the DELETE was successful. Sometimes
> the value still there!!
>
>
>
> Any suggestions?
>
> --
>
> Regards,
> Sávio S. Teles de Oliveira
>
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
>
> Master's student in Computer Science - UFG
> Software Architect
>
> CUIA Internet Brasil
>





Re: [Cassandra] [Generation of SStableLoader slow]

2014-12-24 Thread 严超
Yes, I think so too. Also, I tried VMs with 2 CPUs and with 4 CPUs, and the
4-CPU one really was faster.
But it still took 1 hour to generate the SSTables for a 1 GB CSV. I am
wondering whether there is any other way to make it faster besides adding
CPUs and RAM.


2014-12-24 20:40 GMT+08:00 Ryan Svihla :

> I think that'd be slow copying large files with just the cp command.
> Cassandra isn't doing anything amazingly strange here, you don't have a lot
> of RAM, nor CPU and I'm assuming the underlying disk is slow here as well.
> Without more parameters and details it's hard to define if there is an
> issue.
>
> On Dec 24, 2014 7:36 AM, "严超"  wrote:
>
>> Hi, Everyone:
>>
>> I'm importing a CSV file into Cassandra using SStableLoader. And I'm
>> following the example here:
>> https://github.com/yukim/cassandra-bulkload-example/
>>
>> But, Even though the streaming of SSTables is very fast , I find that
>> generation of SStables is quite slow for very large files (CSV, 4GB+). I am
>> using a Dual Core computer with 2 GB ram. Could it be because of the system
>> spec or any other factor?
>>
>> Thank you for any advice.
>>
>> *Best Regards!*
>>
>>
>> *Chao Yan--**My twitter:Andy Yan @yanchao727
>> *
>>
>>
>> *My Weibo:http://weibo.com/herewearenow
>> --*
>>
>


Re: Tombstones without DELETE

2014-12-24 Thread Ryan Svihla
You should probably ask on the Cassandra user mailing list.

However, TTL is the only other case I can think of.
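
For the TTL case: a cell written with a TTL is treated like a tombstone once it expires, and compaction can only drop it after gc_grace_seconds has also passed. A rough Python sketch of that lifecycle follows, assuming time is tracked in whole seconds. `TtlCell` and `cell_state` are hypothetical names, not Cassandra APIs, and 864000 seconds (ten days) is used as the gc_grace_seconds default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtlCell:
    written_at: int  # write time, in seconds
    ttl: int         # time to live in seconds; 0 means the cell never expires


def cell_state(cell: TtlCell, now: int, gc_grace_seconds: int = 864000) -> str:
    """Classify a cell as live, tombstone (expired), or purgeable."""
    if cell.ttl == 0:
        return "live"
    expires_at = cell.written_at + cell.ttl
    if now < expires_at:
        return "live"
    if now < expires_at + gc_grace_seconds:
        return "tombstone"
    return "purgeable"


cell = TtlCell(written_at=0, ttl=3600)
for now in (1800, 3600, 3600 + 864000):
    print(now, cell_state(cell, now))
```

So a table with no DELETE statements can still accumulate tombstones if rows are written with a TTL, whether per statement or through a table-level default_time_to_live.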

On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino  wrote:

> Hi there,
>
> Following this:
> https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ
>
> Under what conditions Cassandra generates a tombstone?
>
> Basically I have not even big table on cassandra (90M rows) in my code
> there is no delete and I use prepared statements (but binding all necessary
> values).
>
> I'm aware that a tombstone gets created when:
>
> 1. You delete the row
> 2. You set a column to null while previously it had a value
> 3. When you use prepared statements and you don't bind all the values
>
> Anything else that I should be aware of?
>
> Thanks!
>
> To unsubscribe from this group and stop receiving emails from it, send an
> email to java-driver-user+unsubscr...@lists.datastax.com.
>





Re: [Cassandra] [Generation of SStableLoader slow]

2014-12-24 Thread Ryan Svihla
I doubt there are huge gains to be had from tinkering. If adding more CPUs
speeds things up, that indicates you're resource bound. It's running on a VM,
probably with a slow underlying disk; at some point it's just physics. You
can try using the Java client instead of sstableloader, but I doubt that will
actually be faster for your particular use case.

On Wed, Dec 24, 2014 at 7:05 AM, 严超  wrote:

> Yes, I think so too. Plus, I used VM with 4 CPUs and 2 CPUs, and 4CPUs
> really did faster.
> But It took 1 hour to generate sstable for 1G csv. I am wondering if there
> is other way to make it faster except adding CPUs and ram.
>
> *Best Regards!*
>
>
> *Chao Yan--**My twitter:Andy Yan @yanchao727
> *
>
>
> *My Weibo:http://weibo.com/herewearenow
> --*
>
> 2014-12-24 20:40 GMT+08:00 Ryan Svihla :
>
>> I think that'd be slow copying large files with just the cp command.
>> Cassandra isn't doing anything amazingly strange here, you don't have a lot
>> of RAM, nor CPU and I'm assuming the underlying disk is slow here as well.
>> Without more parameters and details it's hard to define if there is an
>> issue.
>>
>> On Dec 24, 2014 7:36 AM, "严超"  wrote:
>>
>>> Hi, Everyone:
>>>
>>> I'm importing a CSV file into Cassandra using SStableLoader. And I'm
>>> following the example here:
>>> https://github.com/yukim/cassandra-bulkload-example/
>>>
>>> But, Even though the streaming of SSTables is very fast , I find that
>>> generation of SStables is quite slow for very large files (CSV, 4GB+). I am
>>> using a Dual Core computer with 2 GB ram. Could it be because of the system
>>> spec or any other factor?
>>>
>>> Thank you for any advice.
>>>
>>> *Best Regards!*
>>>
>>>
>>> *Chao Yan--**My twitter:Andy Yan @yanchao727
>>> *
>>>
>>>
>>> *My Weibo:http://weibo.com/herewearenow
>>> --*
>>>
>>
>




Re: CQL3 vs Thrift

2014-12-24 Thread Ryan Svihla
I'm not entirely certain why you can't model that to solve your use case
(wouldn't you be filtering the events as well, and therefore be able to get
all of that in one query?).

What you describe there has a number of avenues (collections, just heavier
use of statics in a different order than you specified, an object dump of
events in a single column, switching up the clustering columns) for getting
your question answered in one query. At the end of the day, CQL resolves to
a given SSTable format, and you can still open up cassandra-cli and view what
a given model looks like. Once you've grokked this adequately, you can
basically bend CQL to fit your logical Thrift modeling; at some point, as
with learning any new language, you'll learn to speak both (something I have
to do nearly daily).

FWIW, the primary valid complaint remaining for Thrift over CQL is that
modeling clustering columns in different nesting between rows is trivial
in Thrift and not really doable in CQL (clustering columns enforce a
nesting order by logical construct). Other than that, I've yet to be unable
to swap a client from Thrift to CQL, and it's always ended up faster (so far).

The main reason is that performance on modern Cassandra with the native
protocol is substantially better than pure Thrift for many query types (see
http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster). Your
mileage may vary, but I'd test it out before proclaiming that Thrift is
faster for your use case (and make liberal use of cassandra-cli alongside
CQL features to make sure you know what's going on internally; remember,
it's all just SSTables underneath).
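
As one concrete reading of the static-column avenue: if event_source and total_events are declared static, they are stored once per partition, and a single query on event_type can return them alongside every event row. The Python sketch below models that behaviour only. The `Partition` and `Table` classes are illustrative assumptions, not how Cassandra is implemented; the column names come from the timeseries example in this thread, and "sensor-a" is a made-up value.

```python
class Partition:
    def __init__(self):
        self.statics = {}  # static columns: one copy per partition
        self.rows = {}     # clustering key (insertion_time) -> regular columns


class Table:
    def __init__(self):
        self.partitions = {}

    def insert(self, event_type, insertion_time=None, statics=None, columns=None):
        partition = self.partitions.setdefault(event_type, Partition())
        partition.statics.update(statics or {})
        if insertion_time is not None:
            partition.rows.setdefault(insertion_time, {}).update(columns or {})

    def select(self, event_type):
        """One query: every event row, each carrying the shared static values."""
        partition = self.partitions.get(event_type)
        if partition is None:
            return []
        return [
            {**partition.statics, "insertion_time": ts, **cols}
            for ts, cols in sorted(partition.rows.items(), reverse=True)  # DESC order
        ]


table = Table()
table.insert("some-type", statics={"event_source": "sensor-a", "total_events": 2})
table.insert("some-type", insertion_time=1, columns={"event": b"first"})
table.insert("some-type", insertion_time=2, columns={"event": b"second"})

for row in table.select("some-type"):
    print(row)
```

The statics appear in every result row but exist only once in the partition; whether repeating them in the result set matters for a given workload is something to measure rather than assume.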




On Tue, Dec 23, 2014 at 12:00 PM, David Broyles 
wrote:

> Thanks, Ryan.  I wasn't aware of static column support, and indeed they
> get me most of what I need.  I think the only potential inefficiency is
> still at query time.  Using Thrift, I could design the column family to get
> all the static and dynamic content in a single query.
> If event_source and total_events are instead implemented as CQL3 statics,
> I probably need to do two queries to get data for a given event_type:
>
> To get event metadata (is the LIMIT 1 needed to reduce to 1 record?):
> SELECT event_source, total_events FROM timeseries WHERE event_type =
> 'some-type'
>
> To get the events:
> SELECT insertion_time, event FROM timeseries
>
> As a combined query, my concern is related to the overhead of repeating
> event_type/source/total_events (although with potentially many other pieces
> of static information).
>
> More generally, do you find that tuned applications tend to use Thrift, a
> combination of Thrift and CQL3, or is CQL3 really expected to replace
> Thrift?
>
> Thanks again!
>
> On Mon, Dec 22, 2014 at 9:50 PM, Ryan Svihla  wrote:
>
>> Don't static columns get you what you want?
>>
>>
>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
>>  On Dec 22, 2014 10:50 PM, "David Broyles"  wrote:
>>
>>> Although I used Cassandra 1.0.X extensively, I'm new to CQL3.  Pages
>>> such as http://wiki.apache.org/cassandra/ClientOptionsThrift suggest
>>> new projects should use CQL3.
>>>
>>> I'm wondering, however, if there are certain use cases not well covered
>>> by CQL3.  Consider the standard timeseries example:
>>>
>>> CREATE TABLE timeseries (
>>>     event_type text,
>>>     insertion_time timestamp,
>>>     event blob,
>>>     PRIMARY KEY (event_type, insertion_time)
>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>
>>> What happens if I want to store additional information that is shared by
>>> all events in the given series (but that I don't want to include in the row
>>> ID): e.g. the event source, a cached count of the number of events logged
>>> to date, etc.?  I might try updating the definition as follows:
>>>
>>> CREATE TABLE timeseries (
>>>     event_type text,
>>>     event_source text,
>>>     total_events int,
>>>     insertion_time timestamp,
>>>     event blob,
>>>     PRIMARY KEY (event_type, event_source, total_events, insertion_time)
>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>
>>> Is this not inefficient?  When inserting or querying via CQL3, say in
>>> batches of up to 1000 events, won't the type/source/count be repeated 1000
>>> times?  Please let me know if I'm misunderstanding something, or if I
>>> should be sticking to Thrift for situations like this involving mixed
>>> static/dynamic data.
>>>
>>> Thanks!
>>>
>>
>



Re: CQL3 vs Thrift

2014-12-24 Thread Ryan Svihla
Peter,

Can you come up with some specifics? I'm always interested in finding more
corner cases, but it's also possible I have a modeling alternative that you
may not have considered yet. Either way, it's good practice and background
for me.

On Tue, Dec 23, 2014 at 12:26 PM, Peter Lin  wrote:

>
> I'm biased in favor of using both Thrift and CQL3, though many people on the
> list probably think I'm crazy.
>
> CQL3 is good if what you need fits nicely in static columns, but it
> isn't if you want to use dynamic columns and/or mix and match both in the
> same column family. For a lot of what I use Cassandra for, CQL3 currently
> doesn't provide all the functionality. It is possible to extend CQL3
> further to make it handle 100% of the use cases that Thrift supports today.
>
> Whether that will happen is anyone's guess. SQL-like syntax is popular
> and many people understand it, but it doesn't necessarily line up perfectly
> with NoSQL column databases.
>
>
> On Tue, Dec 23, 2014 at 1:00 PM, David Broyles 
> wrote:
>
>> Thanks, Ryan.  I wasn't aware of static column support, and indeed they
>> get me most of what I need.  I think the only potential inefficiency  is
>> still at query time.  Using Thrift, I could design the column family to get
>> the all the static and dynamic content in a single query.
>> If event_source and total_events are instead implemented as CQL3 statics,
>> I probably need to do two queries to get data for a given event_type
>>
>> To get event metadata (is the LIMIT 1 needed to reduce to 1 record?):
>> SELECT event_source, total_events FROM timeseries WHERE event_type =
>> 'some-type'
>>
>> To get the events:
>> SELECT insertion_time, event FROM timeseries
>>
>> As a combined query, my concern is related to the overhead of repeating
>> event_type/source/total_events (although with potentially many other pieces
>> of static information).
>>
>> More generally, do you find that tuned applications tend to use Thrift, a
>> combination of Thrift and CQL3, or is CQL3 really expected to replace
>> Thrift?
>>
>> Thanks again!
>>
>> On Mon, Dec 22, 2014 at 9:50 PM, Ryan Svihla 
>> wrote:
>>
>>> Don't static columns get you what you want?
>>>
>>>
>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
>>>  On Dec 22, 2014 10:50 PM, "David Broyles"  wrote:
>>>
>>>> Although I used Cassandra 1.0.X extensively, I'm new to CQL3.  Pages
>>>> such as http://wiki.apache.org/cassandra/ClientOptionsThrift suggest
>>>> new projects should use CQL3.
>>>>
>>>> I'm wondering, however, if there are certain use cases not well covered
>>>> by CQL3.  Consider the standard timeseries example:
>>>>
>>>> CREATE TABLE timeseries (
>>>>     event_type text,
>>>>     insertion_time timestamp,
>>>>     event blob,
>>>>     PRIMARY KEY (event_type, insertion_time)
>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>
>>>> What happens if I want to store additional information that is shared
>>>> by all events in the given series (but that I don't want to include in the
>>>> row ID): e.g. the event source, a cached count of the number of events
>>>> logged to date, etc.?  I might try updating the definition as follows:
>>>>
>>>> CREATE TABLE timeseries (
>>>>     event_type text,
>>>>     event_source text,
>>>>     total_events int,
>>>>     insertion_time timestamp,
>>>>     event blob,
>>>>     PRIMARY KEY (event_type, event_source, total_events, insertion_time)
>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>
>>>> Is this not inefficient?  When inserting or querying via CQL3, say in
>>>> batches of up to 1000 events, won't the type/source/count be repeated 1000
>>>> times?  Please let me know if I'm misunderstanding something, or if I
>>>> should be sticking to Thrift for situations like this involving mixed
>>>> static/dynamic data.
>>>>
>>>> Thanks!
>>>>
>>>
>>
>




Re: CQL3 vs Thrift

2014-12-24 Thread Peter Lin

I've listed several in the past, so I won't repeat them here.

Just search the mailing list.

> On Dec 24, 2014, at 8:30 AM, Ryan Svihla  wrote:
> 
> Peter,
> 
> Can you come up with some specifics? I'm always interested in finding more 
> corner cases, but it's also possible I have a modeling alternative that you 
> may not have considered yet, regardless it's good practice and background for 
> me.
> 
>> On Tue, Dec 23, 2014 at 12:26 PM, Peter Lin  wrote:
>> 
>> I'm bias in favor of using both thrift and CQL3, though many people on the 
>> list probably think I'm crazy.
>> 
>> CQL3 is good if what you need fits nicely in static columns, but it doesn't 
>> if you want to use dynamic columns and/or mix & match both in the same 
>> columnFamily. For a lot of what I use Cassandra for, CQL3 currently doesn't 
>> provide all the functionality. It is possible to extend CQL3 further to make 
>> it handle 100% of the use cases that Thrift supports today.
>> 
>> whether that will happen is anyone's guess. SQL "like" syntax is popular and 
>> many people understand it, but it doesn't necessarily line up perfectly with 
>> NoSql column databases.
>> 
>> 
>>> On Tue, Dec 23, 2014 at 1:00 PM, David Broyles  wrote:
>>> Thanks, Ryan.  I wasn't aware of static column support, and indeed they get 
>>> me most of what I need.  I think the only potential inefficiency  is still 
>>> at query time.  Using Thrift, I could design the column family to get the 
>>> all the static and dynamic content in a single query.  
>>> If event_source and total_events are instead implemented as CQL3 statics, I 
>>> probably need to do two queries to get data for a given event_type
>>> 
>>> To get event metadata (is the LIMIT 1 needed to reduce to 1 record?): 
>>> SELECT event_source, total_events FROM timeseries WHERE event_type = 
>>> 'some-type'
>>> 
>>> To get the events:
>>> SELECT insertion_time, event FROM timeseries
>>> 
>>> As a combined query, my concern is related to the overhead of repeating 
>>> event_type/source/total_events (although with potentially many other pieces 
>>> of static information).
>>> 
>>> More generally, do you find that tuned applications tend to use Thrift, a 
>>> combination of Thrift and CQL3, or is CQL3 really expected to replace 
>>> Thrift?
>>> 
>>> Thanks again!
>>> 
>>>> On Mon, Dec 22, 2014 at 9:50 PM, Ryan Svihla  wrote:
>>>> Don't static columns get you what you want?
>>>>
>>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
>>>>> On Dec 22, 2014 10:50 PM, "David Broyles"  wrote:
>>>>> Although I used Cassandra 1.0.X extensively, I'm new to CQL3.  Pages such
>>>>> as http://wiki.apache.org/cassandra/ClientOptionsThrift suggest new
>>>>> projects should use CQL3.
>>>>>
>>>>> I'm wondering, however, if there are certain use cases not well covered
>>>>> by CQL3.  Consider the standard timeseries example:
>>>>>
>>>>> CREATE TABLE timeseries (
>>>>>     event_type text,
>>>>>     insertion_time timestamp,
>>>>>     event blob,
>>>>>     PRIMARY KEY (event_type, insertion_time)
>>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>>
>>>>> What happens if I want to store additional information that is shared by
>>>>> all events in the given series (but that I don't want to include in the
>>>>> row ID): e.g. the event source, a cached count of the number of events
>>>>> logged to date, etc.?  I might try updating the definition as follows:
>>>>>
>>>>> CREATE TABLE timeseries (
>>>>>     event_type text,
>>>>>     event_source text,
>>>>>     total_events int,
>>>>>     insertion_time timestamp,
>>>>>     event blob,
>>>>>     PRIMARY KEY (event_type, event_source, total_events, insertion_time)
>>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>>
>>>>> Is this not inefficient?  When inserting or querying via CQL3, say in
>>>>> batches of up to 1000 events, won't the type/source/count be repeated
>>>>> 1000 times?  Please let me know if I'm misunderstanding something, or if
>>>>> I should be sticking to Thrift for situations like this involving mixed
>>>>> static/dynamic data.
>>>>>
>>>>> Thanks!
> 
> 
> 


Re: CQL3 vs Thrift

2014-12-24 Thread Kai Wang
Ryan,

Can you elaborate a little on "Thrift over CQL is modeling clustering
columns in different nesting between rows is trivial in Thrift and not
really doable in CQL"?
On Dec 24, 2014 8:30 AM, "Ryan Svihla"  wrote:

> I'm not entirely certain how you can't model that to solve your use case
> (wouldn't you be filtering the events as well, and therefore be able to get
> all that in one query).
>
>  What you describe there has a number of avenues (collections, just
> heavier use of statics in a different order than you specified, object dump
> of events in a single column, switching up the clustering columns) of
> getting your question answered in one query. End of the day cql resolves to
> a given SStable format, you can still open up cassandra-cli and view what a
> given model looks like, when you've grokked this adequately you basically
> can bend CQL to fit your logical thrift modeling, at some point like
> learning any new language you'll learn to speak in both ( something I have
> to do nearly daily).
>
> FWIW other than the primary valid complaint remaining for Thrift over CQL
> is modeling clustering columns in different nesting between rows is trivial
> in Thrift and not really doable in CQL (clustering columns enforce a
> nesting order by logical construct), I've yet to not be able to swap a
> client from thrift to CQL ,and it's always ended up faster (so far).
>
> The main reason for this is performance on modern Cassandra and the native
> protocol is substantially better than pure thrift for many query types (see
> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) , so
> your mileage may vary, but I'd test it out first before proclaiming that
> thrift is faster for your use case (and make liberal use of cql features
> with cassandra-cli to make sure you know what's going on internally,
> remember it's all just sstables underneath).
>
>
>
>
> On Tue, Dec 23, 2014 at 12:00 PM, David Broyles 
> wrote:
>
>> Thanks, Ryan.  I wasn't aware of static column support, and indeed they
>> get me most of what I need.  I think the only potential inefficiency  is
>> still at query time.  Using Thrift, I could design the column family to get
>> the all the static and dynamic content in a single query.
>> If event_source and total_events are instead implemented as CQL3 statics,
>> I probably need to do two queries to get data for a given event_type
>>
>> To get event metadata (is the LIMIT 1 needed to reduce to 1 record?):
>> SELECT event_source, total_events FROM timeseries WHERE event_type =
>> 'some-type'
>>
>> To get the events:
>> SELECT insertion_time, event FROM timeseries
>>
>> As a combined query, my concern is related to the overhead of repeating
>> event_type/source/total_events (although with potentially many other pieces
>> of static information).
>>
>> More generally, do you find that tuned applications tend to use Thrift, a
>> combination of Thrift and CQL3, or is CQL3 really expected to replace
>> Thrift?
>>
>> Thanks again!
>>
>> On Mon, Dec 22, 2014 at 9:50 PM, Ryan Svihla 
>> wrote:
>>
>>> Don't static columns get you what you want?
>>>
>>>
>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
>>>  On Dec 22, 2014 10:50 PM, "David Broyles"  wrote:
>>>
>>>> Although I used Cassandra 1.0.X extensively, I'm new to CQL3.  Pages
>>>> such as http://wiki.apache.org/cassandra/ClientOptionsThrift suggest
>>>> new projects should use CQL3.
>>>>
>>>> I'm wondering, however, if there are certain use cases not well covered
>>>> by CQL3.  Consider the standard timeseries example:
>>>>
>>>> CREATE TABLE timeseries (
>>>>     event_type text,
>>>>     insertion_time timestamp,
>>>>     event blob,
>>>>     PRIMARY KEY (event_type, insertion_time)
>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>
>>>> What happens if I want to store additional information that is shared
>>>> by all events in the given series (but that I don't want to include in the
>>>> row ID): e.g. the event source, a cached count of the number of events
>>>> logged to date, etc.?  I might try updating the definition as follows:
>>>>
>>>> CREATE TABLE timeseries (
>>>>     event_type text,
>>>>     event_source text,
>>>>     total_events int,
>>>>     insertion_time timestamp,
>>>>     event blob,
>>>>     PRIMARY KEY (event_type, event_source, total_events, insertion_time)
>>>> ) WITH CLUSTERING ORDER BY (insertion_time DESC);
>>>>
>>>> Is this not inefficient?  When inserting or querying via CQL3, say in
>>>> batches of up to 1000 events, won't the type/source/count be repeated 1000
>>>> times?  Please let me know if I'm misunderstanding something, or if I
>>>> should be sticking to Thrift for situations like this involving mixed
>>>> static/dynamic data.
>>>>
>>>> Thanks!
>>>>
>>>
>>
>
>
> --
>
> Ryan Svihla
>
> Solution Architect

Re: CQL3 vs Thrift

2014-12-24 Thread Peter Lin
Basically, any time you want to store maps of maps, lists of lists, or
actual Java objects, CQL is not a good fit. CQL is really only good for
primitive types and flat lists, maps, and sets.

Using Cassandra purely with static columns is perfectly valid, but I don't
live in that world. Most of what I do requires dynamic columns mixed with
static columns in a single column family. This will sound like heresy, but
for a use case that fits perfectly in the SQL model, you're better off using
something like VoltDB, which gives you 100% SQL with ACID.



On Wed, Dec 24, 2014 at 10:38 AM, Kai Wang  wrote:

> Ryan,
>
> Can you elaborate a little on "Thrift over CQL is modeling clustering
> columns in different nesting between rows is trivial in Thrift and not
> really doable in CQL"?
> On Dec 24, 2014 8:30 AM, "Ryan Svihla"  wrote:
>
>> I'm not entirely certain how you can't model that to solve your use case
>> (wouldn't you be filtering the events as well, and therefore be able to get
>> all that in one query).
>>
>>  What you describe there has a number of avenues (collections, just
>> heavier use of statics in a different order than you specified, object dump
>> of events in a single column, switching up the clustering columns) of
>> getting your question answered in one query. End of the day cql resolves to
>> a given SStable format, you can still open up cassandra-cli and view what a
>> given model looks like, when you've grokked this adequately you basically
>> can bend CQL to fit your logical thrift modeling, at some point like
>> learning any new language you'll learn to speak in both ( something I have
>> to do nearly daily).
>>
>> FWIW other than the primary valid complaint remaining for Thrift over CQL
>> is modeling clustering columns in different nesting between rows is trivial
>> in Thrift and not really doable in CQL (clustering columns enforce a
>> nesting order by logical construct), I've yet to not be able to swap a
>> client from thrift to CQL ,and it's always ended up faster (so far).
>>
>> The main reason for this is performance on modern Cassandra and the
>> native protocol is substantially better than pure thrift for many query
>> types (see
>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) , so
>> your mileage may vary, but I'd test it out first before proclaiming that
>> thrift is faster for your use case (and make liberal use of cql features
>> with cassandra-cli to make sure you know what's going on internally,
>> remember it's all just sstables underneath).
>>
>>
>>
>>
>> On Tue, Dec 23, 2014 at 12:00 PM, David Broyles 
>> wrote:
>>
>>> Thanks, Ryan.  I wasn't aware of static column support, and indeed they
>>> get me most of what I need.  I think the only potential inefficiency  is
>>> still at query time.  Using Thrift, I could design the column family to get
>>> the all the static and dynamic content in a single query.
>>> If event_source and total_events are instead implemented as CQL3
>>> statics, I probably need to do two queries to get data for a given
>>> event_type
>>>
>>> To get event metadata (is the LIMIT 1 needed to reduce to 1 record?):
>>> SELECT event_source, total_events FROM timeseries WHERE event_type =
>>> 'some-type'
>>>
>>> To get the events:
>>> SELECT insertion_time, event FROM timeseries
>>>
>>> As a combined query, my concern is related to the overhead of repeating
>>> event_type/source/total_events (although with potentially many other pieces
>>> of static information).
>>>
>>> More generally, do you find that tuned applications tend to use Thrift,
>>> a combination of Thrift and CQL3, or is CQL3 really expected to replace
>>> Thrift?
>>>
>>> Thanks again!

Re: CQL3 vs Thrift

2014-12-24 Thread Eric Stevens
As Ryan mentioned, CQL is simply a translation layer to the underlying
storage mechanism you're already familiar with from Thrift.

There are definitely corner cases where it's not possible to get a
one-for-one equivalent in CQL vs Thrift, and even when there are
equivalents, the underlying data might not look exactly the same (e.g., if
you used string composites instead of native composites, or several mixed
composite types, and so on).

CQL is not meant to provide SQL equivalency.  It's not only missing many
SQL constructs, it also has a number of unique constructs of its own.
It's meant to look familiar to people comfortable with SQL, but you
cannot reason about it the same way.

Everyone is of course free to use the access layer they prefer, but
personally I would recommend building all new features using a CQL-oriented
approach.  The Thrift interface is frozen and will not get new features;
there are some really awesome features already released only for CQL,
and more are coming.  Find a path that works for you in CQL; we had to
change our thinking about a number of things, but it's worth the effort.
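
For reference, a minimal sketch of the static-column approach discussed in
this thread. It assumes Cassandra 2.0.6 or later (where static columns are
available) and reuses the timeseries table from the original question; the
sample values are made up for illustration. The idea is that event_source
and total_events are stored once per partition instead of once per event,
and a single SELECT can return them alongside the events:

```cql
-- Sketch only: table and column names come from the original question,
-- the inserted values are hypothetical.
CREATE TABLE timeseries (
    event_type text,
    insertion_time timestamp,
    event blob,
    event_source text STATIC,   -- shared by every row in the partition
    total_events int STATIC,    -- shared by every row in the partition
    PRIMARY KEY (event_type, insertion_time)
) WITH CLUSTERING ORDER BY (insertion_time DESC);

-- Static columns can be written with only the partition key.
UPDATE timeseries
   SET event_source = 'sensor-a', total_events = 1
 WHERE event_type = 'some-type';

-- Events are written per clustering key, without repeating the statics.
INSERT INTO timeseries (event_type, insertion_time, event)
VALUES ('some-type', '2014-12-24 10:00:00', 0xcafe);

-- One query for both the shared metadata and the events.
SELECT event_source, total_events, insertion_time, event
  FROM timeseries
 WHERE event_type = 'some-type';

-- Metadata only; LIMIT 1 caps the result at a single row.
SELECT event_source, total_events
  FROM timeseries
 WHERE event_type = 'some-type'
 LIMIT 1;
```

In the combined query the static values show up on every returned row, but
they are stored only once per partition. Whether the repeated values in the
result set matter for a given workload is something to measure rather than
assume.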

On Wed, Dec 24, 2014 at 8:48 AM, Peter Lin  wrote:

>
> Basically, any time you want to store maps of maps, lists of lists, or
> actual Java objects, CQL is not a good fit. CQL is really only good for
> primitive types and flat lists, maps, and sets.
>
> Using Cassandra purely with static columns is perfectly valid, but I don't
> live in that world. Most of what I do requires dynamic columns mixed with
> static columns in a single column family. This will sound like heresy, but
> for a use case that fits perfectly in the SQL model, you're better off using
> something like VoltDB, which gives you 100% SQL with ACID.

Re: CQL3 vs Thrift

2014-12-24 Thread Peter Lin
@Eric - totally agree. People should choose what is most comfortable for
them, but they should also take time to learn both and really understand
Cassandra at a deep level. The same is true of any database, even if most
people don't bother to read and understand how a piece of technology works.
I've seen some people confused about Cassandra, especially if they go to
GitHub and see the description. New people could get the wrong impression:

https://github.com/apache/cassandra

"Row store means that like relational databases, Cassandra organizes data
by rows and columns. The Cassandra Query Language (CQL) is a close relative
of SQL."



On Wed, Dec 24, 2014 at 11:43 AM, Eric Stevens  wrote:

> As Ryan mentioned, CQL is simply a translation layer to the underlying
> storage mechanism you're already familiar with from Thrift.
>
> There are definitely corner cases where it's not possible to get a
> one-for-one equivalent in CQL vs Thrift, and even when there are
> equivalents, the underlying data might not look exactly the same (e.g., if
> you used string composites instead of native composites, or several mixed
> composite types, and so on).
>
> CQL is not meant to provide SQL equivalency.  It's not only missing many
> SQL constructs, it also has a number of unique constructs of its own.
> It's meant to look familiar to people comfortable with SQL, but you
> cannot reason about it the same way.
>
> Everyone is of course free to use the access layer they prefer, but
> personally I would recommend building all new features using a CQL-oriented
> approach.  The Thrift interface is frozen and will not get new features;
> there are some really awesome features already released only for CQL,
> and more are coming.  Find a path that works for you in CQL; we had to
> change our thinking about a number of things, but it's worth the effort.

Nodes Dying in 2.1.2

2014-12-24 Thread Phil Burress
We just upgraded our cluster from 2.1.1 to 2.1.2 and our nodes keep dying.
The kernel is killing the process because it runs out of memory:

kernel:  Out of memory: Kill process 6267 (java) score 998 or sacrifice child

This appears to occur only during compactions. We've tried adjusting the heap
settings, but nothing has worked so far. We did not have this issue until
we upgraded. Has anyone else run into this, or have suggestions?

Thanks!

Phil