Re: confirm subscribe to user@cassandra.apache.org

2013-09-03 Thread Lee Bowyer
On Tue, 3 Sep 2013 09:56:46 +
 wrote:

> Hi! This is the ezmlm program. I'm managing the
> user@cassandra.apache.org mailing list.
> 
> To confirm that you would like
> 
>l...@icritical.com
> 
> added to the user mailing list, please send
> a short reply to this address:
> 
>
> user-sc.1378202206.mlbpekpacfeeaimadhpp-leeb=icritical@cassandra.apache.org
> 
> Usually, this happens when you just hit the "reply" button.
> If this does not work, simply copy the address and paste it into
> the "To:" field of a new message.
> 
> This confirmation serves two purposes. First, it verifies that I am able
> to get mail through to you. Second, it protects you in case someone
> forges a subscription request in your name.
> 
> Please note that ALL Apache dev- and user- mailing lists are publicly
> archived.  Do familiarize yourself with Apache's public archive policy at
> 
> http://www.apache.org/foundation/public-archives.html
> 
> prior to subscribing and posting messages to user@cassandra.apache.org.
> If you're not sure whether or not the policy applies to this mailing list,
> assume it does unless the list name contains the word "private" in it.
> 
> Some mail programs are broken and cannot handle long addresses. If you
> cannot reply to this request, instead send a message to
>  and put the
> entire address listed above into the "Subject:" line.
> 
> 
> --- Administrative commands for the user list ---
> 
> I can handle administrative requests automatically. Please
> do not send them to the list address! Instead, send
> your message to the correct command address:
> 
> To subscribe to the list, send a message to:
>
> 
> To remove your address from the list, send a message to:
>
> 
> Send mail to the following for info and FAQ for this list:
>
>
> 
> Similar addresses exist for the digest list:
>
>
> 
> To get messages 123 through 145 (a maximum of 100 per request), mail:
>
> 
> To get an index with subject and author for messages 123-456 , mail:
>
> 
> They are always returned as sets of 100, max 2000 per request,
> so you'll actually get 100-499.
> 
> To receive all messages with the same subject as message 12345,
> send a short message to:
>
> 
> The messages should contain one line or word of text to avoid being
> treated as sp@m, but I will ignore their content.
> Only the ADDRESS you send to is important.
> 
> You can start a subscription for an alternate address,
> for example "john@host.domain", just add a hyphen and your
> address (with '=' instead of '@') after the command word:
> 
> 
> To stop subscription for this address, mail:
> 
> 
> In both cases, I'll send a confirmation message to that address. When
> you receive it, simply reply to it to complete your subscription.
> 
> If despite following these instructions, you do not get the
> desired results, please contact my owner at
> user-ow...@cassandra.apache.org. Please be patient, my owner is a
> lot slower than I am ;-)
> 
> --- Enclosed is a copy of the request I received.
> 
> Return-Path: 
> Received: (qmail 2647 invoked by uid 99); 3 Sep 2013 09:56:45 -
> Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
> by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Sep 2013 09:56:45 +
> X-ASF-Spam-Status: No, hits=-2.0 required=10.0
>   tests=ASF_LIST_OPS,SPF_PASS
> X-Spam-Check-By: apache.org
> Received-SPF: pass (athena.apache.org: domain of l...@icritical.com 
> designates 212.57.248.143 as permitted sender)
> Received: from [212.57.248.143] (HELO mail3.icritical.com) (212.57.248.143)
> by apache.org (qpsmtpd/0.29) with SMTP; Tue, 03 Sep 2013 09:56:39 +
> Received: (qmail 5017 invoked from network); 3 Sep 2013 09:56:32 -
> Received: from localhost (127.0.0.1)
>   by mail3.icritical.com with SMTP; 3 Sep 2013 09:56:32 -
> Received: (qmail 4941 invoked by uid 599); 3 Sep 2013 09:56:22 -
> Received: from unknown (HELO PDC002.icritical.int) (195.62.218.2)
> by mail3.icritical.com (qpsmtpd/0.28) with ESMTP; Tue, 03 Sep 2013 
> 10:56:22 +0100
> Date: Tue, 3 Sep 2013 10:56:05 +0100
> From: Lee Bowyer 
> To: 
> Subject: subscribe
> Message-ID: <20130903105605.517615e7@freestyle>
> X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
> MIME-Version: 1.0
> Content-Type: text/plain; charset="US-ASCII"
> Content-Transfer-Encoding: 7bit
> X-TLS-Incoming: YES
> X-Virus-Scanned: by iCritical at mail3.icritical.com
> X-Virus-Checked: Checked by ClamAV on apache.org
> 


Gradle script to execute cql3 scripts

2013-09-03 Thread dawood abdullah
I have a requirement to execute CQL3 scripts through Gradle. Is there a
Cassandra plugin for Gradle that does this, or is there some other way I can
execute CQL3 scripts during the build itself? Please suggest.

Dawood
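
(A sketch of one possible approach, in case it helps: drive the script from a
small helper class run by a Gradle JavaExec task. This assumes the DataStax
Java driver, 2.0-style API, is on the classpath and that the script contains
plain ';'-terminated statements without comments; the class name, contact
point, and naive statement splitting are illustrative only.)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import java.nio.file.Files;
import java.nio.file.Paths;

// Replays a CQL3 script against a node; wire it into the build with a
// Gradle JavaExec task, passing the script path as args[0].
public class CqlScriptRunner {
    public static void main(String[] args) throws Exception {
        String script = new String(Files.readAllBytes(Paths.get(args[0])), "UTF-8");
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        try {
            // Naive split on ';' -- good enough for simple schema/DML scripts.
            for (String stmt : script.split(";")) {
                if (!stmt.trim().isEmpty()) {
                    session.execute(stmt.trim());
                }
            }
        } finally {
            cluster.close();
        }
    }
}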


read ?

2013-09-03 Thread Langston, Jim
Hi all,

Quick question

I'm currently looking at a 4-node cluster and have stopped all writing to
Cassandra, with reads continuing. I'm trying to understand the utilization
of memory within the JVM. nodetool info on each of the nodes shows them all
growing in footprint, two of the four at a greater rate. On restart of
Cassandra each was at about 100 MB; after 2 days, they are at:

Heap Memory (MB) : 798.41 / 3052.00

Heap Memory (MB) : 370.44 / 3052.00

Heap Memory (MB) : 549.73 / 3052.00

Heap Memory (MB) : 481.89 / 3052.00

Ring configuration:

Address RackStatus State   LoadOwns
Token
   
127605887595351923798765477786913079296
x 1d  Up Normal  4.38 GB 25.00%  0
x   1d  Up Normal  4.17 GB 25.00%  
42535295865117307932921825928971026432
x   1d  Up Normal  4.19 GB 25.00%  
85070591730234615865843651857942052864
x   1d  Up Normal  4.14 GB 25.00%  
127605887595351923798765477786913079296


What I'm not sure of is why the growth differs between the nodes, and why
that growth is being created by activity that is read-only.

Is Cassandra caching and holding the read data?

I currently have caching turned off for the key/row cache. Also, as part of
the info command:

Key Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN 
recent hit rate, 14400 save period in seconds
Row Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN 
recent hit rate, 0 save period in seconds



Thanks,

Jim


[RELEASE] Apache Cassandra 2.0 released

2013-09-03 Thread Sylvain Lebresne
The Cassandra team is very pleased to announce the release of Apache Cassandra
version 2.0.0. Cassandra 2.0.0 is a new major release that adds numerous
improvements[1,2], including:
  - Lightweight transactions[4] that offer linearizable consistency (see the
sketch after this list).
  - Experimental Triggers Support[5].
  - Numerous enhancements to CQL as well as a new and better version of the
native protocol[6].
  - Compaction improvements[7] (including a hybrid strategy that combines
leveled and size-tiered compaction).
  - A new, faster Thrift server implementation based on the LMAX Disruptor[8].
  - Eager retries: avoid query timeouts by sending data requests to other
replicas if too much time passes on the original request.
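
As a quick illustration of the lightweight transaction syntax, here is a
minimal sketch using the DataStax Java driver; the keyspace, table, and
contact point below are placeholder assumptions, not part of the release:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");
        // Compare-and-set insert: only applied if no row with this key exists.
        ResultSet rs = session.execute(
                "INSERT INTO users (username, email) " +
                "VALUES ('jsmith', 'jsmith@example.com') IF NOT EXISTS");
        // The result carries an [applied] column reporting whether the
        // conditional write actually took place.
        System.out.println("applied: " + rs.one().getBool("[applied]"));
        cluster.close();
    }
}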

See the full changelog[1] for more and please make sure to check the release
notes[2] for upgrading details.

Both source and binary distributions of Cassandra 2.0.0 can be downloaded
at:

 http://cassandra.apache.org/download/

As usual, a debian package is available from the project APT repository[3]
(you will need to use the 20x series).

The Cassandra team

[1]: http://goo.gl/zU4sWv (CHANGES.txt)
[2]: http://goo.gl/MrR6Qn (NEWS.txt)
[3]: http://wiki.apache.org/cassandra/DebianPackaging
[4]:
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
[5]:
http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-0-prototype-triggers-support
[6]: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
[7]: https://issues.apache.org/jira/browse/CASSANDRA-5371
[8]: https://issues.apache.org/jira/browse/CASSANDRA-5582


Re: CqlStorage creates wrong schema for Pig

2013-09-03 Thread Chad Johnston
You're trying to use FromCqlColumn on a tuple that has been flattened. The
schema still thinks it's {title: chararray}, but the flattened tuple is now
two values. I don't know how to retrieve the data values in this case.

Your code will work correctly if you do this:
values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

(Use FromCqlColumn on the original data, not the flattened data.)

Chad


On Mon, Sep 2, 2013 at 8:45 AM, Miguel Angel Martin junquera <
mianmarjun.mailingl...@gmail.com> wrote:

> Hi
>
>
> 1.-
>
> Maybe this?
>
> -- Register the UDF
> REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar
>
> -- FromCqlColumn will convert chararray, int, long, float, double
> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>
> -- Load data as normal
> data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
>
> -- Use the UDF
> data = FOREACH data_raw GENERATE
> FromCqlColumn(isbn) AS ISBN,
> FromCqlColumn(bookauthor) AS BookAuthor,
> FromCqlColumn(booktitle) AS BookTitle,
> FromCqlColumn(publisher) AS Publisher,
> FromCqlColumn(yearofpublication) AS YearOfPublication;
>
>
>
>
>
> and  2.:
>
> with  the data in cql cassandra 1.2.8, pig 0.11.11 and cql3:
>
> CREATE KEYSPACE keyspace1
>
>   WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor'
> : 1 }
>
>   AND durable_writes = true;
>
> use keyspace1;
>
>   CREATE TABLE test (
>
> id text PRIMARY KEY,
>
> title text,
>
> age int
>
>   )  WITH COMPACT STORAGE;
>
>   insert into test (id, title, age) values('1', 'child', 21);
>
>   insert into test (id, title, age) values('2', 'support', 21);
>
>   insert into test (id, title, age) values('3', 'manager', 31);
>
>   insert into test (id, title, age) values('4', 'QA', 41);
>
>   insert into test (id, title, age) values('5', 'QA', 30);
>
>   insert into test (id, title, age) values('6', 'QA', 30);
>
>
>
>
>
> and script:
>
> register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
> rows = LOAD
> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
> CqlStorage();
> dump rows;
> ILLUSTRATE rows;
> describe rows;
> A = FOREACH rows GENERATE FLATTEN(title);
> dump A;
> values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
> dump values3;
> describe values3;
>
>
> --
>
>
>
> I have this error:
>
>
>
>
> 
>
> ---------------------------------------------------------
> | rows | id:chararray   | age:int   | title:chararray   |
> ---------------------------------------------------------
> |      | (id, 5)        | (age, 30) | (title, QA)       |
> ---------------------------------------------------------
>
> rows: {id: chararray,age: int,title: chararray}
>
>
> ...
>
> (title,QA)
> (title,QA)
> ..
> 2013-09-02 16:40:52,454 [Thread-11] WARN
>  org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
> java.lang.ClassCastException: java.lang.String cannot be cast to
> org.apache.pig.data.Tuple
> at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
>  at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
>  at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
>  at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
>  at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
>  at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-09-02 16:40:52,832 [main] INFO
>  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_local_0003
>
>
>
> 8-|
>
> Regards
>
> ...
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/9/2 Miguel Angel Martin junquera 
>
>> hi all:
>>
>> More info :
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-5941
>>
>>
>>
>> I tried this (and gen. cassandra 1.2.9) but it does not work.

Re: row cache

2013-09-03 Thread Chris Burroughs

On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote:

Yes, that is correct.

The SerializingCacheProvider stores row cache contents off heap. I believe you
need JNA enabled for this though. Someone please correct me if I am wrong here.

The ConcurrentLinkedHashCacheProvider stores row cache contents on the java heap
itself.



Naming things is hard.  Both caches are in memory and are backed by a 
ConcurrentLinkedHashMap.  In the case of the SerializingCacheProvider 
the *values* are stored in off-heap buffers.  Both must store a half 
dozen or so objects (on heap) per entry 
(org.apache.cassandra.cache.RowCacheKey, 
com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, 
java.util.concurrent.ConcurrentHashMap$HashEntry, etc.).  It would 
probably be better to call this a "mixed-heap" rather than an off-heap 
cache.  You may find the number of entries you can hold without gc 
problems to be surprisingly low (relative to, say, memcached, or physical 
memory on modern hardware).


Invalidating a column with SerializingCacheProvider invalidates the 
entire row while with ConcurrentLinkedHashCacheProvider it does not. 
SerializingCacheProvider does not require JNA.


Both also use memory estimation of the size (of the values only) to 
determine the total number of entries retained.  Estimating the size of 
the totally on-heap ConcurrentLinkedHashCacheProvider has historically 
been dicey since we switched from sizing in entries, and it has been 
removed in 2.0.0.


As said elsewhere in this thread the utility of the row cache varies 
from "absolutely essential" to "source of numerous problems" depending 
on the specifics of the data model and request distribution.





RE: read ?

2013-09-03 Thread Lohfink, Chris
To get an accurate picture you should force a full GC on each node; the heap 
utilization can be misleading since there can be a lot of things in the heap 
with no strong references.

There are a number of factors that can lead to this.  For a true comparison I 
would recommend using jconsole and calling dumpHeap on 
com.sun.management:type=HotSpotDiagnostic with the 2nd param true (force GC).  
Then open the heap dump up in a tool like YourKit and you will get a better 
comparison; it will also tell you what it is that's taking the space.

Chris
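
(For reference, the same dumpHeap operation can be invoked programmatically
over JMX rather than through jconsole. A minimal sketch, assuming JMX is
reachable on Cassandra's default port 7199 with no authentication; the host
name and output path are placeholders.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ForceHeapDump {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://node1:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // dumpHeap(String outputFile, boolean live): passing 'true'
            // dumps only live objects, which forces a full GC first.
            mbs.invoke(new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
                    "dumpHeap",
                    new Object[] { "/tmp/node1.hprof", Boolean.TRUE },
                    new String[] { "java.lang.String", "boolean" });
        } finally {
            jmxc.close();
        }
    }
}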



Re: Recomended storage choice for Cassandra on Amazon m1.xlarge instance

2013-09-03 Thread Andrey Ilinykh
You benefit from putting the commit log on a separate drive only if that
drive is an isolated spinning device. EC2 ephemeral storage is a virtual
device, so I don't think it makes sense to put the commit log on a separate
drive. I would build raid0 from the 4 drives and put everything there. But
it would be interesting to compare different configurations.

Thank you,
   Andrey


On Mon, Sep 2, 2013 at 7:11 PM, Renat Gilfanov  wrote:

> Hello,
>
> I'd like to ask what the best option is for separating the commit log and
> data on an Amazon m1.xlarge instance, given 4x420 GB attached (ephemeral)
> storage volumes and an EBS volume.
>
> As far as I understand, EBS is not the right choice and it's recommended
> to use the attached storage instead.
> Is it better to combine the 4 ephemeral drives into 2 raid0 (or raid1?)
> arrays and store data on the first and the commit log on the second? Or
> maybe try other combinations, like 1 attached volume for the commit log
> and the 3 others grouped in raid0 for data?
>
> Thank you.
>
>
>


Re: Versioning in cassandra

2013-09-03 Thread dawood abdullah
Jan,

The solution you gave works spot on, but there is one more requirement I
forgot to mention. Following is my table structure

CREATE TABLE file (
  id text,
  contenttype text,
  createdby text,
  createdtime timestamp,
  description text,
  name text,
  parentid text,
  version timestamp,
  PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);


The query (select * from file where id = 'xxx' limit 1;) provided solves
the problem of finding the latest version file. But I have one more
requirement of finding all the latest version files having parentid say
'yyy'.

Please suggest how this query can be achieved.

Dawood



On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah
wrote:

> In my case version can be timestamp as well. What do you suggest version
> number to be, do you see any problems if I keep version as counter /
> timestamp ?
>
>
> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen <
> jan.algermis...@nordsc.com> wrote:
>
>>
>> On 02.09.2013, at 20:44, dawood abdullah 
>> wrote:
>>
>> > Requirement is like I have a column family say File
>> >
>> > create table file(id text primary key, fname text, version int,
>> mimetype text, content text);
>> >
>> > Say, I have a few records inserted; when I modify an existing record
>> (content is modified) a new version needs to be created, as I need to have
>> provision to revert back to any old version whenever required.
>> >
>>
>> So, can version be a timestamp? Or does it need to be an integer?
>>
>> In the former case, make use of C*'s ordering like so:
>>
>> CREATE TABLE file (
>>file_id text,
>>version timestamp,
>>fname text,
>>
>>PRIMARY KEY (file_id,version)
>> ) WITH CLUSTERING ORDER BY (version DESC);
>>
>> Get the latest file version with
>>
>> select * from file where file_id = 'xxx' limit 1;
>>
>> If it has to be an integer, use counter columns.
>>
>> Jan
>>
>>
>> > Regards,
>> > Dawood
>> >
>> >
>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
>> jan.algermis...@nordsc.com> wrote:
>> > Hi Dawood,
>> >
>> > On 02.09.2013, at 16:36, dawood abdullah 
>> wrote:
>> >
>> > > Hi
>> > > I have a requirement of versioning to be done in Cassandra.
>> > >
>> > > Following is my column family definition
>> > >
>> > > create table file_details(id text primary key, fname text, version
>> int, mimetype text);
>> > >
>> > > I have a secondary index created on fname column.
>> > >
>> > > Whenever I do an insert for the same 'fname', the version should be
>> incremented. And when I retrieve a row with fname it should return me the
>> latest version row.
>> > >
>> > > Is there a better way to do in Cassandra? Please suggest what
>> approach needs to be taken.
>> >
>> > Can you explain more about your use case?
>> >
>> > If the version need not be a small number, but could be a timestamp,
>> you could make use of C*'s ordering feature , have the database set the new
>> version as a timestamp and retrieve the latest one with a simple LIMIT 1
>> query. (I'll explain more when this is an option for you).
>> >
>> > Jan
>> >
>> > P.S. Me being a REST/HTTP head, an alarm rings when I see 'version'
>> next to 'mimetype' :-) What exactly are you versioning here? Maybe we can
>> even change the situation from a functional POV?
>> >
>> >
>> > >
>> > > Regards,
>> > >
>> > > Dawood
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>>
>>
>


RE: read ?

2013-09-03 Thread Lohfink, Chris
Does it actually OOM eventually? There will be a certain amount of object 
allocation for reads (or anything) which will see the heap creep up until a 
GC, but at ~500 MB or so of an 8 GB heap there is little reason for the JVM 
to collect, so it probably just ignores it to save processing.  Even the 
young gen won't require a collection at this size.

Which version of Cassandra are you running? Prior to 1.2 a lot of metadata 
about the sstables took considerable heap, which could cause additional 
memory utilization.

Chris



Re: Upgrade from 1.0.9 to 1.2.8

2013-09-03 Thread Mike Neir
Ah. I was going by the upgrade recommendations in the NEWS.txt file in the 
cassandra source tree, which didn't make mention of that version (1.0.11) 
whatsoever. I didn't see any show-stoppers that would have prevented me from 
going straight from 1.0.9 to 1.2.x.


https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-1.2.4

Looks like a multi-step upgrade is the way I'll be proceeding. Thanks for the 
insight, everyone.


MN

On 09/02/2013 11:04 AM, Jeremiah D Jordan wrote:

1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x?


Because this fix in 1.0.11:
* fix 1.0.x node join to mixed version cluster, other nodes >= 1.1 
(CASSANDRA-4195)

-Jeremiah


--



Mike Neir
Liquid Web, Inc.
Infrastructure Administrator



Re: Versioning in cassandra

2013-09-03 Thread Vivek Mishra
Create a secondary index over parentid,
OR
make it part of the clustering key.

-Vivek




Re: Versioning in cassandra

2013-09-03 Thread Vivek Mishra
My bad, I missed the "latest version" part.

-Vivek



RE: map/reduce performance time and sstable reader…

2013-09-03 Thread java8964 java8964
I am trying to do the same thing, as in our project we want to load the data 
from Cassandra into a Hadoop cluster, and SSTable is one obvious option, as 
you can get the changed data since the last batch load directly from the 
SSTable incremental backup files.

But, based on my research so far (I may be wrong, as I have only done limited 
research about the SSTable format; I hope someone in this forum can tell me 
that I am wrong), it may NOT be a good option:

1) sstable2json looks like NOT a scalable solution to get the data out of 
Cassandra, and it needs access to the "data" directory to get some metadata 
from the system keyspace for the column family being dumped, which may not be 
an option in your MR environment.
2) So far I am thinking of reusing the same API as is used in sstable2json, 
but I have to provide this metadata in the API, like validator types, 
partitioner, etc. I am surprised that, as a backup, the column family SSTable 
dump files DON'T contain this information themselves. Shouldn't it be 
possible to find this out from the SSTable files (ONLY)?
3) The big trouble comes if you want to parse the SSTables in your MR code. 
The API internally will load the Index/CompressionInfo data from the 
Index/Compression files, which it assumes are located in the same place as 
the data file, but it uses a FileStream internally. So if these data files 
are in a DFS (Distributed File System), so far I didn't find a way to tell 
the API to use a stream from the DFS instead of a local file input stream. So 
basically you have 2 options: a) Copy these files from the DFS to the local 
file system (same as what the Knewton guys did at 
https://github.com/Knewton/KassandraMRHelper) b) Develop your own API to 
access the SSTable files directly (my guess is that the Netflix guys probably 
did it this way; they have a project called "Aegisthus", see here: 
http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html, 
but it is not open source).
4) About the performance, I am not sure, as sstable2json underneath is using 
the same Cassandra API, but running in MR gives us some support in 
scalability, as we can reuse the Hadoop framework for the many benefits it 
can bring.

Yong

> From: dean.hil...@nrel.gov
> To: user@cassandra.apache.org
> Date: Fri, 30 Aug 2013 07:25:09 -0600
> Subject: map/reduce performance time and sstable reader…
> 
> Has anyone done performance tests on sstable reading vs. M/R?  I did a quick 
> test on reading all SSTables in an LCS column family on 23 tables and took the 
> average time it took sstable2json(to /dev/null to make it faster) which was 7 
> seconds per table.  (reading to stdout took 16 seconds per table).  This then 
> worked out to an estimation of 12.5 hours up to 27 hours(from to stdout 
> calculation).  I am suspecting the map/reduce time may be much worse since 
> there are not as many repeated rows in LCS
> 
> I.e. I am wondering if I should just read from SSTables directly instead of 
> map/reduce?   I am about to dig around in the code of M/R and sstable2json to 
> see what each is doing specifically.
> 
> Thanks,
> Dean
  

Re: Cassandra cluster migration in Amazon EC2

2013-09-03 Thread Robert Coli
On Mon, Sep 2, 2013 at 4:21 PM, Renat Gilfanov  wrote:

> - Group 3 of storages into raid0 array, move data directory to the raid0,
> and commit log - to the 4th left storage.
>  - As far as I understand, separation of commit log and data directory
> should make performance better - but what about separation the OS from
> those two  - is it worth doing?
>

Nope. Best practice for Amazon is ephemeral disks, with RAID0 for data +
commit log.


>  - What are the steps to perform such migration? Will it be possible to
> perform it without downtime, restarting node by node with new configuration
> applied?
>  I'm especially worried about IP changes, when we'll uprade the instance
> type. What's the recomended way to handle those IP changes?
>

Just set auto_bootstrap: false in cassandra.yaml when changing the IP
address of a node: if you have copied all the data its token owned before
the IP change, the node does not need to be bootstrapped.

=Rob


Re: Versioning in cassandra

2013-09-03 Thread dawood abdullah
I have tried with both the options creating secondary index and also tried
adding parentid to primary key, but I am getting all the files with
parentid 'yyy', what I want is the latest version of file with the
combination of parentid, fileid. Say below are the records inserted in the
file table:

insert into file (id, parentid, version, contenttype, description, name)
values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name)
values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name)
values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name)
values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');

I want to write a query which returns me the second and last records and not
the first and third, because for the first and third records there exists a
later version for the combination of id and parentid.

I am confused as to whether this is achievable at all; please suggest.

Dawood




Re: read ?

2013-09-03 Thread Langston, Jim
Thanks Chris,

I have about 8 heap dumps that I have been looking at. I have been trying to
isolate why I have been dumping heap; I've started by removing the apps that
write to Cassandra and eliminating the work that would entail. I am left
with just the apps that are reading the data, and from the heap dumps it
looks like Cassandra Column methods are being called; because there are so
many objects, it is difficult to ascertain exactly what the problem may be.
That prompted my query: trying to quickly determine if Cassandra holds
objects that have been used for reading, and if so, why, and more
importantly whether something can be done.

Jim

From: "Lohfink, Chris" mailto:chris.lohf...@digi.com>>
Reply-To: mailto:user@cassandra.apache.org>>
Date: Tue, 3 Sep 2013 11:12:19 -0500
To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>
Subject: RE: read ?

To get an accurate picture you should force a full GC on each node, the heap 
utilization can be misleading since there can be a lot of things in the heap 
with no strong references.

There is a number of factors that can lead to this.  For a true comparison I 
would recommend using jconsole and call dumpHeap on 
com.sun.management:type=HotSpotDiagnostic with the 2nd param true (force GC).  
Then open the heap dump up in a tool like yourkit and you will get a better 
comparison and also it will tell you what it is that’s taking the space.

Chris

From: Langston, Jim [mailto:jim.langs...@compuware.com]
Sent: Tuesday, September 03, 2013 8:20 AM
To: user@cassandra.apache.org
Subject: read ?

Hi all,

Quick question

I currently am looking at a 4 node cluster and I have currently stopped all 
writing to
Cassandra,  with the reads continuing. I'm trying to understand the utilization
of memory within the JVM. nodetool info on each of the nodes shows them all
growing in footprint, 2 of the three at a greater rate. On the restart of 
Cassandra
each were at about 100MB, after 2 days, each of the following are at:

Heap Memory (MB) : 798.41 / 3052.00

Heap Memory (MB) : 370.44 / 3052.00

Heap Memory (MB) : 549.73 / 3052.00

Heap Memory (MB) : 481.89 / 3052.00

Ring configuration:

Address RackStatus State   LoadOwns
Token
   
127605887595351923798765477786913079296
x 1d  Up Normal  4.38 GB 25.00%  0
x   1d  Up Normal  4.17 GB 25.00%  
42535295865117307932921825928971026432
x   1d  Up Normal  4.19 GB 25.00%  
85070591730234615865843651857942052864
x   1d  Up Normal  4.14 GB 25.00%  
127605887595351923798765477786913079296


What I'm not sure of is what the growth is different between each ? and why
that growth is being created by activity that is read only.

Is Cassandra caching and holding the read data ?

I currently have caching turned off for the key/row. Also as part of the info 
command

Key Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN 
recent hit rate, 14400 save period in seconds
Row Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN 
recent hit rate, 0 save period in seconds



Thanks,

Jim


Re: map/reduce performance time and sstable reader…

2013-09-03 Thread Hiller, Dean
We are considering creating our own InputFormat for Hadoop and running the 
tasktrackers on every 3rd node (i.e. RF=3) such that we cover all ranges.  
Our M/R overhead appears to be 13 days vs. 12.5 hours for just reading 
SSTables directly on our current data set.

I personally don't think parsing SSTables (using the Hadoop M/R framework) is 
a big deal for us, since we run tasktrackers on the Cassandra nodes we need 
them on.  I.e. we don't need to copy to a DFS to do this, I believe (at least 
not in our situation).

I already wrote a client on the SSTableReader, parsing out sstables to take a 
look at some of our data while our 13-day M/R job is running (we are 4 days 
in already, with no failures and no performance degradation).

later,
Dean
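
(For anyone curious, a rough sketch of such a client against the Cassandra
1.2 internal API -- the same classes sstable2json builds on. This is an
internal, version-specific API, so treat the names below as a sketch: it
needs the full Cassandra jar on the classpath and a cassandra.yaml visible
via -Dcassandra.config so the schema can be loaded; the sstable path comes
in as args[0].)

import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.db.columniterator.OnDiskAtomIterator;
import org.apache.cassandra.io.sstable.Descriptor;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.io.sstable.SSTableScanner;

public class SSTableRowKeyDump {
    public static void main(String[] args) throws Exception {
        // Load table definitions from the system keyspace so the reader
        // can resolve the partitioner and comparators for this sstable.
        DatabaseDescriptor.loadSchemas();
        SSTableReader reader = SSTableReader.open(Descriptor.fromFilename(args[0]));
        SSTableScanner scanner = reader.getDirectScanner();
        try {
            while (scanner.hasNext()) {
                // Each element is an iterator over one row's columns.
                OnDiskAtomIterator row = scanner.next();
                System.out.println(row.getKey());
            }
        } finally {
            scanner.close();
        }
    }
}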



Re: Versioning in cassandra

2013-09-03 Thread Laing, Michael
try the following. -ml

-- put this in a file and run it using 'cqlsh -f <file>'

DROP KEYSPACE latest;

CREATE KEYSPACE latest WITH replication = {
'class': 'SimpleStrategy',
'replication_factor' : 1
};

USE latest;

CREATE TABLE file (
parentid text, -- row_key, same for each version
id text, -- column_key, same for each version
contenttype map<timestamp, text>, -- differs by version; the version is the
key to the map
PRIMARY KEY (parentid, id)
);

update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where
parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where
parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where
parentid = 'd1' and id = 'f2';
update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where
parentid = 'd1' and id = 'f2';

select * from file where parentid = 'd1';

-- returns:

-- parentid | id | contenttype
-- ---------+----+--------------------------------------------------------------------------
--   d1     | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05 00:00:00-0500': 'pdf2'}
--   d1     | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05 00:00:00-0500': 'pdf4'}

-- use an app to pop off the latest version from the map

-- map other varying fields using the same technique as used for contenttype
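
-- to illustrate the "pop off" step: a minimal client-side sketch with the
-- DataStax Java driver (contact point assumed local, keyspace as above)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

public class LatestVersion {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("latest");
        for (Row row : session.execute(
                "SELECT id, contenttype FROM file WHERE parentid = 'd1'")) {
            // Sort the version->value map by timestamp; the last entry is
            // the latest version for this (parentid, id).
            TreeMap<Date, String> versions = new TreeMap<Date, String>(
                    row.getMap("contenttype", Date.class, String.class));
            Map.Entry<Date, String> latest = versions.lastEntry();
            System.out.println(row.getString("id") + " -> "
                    + latest.getKey() + " = " + latest.getValue());
        }
        cluster.close();
    }
}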




Re: Versioning in cassandra

2013-09-03 Thread Vivek Mishra
create table file(id text , parentid text,contenttype text,version
timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING
ORDER BY (version DESC);

insert into file (id, parentid, version, contenttype, descr, name) values
('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1');
insert into file (id, parentid, version, contenttype, descr, name) values
('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1');
insert into file (id, parentid, version, contenttype, descr, name) values
('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, descr, name) values
('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
create index on file(parentid);


select * from file where id='f1' and parentid='d1' limit 1;

select * from file where parentid='d1' limit 1;


Will it work for you?

-Vivek




On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra  wrote:

> My bad. I did miss out to read "latest version" part.
>
> -Vivek
>
>
> On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah <
> muhammed.daw...@gmail.com> wrote:
>
>> I have tried both options, creating a secondary index and also adding
>> parentid to the primary key, but I am getting all the files with parentid
>> 'yyy'. What I want is the latest version of each file for the combination
>> of parentid and fileid. Say the records below are inserted into the file
>> table:
>>
>> insert into file (id, parentid, version, contenttype, description, name)
>> values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>> insert into file (id, parentid, version, contenttype, description, name)
>> values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>> insert into file (id, parentid, version, contenttype, description, name)
>> values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>> insert into file (id, parentid, version, contenttype, description, name)
>> values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');
>>
>> I want to write a query which returns the second and fourth records and
>> not the first and third, because for the first and third there exists a
>> later version for the same combination of id and parentid.
>>
>> I am confused whether this is achievable at all; please suggest.
>>
>> Dawood
>>
>>
>>
>> On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra wrote:
>>
>>> create secondary index over parentid.
>>> OR
>>> make it part of clustering key
>>>
>>> -Vivek
>>>
>>>
>>> On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah <
>>> muhammed.daw...@gmail.com> wrote:
>>>
 Jan,

 The solution you gave works spot on, but there is one more requirement
 I forgot to mention. Following is my table structure

 CREATE TABLE file (
   id text,
   contenttype text,
   createdby text,
   createdtime timestamp,
   description text,
   name text,
   parentid text,
   version timestamp,
   PRIMARY KEY (id, version)

 ) WITH CLUSTERING ORDER BY (version DESC);


 The query provided (select * from file where id = 'xxx' limit 1;) solves
 the problem of finding the latest version of a file. But I have one more
 requirement: finding the latest versions of all files having parentid, say,
 'yyy'.

 Please suggest how this query can be achieved.

 Dawood



 On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah <
 muhammed.daw...@gmail.com> wrote:

> In my case version can be timestamp as well. What do you suggest the
> version should be? Do you see any problems if I keep version as a counter
> / timestamp?
>
>
> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen <
> jan.algermis...@nordsc.com> wrote:
>
>>
>> On 02.09.2013, at 20:44, dawood abdullah 
>> wrote:
>>
>> > Requirement is like I have a column family say File
>> >
>> > create table file(id text primary key, fname text, version int,
>> > mimetype text, content text);
>> >
>> > Say I have a few records inserted; when I modify an existing record
>> > (content is modified), a new version needs to be created, as I need the
>> > ability to revert back to any old version whenever required.
>> >
>>
>> So, can version be a timestamp? Or does it need to be an integer?
>>
>> In the former case, make use of C*'s ordering like so:
>>
>> CREATE TABLE file (
>>file_id text,
>>version timestamp,
>>fname text,
>>
>>PRIMARY KEY (file_id,version)
>> ) WITH CLUSTERING ORDER BY (version DESC);
>>
>> Get the latest file version with
>>
>> select * from file where file_id = 'xxx' limit 1;
>>
>> If it has to be an integer, use counter columns.
>>
>> Jan
>>
>>
>> > Regards,
>> > Dawood
>> >
>> >
>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
>> jan.algermis...@nordsc.com> wrote:
>> > Hi Dawood,
>> >
>> > On 02.09.2013, at

Re: [RELEASE] Apache Cassandra 2.0 released

2013-09-03 Thread Jeremiah D Jordan
Thanks for everyone's work on this release!

-Jeremiah

On Sep 3, 2013, at 8:48 AM, Sylvain Lebresne  wrote:

> The Cassandra team is very pleased to announce the release of Apache Cassandra
> version 2.0.0. Cassandra 2.0.0 is a new major release that adds numerous
> improvements[1,2], including:
>   - Lightweight transactions[4] that offer linearizable consistency.
>   - Experimental Triggers Support[5].
>   - Numerous enhancements to CQL as well as a new and better version of the
> native protocol[6].
>   - Compaction improvements[7] (including a hybrid strategy that combines 
> leveled and size-tiered compaction).
>   - A new faster Thrift Server implementation based on LMAX Disruptor[8].
>   - Eager retries: avoid query timeouts by sending data requests to other
> replicas if too much time passes on the original request.
> 
> See the full changelog[1] for more and please make sure to check the release
> notes[2] for upgrading details.
> 
> Both source and binary distributions of Cassandra 2.0.0 can be downloaded at:
> 
>  http://cassandra.apache.org/download/
> 
> As usual, a debian package is available from the project APT repository[3]
> (you will need to use the 20x series).
> 
> The Cassandra team
> 
> [1]: http://goo.gl/zU4sWv (CHANGES.txt)
> [2]: http://goo.gl/MrR6Qn (NEWS.txt)
> [3]: http://wiki.apache.org/cassandra/DebianPackaging
> [4]: 
> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
> [5]: 
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-0-prototype-triggers-support
> [6]: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
> [7]: https://issues.apache.org/jira/browse/CASSANDRA-5371
> [8]: https://issues.apache.org/jira/browse/CASSANDRA-5582
> 
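
For a taste of the new lightweight-transaction syntax from [4], a minimal
sketch (the table and values are illustrative, not from the announcement):

-- INSERT only succeeds if no row with this primary key exists yet
INSERT INTO users (username, email) VALUES ('jdoe', 'jdoe@example.com') IF NOT EXISTS;

-- UPDATE only applies if the current value matches the expected one
UPDATE users SET email = 'new@example.com' WHERE username = 'jdoe'
IF email = 'jdoe@example.com';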



How to fix host ID collision?

2013-09-03 Thread Renat Gilfanov

Hello,

We have a Cassandra cluster with 5 nodes hosted in Amazon EC2, and I had to
restart two of them, so their IPs changed.
We use NetworkTopologyStrategy, so I simply updated the IPs in the
cassandra-topology.properties file.

However, as I understand it, the old IPs remain somewhere in the system
keyspace, and I now observe several different exception stack traces in the
log files, including:

java.lang.RuntimeException: Host ID collision between active endpoint / and / (id=ab66dd02-96b2-4504-8403-7d066f911698)
    at org.apache.cassandra.locator.TokenMetadata.updateHostId(TokenMetadata.java:229)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1358)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
    at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1960)
    at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:837)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:915)
    at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

and

java.lang.AssertionError: Missing host ID for 
    at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:583)
    at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:552)
    at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1658)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)


nodetool status, executed on the 3 old nodes, shows the old ghost node:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load     Tokens  Owns   Host ID                               Rack
UN  10.14.128.109  2.8 GB   141     4.1%   32260392-12c2-4f1a-812e-87fd9a960d10  RAC2
UN  10.24.33.187   2.12 GB  258     42.7%  ab66dd02-96b2-4504-8403-7d066f911698  RAC3
UN  10.20.149.165  2.99 GB  251     4.5%   a0792f59-20b1-4017-a7f6-88e0c0d7f86f  RAC1
DN  10.11.73.104   1.07 GB  2       1.0%   null                                  RAC1

Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load     Tokens  Owns   Host ID                               Rack
UN  10.34.78.23    2.21 GB  117     0.9%   2acd3766-404d-4cdc-b3e3-7b3b95856f0e  RAC1
UN  10.20.23.171   2.22 GB  255     46.8%  67421e3a-1dfc-48a0-88b3-c6dbd64dc9d8  RAC1


Is it possible to fix these host ID collisions?


Thanks.

Re: How to fix host ID collision?

2013-09-03 Thread Robert Coli
On Tue, Sep 3, 2013 at 2:01 PM, Renat Gilfanov  wrote:

>
> We have Cassandra cluster with 5 nodes hosted in the Amazon EC2, and  I
> had to restart two of them, so their IPs changed.
> We use NetworkTopologyStrategy, so I simply updated IPs in the
> cassandra-topology.properties file.
>

Set auto_bootstrap: false in the conf file and restart the node to change the
IP address of a node.

=Rob
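
In cassandra.yaml terms, a sketch of the setting Rob mentions (not a complete
config):

# on the node whose IP changed: keep the existing data and tokens on restart
# instead of attempting to bootstrap as a brand-new node
auto_bootstrap: false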


RE: Update-Replace

2013-09-03 Thread Baskar Duraikannu
I have a similar use case but only need to update a portion of the row. We
basically perform a single write (with both old and new columns), using a very
low TTL value for the old columns.
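
In CQL terms, that single write might look roughly like this (a sketch with
assumed column names; the TTL value is illustrative):

BEGIN BATCH
  -- old columns get a short TTL so they expire shortly after the switch-over
  UPDATE file USING TTL 60 SET old_description = 'previous value' WHERE id = 'f1';
  -- new columns are written without a TTL
  UPDATE file SET description = 'current value' WHERE id = 'f1';
APPLY BATCH;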

> From: jan.algermis...@nordsc.com
> Subject: Update-Replace
> Date: Fri, 30 Aug 2013 17:35:48 +0200
> To: user@cassandra.apache.org
> 
> Hi,
> 
> I have a use case, where I periodically need to apply updates to a wide row 
> that should replace the whole row.
> 
> The straight-forward insert/update only replaces values that are present in 
> the executed statement, keeping the remaining data around.
> 
> Is there a smooth way to do a replace with C* or do I have to handle this by 
> the application (e.g. doing delete and then write or coming up with a more 
> clever data model)?
> 
> Jan
  

RE: List retrieve performance

2013-09-03 Thread Baskar Duraikannu
I don't know of any. I would check the size of the list. If it is taking
long, it could simply be that the disk read is taking long.

Date: Sat, 31 Aug 2013 16:35:22 -0300
Subject: List retrieve performance
From: savio.te...@lupa.inf.ufg.br
To: user@cassandra.apache.org

I have a column family with this conf:

CREATE TABLE geoms (
  geom_key text PRIMARY KEY,
  part_geom list,
  the_geom text
) WITH
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.10 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};



I run this query "select geom_key, the_geom, part_geom from geoms limit 1;"
and it takes 700 ms.

When I run the same query without the part_geom attribute (select geom_key,
the_geom from geoms limit 1;), the query runs in 5 ms.

Is there a performance problem with a List attribute?

Thanks in advance


-- 
Sincerely,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Master's student in Computer Science - UFG
Software Architect
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG

Re: Versioning in cassandra

2013-09-03 Thread Laing, Michael
I use the technique described in my previous message to handle millions of
messages and their versions.

Actually, I use timeuuid's instead of timestamps, as they have more
'uniqueness'. Also I index my maps by a timeuuid that is the complement
(based on a future date) of a current timeuuid. Since maps are kept sorted
by key, this means I can just pop off the first one to get the most recent.
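
A rough sketch of that complement trick in Java, using the DataStax
java-driver's UUID helpers (the class name and the far-future epoch are my own
assumptions, not from the post):

import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public final class ComplementKeys {
    // arbitrary far-future instant (2100-01-01 UTC in millis); any date
    // safely beyond the life of the data would do
    private static final long FUTURE_MS = 4102444800000L;

    private ComplementKeys() {}

    // map keys sort ascending, so giving newer versions *smaller* timeuuids
    // makes the most recent version the first entry in the map
    public static UUID complementOfNow() {
        // UUIDs.startOf is deterministic for a given millisecond; fine for
        // illustration, but check the driver docs before relying on uniqueness
        return UUIDs.startOf(FUTURE_MS - System.currentTimeMillis());
    }
}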

The downside of this approach is that you get more stuff returned to you
from Cassandra than you need. To mitigate that I queue a job to examine and
correct the situation if, upon doing a read, the number of versions for a
particular key is higher than some threshold, e.g. 50. There are many ways
to approach this problem.

Our actual implementation proceeds to another level, as we also have
replicas of versions. This happens because we process important
transactions in parallel and can expect up to 9 replicas of each version.
We journal them all and use them for reporting latencies in our processing
pipelines as well as for replay when we need to recover application state.

Regards,

Michael


On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael wrote:

> try the following. -ml
>
> -- put this in  and run using 'cqlsh -f 
>
> DROP KEYSPACE latest;
>
> CREATE KEYSPACE latest WITH replication = {
> 'class': 'SimpleStrategy',
> 'replication_factor' : 1
> };
>
> USE latest;
>
> CREATE TABLE file (
> parentid text, -- row_key, same for each version
> id text, -- column_key, same for each version
> contenttype map<timestamp, text>, -- differs by version; version is
> the key to the map
> PRIMARY KEY (parentid, id)
> );
>
> update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where
> parentid = 'd1' and id = 'f1';
> update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where
> parentid = 'd1' and id = 'f1';
> update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where
> parentid = 'd1' and id = 'f2';
> update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where
> parentid = 'd1' and id = 'f2';
>
> select * from file where parentid = 'd1';
>
> -- returns:
>
> --  parentid | id | contenttype
> -- ----------+----+---------------------------------------------------------------------------
> --        d1 | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05 00:00:00-0500': 'pdf2'}
> --        d1 | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05 00:00:00-0500': 'pdf4'}
>
> -- use an app to pop off the latest version from the map
>
> -- map other varying fields using the same technique as used for
> contenttype
>
>
>

Re: Update-Replace

2013-09-03 Thread Jan Algermissen
Baskar,

On 03.09.2013, at 23:11, Baskar Duraikannu  
wrote:

> I have a similar use case but only need to update a portion of the row. We 
> basically perform a single write (with both old and new columns), using a very 
> low TTL value for the old columns. 

I found out that using bound statements with the java-driver works quite well
for this case, because fields with a ? in the prepared statement but without a
bound value are automatically set to null - hence removed.

So this actually automagically does what you/I want.

See 


Jan
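
A minimal sketch of that behavior (assuming the 2013-era DataStax java-driver
API; the keyspace and table are illustrative):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class ReplaceRowSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace"); // hypothetical keyspace

        PreparedStatement ps = session.prepare(
                "INSERT INTO file (id, name, description) VALUES (?, ?, ?)");

        // only the first two markers are bound; 'description' stays unbound
        // and is sent as null, so any previously stored value is removed
        BoundStatement bs = ps.bind("f1", "file1");
        session.execute(bs);

        cluster.shutdown(); // driver 1.x shutdown
    }
}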

> 
> > From: jan.algermis...@nordsc.com
> > Subject: Update-Replace
> > Date: Fri, 30 Aug 2013 17:35:48 +0200
> > To: user@cassandra.apache.org
> > 
> > Hi,
> > 
> > I have a use case, where I periodically need to apply updates to a wide row 
> > that should replace the whole row.
> > 
> > The straight-forward insert/update only replaces values that are present in 
> > the executed statement, keeping the remaining data around.
> > 
> > Is there a smooth way to do a replace with C* or do I have to handle this 
> > by the application (e.g. doing delete and then write or coming up with a 
> > more clever data model)?
> > 
> > Jan



cqlsh error after enabling encryption

2013-09-03 Thread David Laube
Hi All,

After enabling encryption on our Cassandra 1.2.8 nodes, we are receiving the
error "Connection error: TSocket read 0 bytes" while attempting to use cqlsh
to talk to the ring. I've followed the docs over at
http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/security/secureCqlshSSL_t.html
but can't seem to figure out why this isn't working. Inter-node communication
seems to be working properly since "nodetool status" shows our nodes as up, but
the cqlsh client is unable to talk to a single node or any node in the cluster
(specifying the IP in .cqlshrc or on the CLI) for some reason. I'm providing
the applicable config file entries below for review. Any insight or suggestions
would be greatly appreciated! :)



My ~/.cqlshrc file:


[connection]
hostname = 127.0.0.1
port = 9160
factory = cqlshlib.ssl.ssl_transport_factory

[ssl]
certfile = /etc/cassandra/conf/cassandra_client.crt
validate = true ## Optional, true by default.

[certfiles] ## Optional section, overrides the default certfile in the [ssl] 
section.
192.168.1.3 = ~/keys/cassandra01.cert
192.168.1.4 = ~/keys/cassandra02.cert




Our cassandra.yaml file config blocks:

…snip…

server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: yeah-right
    truststore: /etc/cassandra/conf/.truststore
    truststore_password: yeah-right
    # More advanced defaults below:
    # protocol: TLS
    # algorithm: SunX509
    # store_type: JKS
    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]
    # require_client_auth: false

# enable or disable client/server encryption.
client_encryption_options:
    enabled: true
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: yeah-right
    # require_client_auth: false
    # Set truststore and truststore_password if require_client_auth is true
    # truststore: conf/.truststore
    # truststore_password: cassandra
    # More advanced defaults below:
    protocol: TLS
    algorithm: SunX509
    store_type: JKS
    cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]

…snip...





Thanks,
-David Laube
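
An aside, not from the original thread: one common cause of "TSocket read 0
bytes" over SSL is a certfile that is not PEM-encoded, which is what cqlsh's
Python SSL layer expects. If the cert was exported straight from the JKS
keystore, converting it may be worth a try (a sketch; the alias is an
assumption):

# export the node's certificate from the JKS keystore (produces DER)
keytool -export -alias cassandra -keystore /etc/cassandra/conf/.keystore -file /tmp/node.der

# convert DER to PEM for cqlsh
openssl x509 -inform der -in /tmp/node.der -out /etc/cassandra/conf/cassandra_client.crt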



Re[2]: How to fix host ID collision?

2013-09-03 Thread Renat Gilfanov
Thanks a lot for the quick reply.

Should I run nodetool repair on all nodes before or after that?
Also, the documentation mentions that the auto_bootstrap setting applies only
to non-seed nodes. Currently I have specified all nodes as seeds; should I
remove the nodes with new IPs from the seed list?


Tuesday, 3 September 2013, 14:08 -07:00 from Robert Coli:
>On Tue, Sep 3, 2013 at 2:01 PM, Renat Gilfanov  < gren...@mail.ru > wrote:
>>
>>We have a Cassandra cluster with 5 nodes hosted in Amazon EC2, and I had 
>>to restart two of them, so their IPs changed.
>>We use NetworkTopologyStrategy, so I simply updated the IPs in the 
>>cassandra-topology.properties file.
>
>Set auto_bootstrap: false in the conf file and restart the node to change the 
>IP address of a node.
>
>=Rob



Re: List retrieve performance

2013-09-03 Thread Sávio Teles
The list is "null".


2013/9/3 Baskar Duraikannu 

> I don't know of any. I would check the size of the list. If it is taking
> long, it could simply be that the disk read is taking long.
>
> --
> Date: Sat, 31 Aug 2013 16:35:22 -0300
> Subject: List retrieve performance
> From: savio.te...@lupa.inf.ufg.br
> To: user@cassandra.apache.org
>
>
> I have a column family with this conf:
>
> CREATE TABLE geoms (
>   geom_key text PRIMARY KEY,
>   part_geom list,
>   the_geom text
> ) WITH
>   bloom_filter_fp_chance=0.01 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.00 AND
>   gc_grace_seconds=864000 AND
>   read_repair_chance=0.10 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'SnappyCompressor'};
>
>
> I run this query "select geom_key, the_geom, part_geom from geoms limit
> 1;" and it takes 700 ms.
>
> When I run the same query without the part_geom attribute (select geom_key,
> the_geom from geoms limit 1;), the query runs in 5 ms.
>
> Is there a performance problem with a List attribute?
>
> Thanks in advance
>
>
> --
> Sincerely,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> Master's student in Computer Science - UFG
> Software Architect
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>



-- 
Sincerely,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Master's student in Computer Science - UFG
Software Architect
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG


Fwd: {kundera-discuss} Kundera 2.7 released

2013-09-03 Thread Vivek Mishra
FYI.

-- Forwarded message --
From: Vivek Mishra 
Date: Wed, Sep 4, 2013 at 6:15 AM
Subject: {kundera-discuss} Kundera 2.7 released
To: "kundera-disc...@googlegroups.com" 


Hi All,

We are happy to announce the release of Kundera 2.7.

Kundera is a JPA 2.0 compliant object-datastore mapping library for NoSQL
datastores. The idea behind Kundera is to make working with NoSQL databases
drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB,
Redis, OracleNoSQL, Neo4j, ElasticSearch and relational databases.
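
For anyone new to Kundera, usage follows the standard JPA API. A minimal
sketch (the persistence-unit name and entity class are hypothetical):

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
class FileEntity { // hypothetical entity for illustration
    @Id
    private String id;
    private String name;

    public FileEntity() {}

    public FileEntity(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

public class KunderaSketch {
    public static void main(String[] args) {
        // "cassandra_pu" is a hypothetical persistence unit that
        // persistence.xml would point at a Cassandra keyspace through Kundera
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
        EntityManager em = emf.createEntityManager();

        em.persist(new FileEntity("f1", "file1")); // plain JPA persist
        FileEntity found = em.find(FileEntity.class, "f1");
        System.out.println(found != null);

        em.close();
        emf.close();
    }
}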


Major Changes:

1) Support for pagination over MongoDB.
2) Added ElasticSearch as a datastore and as a fallback indexing mechanism.

Github Bug Fixes:

https://github.com/impetus-opensource/Kundera/issues/234
https://github.com/impetus-opensource/Kundera/issues/215
https://github.com/impetus-opensource/Kundera/issues/201
https://github.com/impetus-opensource/Kundera/issues/333
https://github.com/impetus-opensource/Kundera/issues/362
https://github.com/impetus-opensource/Kundera/issues/350
https://github.com/impetus-opensource/Kundera/issues/365

How to Download:
To download, use or contribute to Kundera, visit:
http://github.com/impetus-opensource/Kundera

Latest released tag version is 2.7. Kundera maven libraries are now
available at:
https://oss.sonatype.org/content/repositories/releases/com/impetus

Sample codes and examples for using Kundera can be found here:
https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests

Survey/Feedback:
http://www.surveymonkey.com/s/BMB9PWG

Thank you all for your contributions and using Kundera!


Sincerely,
Kundera Team








NOTE: This message may contain information that is confidential,
proprietary, privileged or otherwise protected by law. The message is
intended solely for the named addressee. If received in error, please
destroy and notify the sender. Any use of this email is prohibited when
received in error. Impetus does not represent, warrant and/or guarantee,
that the integrity of this communication has been maintained nor that the
communication is free of errors, virus, interception or interference.

--
You received this message because you are subscribed to the Google Groups
"kundera-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to kundera-discuss+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.