Re: SSTable Index and Metadata - are they cached in RAM?

2012-08-17 Thread Maciej Miklas
Great articles, I did not find those before!

SSTable Index - yes, I mean the column index.

I would like to understand how many disk seeks might be required to find a
column in a single SSTable.

I am assuming a positive bloom filter on the row key. Now Cassandra needs to
find out whether the given SSTable contains the column name, and this might
require a few disk seeks:
1) Check key cache, if found go to 5)
2) Read all row keys from disk, in order to find ours (binary search)
3) The found row key contains the disk offset to its column index
4) Read the column index for our row key from disk. The index also contains a
bloom filter on column names
5) Use the bloom filter on the column name, to find out whether this SSTable
might contain our column
6) Read the column to finally make sure that it exists

As I understand it, in the worst case we can have three disk seeks (2, 4, 6)
per SSTable in order to check whether it contains a given column - is that
correct?

I would expect that the sorted row keys (from point 2) already contain the
bloom filter for their columns. But the bloom filter is stored together with
the column index - is that correct?


Cheers,
Maciej

On Fri, Aug 17, 2012 at 12:06 AM, aaron morton wrote:

> What about SSTable index,
>
> Not sure what you are referring to there. Each row in an SSTable has
> a bloom filter and may have an index of columns. This is not cached.
>
> See http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ or
> http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance
>
>  and Metadata?
>
> This is the meta data we hold in memory for every open sstable
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java
>
> Cheers
>
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 16/08/2012, at 7:34 PM, Maciej Miklas  wrote:
>
> Hi all,
>
> bloom filter for row keys is always in RAM. What about SSTable index, and
> Metadata?
>
> Is it cached by Cassandra, or does it rely on memory-mapped files?
>
>
> Thanks,
> Maciej
>
>
>


Re: nodetool repair uses insane amount of disk space

2012-08-17 Thread aaron morton
I would take a look at the replication: what's the RF per DC and what does
nodetool ring say? It's hard (as in not recommended) to get NTS with rack
allocation working correctly. Without knowing much more, I would try to
understand what the topology is and whether it can be simplified.

>> Additionally, the repair process takes (what I feel is) an extremely long 
>> time to complete (36+ hours), and it always seems that nodes are streaming 
>> data to each other, even on back-to-back executions of the repair.
Run some metrics to clock the network IO during repair. 
Also run an experiment to repair a single CF twice from the same node and look 
at the logs for the second run. This will give us an idea of how much data is 
being transferred. 
Note that very wide rows can result in large repair transfers as the whole row 
is diff'd and transferred if needed.
 
Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:14 AM, Michael Morris  wrote:

> Upgraded to 1.1.3 from 1.0.8 about 2 weeks ago.
> 
> On Thu, Aug 16, 2012 at 5:57 PM, aaron morton  wrote:
> What version are you using? There were issues with repair using lots-o-space
> in 0.8.X; it's fixed in 1.X.
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 17/08/2012, at 2:56 AM, Michael Morris  wrote:
> 
>> Occasionally as I'm doing my regular anti-entropy repair I end up with a 
>> node that uses an exceptional amount of disk space (node should have about 
>> 5-6 GB of data on it, but ends up with 25+GB, and consumes the limited 
>> amount of disk space I have available)
>> 
>> How come a node would consume 5x its normal data size during the repair 
>> process?
>> 
>> My setup is kind of strange in that it's only about 80-100GB of data on a 35 
>> node cluster, with 2 data centers and 3 racks, however the rack assignments 
>> are unbalanced.  One data center has 8 nodes, and the other data center is 
>> split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.  
>> However, within each rack, the tokens are distributed equally. It's a long 
>> sad story about how we ended up this way, but it basically boils down to 
>> having to utilize existing resources to resolve a production issue.
>> 
>> Additionally, the repair process takes (what I feel is) an extremely long 
>> time to complete (36+ hours), and it always seems that nodes are streaming 
>> data to each other, even on back-to-back executions of the repair.
>> 
>> Any help on these issues is appreciated.
>> 
>> - Mike
>> 
> 
> 



Re: Cassandra 1.0 row deletion

2012-08-17 Thread aaron morton
> If you use the remove function to delete an entire row, is that an atomic 
> operation?

Yes. Row level deletes are atomic. 

cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 3:39 PM, Derek Williams  wrote:

> On Thu, Aug 16, 2012 at 9:08 PM, Terry Cumaranatunge  
> wrote: 
> We have a Cassandra 1.0 cluster that we run with RF=3 and perform operations 
> using a consistency level of quorum. We use batch_mutate for all inserts and 
> updates for atomicity across column families with the same row key, but use 
> the thrift interface remove API call in C++ to delete a row so that we can 
> delete an entire row without having to specify individual column names. If 
> you use the remove function to delete an entire row, is that an atomic 
> operation? In other words, can it delete a partial number of columns in the 
> row and leave other columns around?
> 
> It all depends on the timestamp for the column. A row level delete will place 
> a row tombstone at the timestamp given, causing all columns with an earlier 
> timestamp to be deleted. If a column has a later timestamp than the row
> tombstone, then it won't be deleted.
> 
> More info here: http://wiki.apache.org/cassandra/DistributedDeletes
> 
> -- 
> Derek Williams
> 
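Derek's timestamp rule can be sketched in a few lines of illustrative Java
(a sketch only, not Cassandra's actual classes):

    // Illustrative sketch: a column survives a row-level delete iff its own
    // timestamp is later than the row tombstone's timestamp.
    final class TombstoneRule {
        static boolean shadowedByRowTombstone(long columnTimestamp,
                                              long rowTombstoneTimestamp) {
            return columnTimestamp <= rowTombstoneTimestamp;
        }
    }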



Re: SSTable Index and Metadata - are they cached in RAM?

2012-08-17 Thread aaron morton
> 2) Read all row keys from disk, in order to find ours (binary search)
No.
At startup Cassandra samples the -Index.db component every index_interval keys.
At worst index_interval keys must be read from disk. 
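In illustrative Java (a sketch of the idea, not Cassandra's actual code), the
in-memory sample lookup works like this:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch: every index_interval-th row key is held in memory together
    // with its offset into the on-disk index component.
    class IndexSummarySketch {
        static final int INDEX_INTERVAL = 128;
        private final NavigableMap<String, Long> sample = new TreeMap<>();

        // The largest sampled key <= rowKey gives the offset at which to
        // start scanning the on-disk index; at most INDEX_INTERVAL entries
        // are read from there before the key is found or known absent.
        Long searchStartOffset(String rowKey) {
            Map.Entry<String, Long> floor = sample.floorEntry(rowKey);
            return floor == null ? null : floor.getValue();
        }
    }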

> As I understand it, in the worst case we can have three disk seeks (2, 4, 6)
> per SSTable in order to check whether it contains a given column - is that
> correct?
It depends on the size of the row. For a small row (less than
column_index_size_in_kb), getting a specific column takes:
* 1 seek in Index.db
* 1 seek in Data.db

> I would expect that the sorted row keys (from point 2) already contain the
> bloom filter for their columns. But the bloom filter is stored together with
> the column index - is that correct?
Yes

Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 7:31 PM, Maciej Miklas  wrote:

> Great articles, I did not find those before!
> 
> SSTable Index - yes, I mean the column index.
> 
> I would like to understand how many disk seeks might be required to find a
> column in a single SSTable.
> 
> I am assuming a positive bloom filter on the row key. Now Cassandra needs to
> find out whether the given SSTable contains the column name, and this might
> require a few disk seeks:
> 1) Check key cache, if found go to 5)
> 2) Read all row keys from disk, in order to find ours (binary search)
> 3) The found row key contains the disk offset to its column index
> 4) Read the column index for our row key from disk. The index also contains
> a bloom filter on column names
> 5) Use the bloom filter on the column name, to find out whether this SSTable
> might contain our column
> 6) Read the column to finally make sure that it exists
> 
> As I understand it, in the worst case we can have three disk seeks (2, 4, 6)
> per SSTable in order to check whether it contains a given column - is that
> correct?
> 
> I would expect that the sorted row keys (from point 2) already contain the
> bloom filter for their columns. But the bloom filter is stored together with
> the column index - is that correct?
> 
> 
> Cheers,
> Maciej
> 
> On Fri, Aug 17, 2012 at 12:06 AM, aaron morton  
> wrote:
>> What about SSTable index, 
> Not sure what you are referring to there. Each row in an SSTable has a
> bloom filter and may have an index of columns. This is not cached.
> 
> See http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ or 
> http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance
> 
>>  and Metadata?
> 
> This is the meta data we hold in memory for every open sstable
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java
> 
> Cheers
>   
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 16/08/2012, at 7:34 PM, Maciej Miklas  wrote:
> 
>> Hi all,
>> 
>> bloom filter for row keys is always in RAM. What about SSTable index, and 
>> Metadata?
>> 
>> Is it cached by Cassandra, or does it rely on memory-mapped files?
>> 
>> 
>> Thanks,
>> Maciej
> 
> 



Re: Omitting empty columns from CQL SELECT

2012-08-17 Thread aaron morton
If you specify the columns by name in the select clause the query returns them 
because they should be projected in the result set. 

Can you use a column slice instead ?
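For reference, a contiguous slice over the Thrift API looks roughly like this
(an illustrative sketch assuming an open Cassandra.Client named client; the
row key, column family, and range here are made up):

    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    // Ask for a range of columns rather than naming each one; only columns
    // that actually exist come back in the result.
    SliceRange range = new SliceRange(
            ByteBuffer.wrap("a".getBytes()),  // start column
            ByteBuffer.wrap("m".getBytes()),  // finish column
            false,                            // not reversed
            100);                             // max columns per call
    SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
    List<ColumnOrSuperColumn> cols = client.get_slice(
            ByteBuffer.wrap("rowkey".getBytes()),
            new ColumnParent("MyCF"),
            predicate,
            ConsistencyLevel.QUORUM);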

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:09 AM, Mat Brown  wrote:

> Hello all,
> 
> I've noticed that when performing a SELECT statement with a list of
> columns specified, Cassandra returns all columns in the resulting
> row(s) even if they have no value. This creates an apparently
> considerable amount of transport and deserialization overhead,
> particularly in one use case I'm looking at, in which we select a
> large collection of columns but expect only a small fraction of them
> to contain values. Is there any way to get around this and only
> receive columns that have values in the results?
> 
> Thanks,
> Mat



Understanding UnavailableException

2012-08-17 Thread Mohit Agarwal
Hi guys,

I am trying to understand what happens when an UnavailableException is
thrown.

a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
My understanding is that if one of the nodes is down and the coordinator
node is aware of that(through gossip), then it will respond to the request
with an UnavailableException. Is this correct?

b) What happens if the coordinator isn't aware of a node being down and
sends the request to all the nodes and never hears back from one of the
nodes. Would this result in a TimedOutException or an UnavailableException?

c) I am trying to understand the cases where the client receives an error,
but data could have been inserted into Cassandra. One such case is the
TimedOutException. Are there any other situations like these?

Thanks,
Mohit


Re: Omitting empty columns from CQL SELECT

2012-08-17 Thread Mat Brown
Hi Aaron,

Thanks for the answer. That makes sense and I can see it as a formal
reason for returning empty columns, but as a practical matter, is
there a situation in which that behavior would be useful?

Unfortunately a column slice won't do the trick -- the columns we're
looking for at any given time wouldn't correspond to a particular
range; it's essentially "random access".

For what it's worth, I've managed to make this operation about 30x
faster in a quick benchmark by just not selecting for specific columns
at all, and throwing away columns I don't care about in the
application layer instead. It's unclear whether the performance
improvements will continue to accrue as the column family becomes more
densely populated, though.

Anyway, thanks again!
Mat

On Fri, Aug 17, 2012 at 5:06 AM, aaron morton  wrote:
> If you specify the columns by name in the select clause the query returns
> them because they should be projected in the result set.
>
> Can you use a column slice instead ?
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 11:09 AM, Mat Brown  wrote:
>
> Hello all,
>
> I've noticed that when performing a SELECT statement with a list of
> columns specified, Cassandra returns all columns in the resulting
> row(s) even if they have no value. This creates an apparently
> considerable amount of transport and deserialization overhead,
> particularly in one use case I'm looking at, in which we select a
> large collection of columns but expect only a small fraction of them
> to contain values. Is there any way to get around this and only
> receive columns that have values in the results?
>
> Thanks,
> Mat
>
>


What is the ideal server-side technology stack to use with Cassandra?

2012-08-17 Thread Andy Ballingall TF
Hi,

I've been running a number of tests with Cassandra using a couple of
PHP drivers (namely PHPCassa (https://github.com/thobbs/phpcassa/) and
PDO-cassandra (http://code.google.com/a/apache-extras.org/p/cassandra-pdo/)),
and the experience hasn't been great, mainly because I can't try out
CQL3.

Aaron Morton (aa...@thelastpickle.com) advised:

"If possible i would avoid using PHP. The PHP story with cassandra has
not been great in the past. There is little love for it, so it takes a
while for work changes to get in the client drivers.

AFAIK it lacks server side states which makes connection pooling
impossible. You should not pool cassandra connections in something
like HAProxy."

So my question is - if you were to build a new scalable project from
scratch tomorrow sitting on top of Cassandra, which technologies would
you select to serve HTTP requests to ensure you get:

a) The best support from the cassandra community (e.g. timely updates
of drivers, better stability)
b) Optimal efficiency between webservers and cassandra cluster, in
terms of the performance of individual requests and in the volumes of
connections handled per second
c) Ease of development and deployment.

What worked for you, and why? What didn't work for you?


Thanks,
Andy


-- 
Andy Ballingall
Senior Software Engineer

The Foundry
6th Floor, The Communications Building,
48, Leicester Square,
London, WC2H 7LT, UK
Tel: +44 (0)20 7968 6828 - Fax: +44 (0)20 7930 8906
Web: http://www.thefoundry.co.uk/

The Foundry Visionmongers Ltd.
Registered in England and Wales No: 4642027


Re: What is the ideal server-side technology stack to use with Cassandra?

2012-08-17 Thread Tim Wintle
On Fri, 2012-08-17 at 11:09 +0100, Andy Ballingall TF wrote:
> So my question is - if you were to build a new scalable project from
> scratch tomorrow sitting on top of Cassandra, which technologies would
> you select to serve HTTP requests to ensure you get:
> 
> a) The best support from the cassandra community (e.g. timely updates
> of drivers, better stability)
> b) Optimal efficiency between webservers and cassandra cluster, in
> terms of the performance of individual requests and in the volumes of
> connections handled per second
> c) Ease of development and deployment.
> 
> What worked for you, and why? What didn't work for you?

We do almost everything in python, so our stack is basically
python-everywhere (with a bit of C and a bit of PHP).

If you're most comfortable in PHP, I'd suggest writing a data layer in
another language (Java or python) which handles the cassandra requests,
and then making requests back to that from PHP.

That's general advice for any scalable system though - the frontends are
stateless and can be scaled out horizontally (with caching if it fits
your requirements).

If you split your data layer into parts that are stateless and parts
which aren't, then you can load balance the horizontally scalable parts
of that layer using something like haproxy too, if you need to.

Tim Wintle



Re: Understanding UnavailableException

2012-08-17 Thread Maciej Miklas
UnavailableException is a bit tricky. It means that not all replicas required
by the CL received the update. You do not actually know whether the update
was stored or not, or what exactly went wrong.

This is why writing with CL.ALL can get problematic. It is enough for only
one replica to be offline and you will get the exception. Remember also that
CL.ALL means all replicas in all data centers - not only the local DC.
Writing with LOCAL_QUORUM could be a better idea.

There is only one CL where an exception guarantees that the data was really
not stored: CL.ANY with hinted handoff enabled.

One more thing: a write always goes to all replicas, independent of the
provided CL. The client request blocks only until the required replicas
respond - the remaining responses arrive asynchronously. This means that when
you write with a lower CL, the replicas still get the data at the same speed;
only your client does not wait for acknowledgment from all of them.
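As a rough illustration of that last point (a sketch, not Cassandra's actual
code): the write goes to every replica, but the coordinator only waits for
the number of acks implied by the CL, e.g.:

    // Illustrative only - not Cassandra's code. LOCAL_QUORUM is simplified
    // here; it really counts replicas in the local DC only.
    enum ConsistencyLevel { ONE, QUORUM, LOCAL_QUORUM, ALL }

    static int blockFor(ConsistencyLevel cl, int replicationFactor) {
        switch (cl) {
            case ONE:          return 1;
            case LOCAL_QUORUM:
            case QUORUM:       return replicationFactor / 2 + 1;
            case ALL:          return replicationFactor;
            default:           throw new IllegalArgumentException();
        }
    }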

Ciao,
Maciej


On Fri, Aug 17, 2012 at 11:07 AM, Mohit Agarwal wrote:

> Hi guys,
>
> I am trying to understand what happens when an UnavailableException is
> thrown.
>
> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
> My understanding is that if one of the nodes is down and the coordinator
> node is aware of that(through gossip), then it will respond to the request
> with an UnavailableException. Is this correct?
>
> b) What happens if the coordinator isn't aware of a node being down and
> sends the request to all the nodes and never hears back from one of the
> nodes. Would this result in a TimedOutException or an UnavailableException?
>
> c) I am trying to understand the cases where the client receives an error,
> but data could have been inserted into Cassandra. One such case is the
> TimedOutException. Are there any other situations like these?
>
> Thanks,
> Mohit
>


Re: indexing question related to playOrm on github

2012-08-17 Thread Hiller, Dean
I am not sure what you mean by playing with the timestamp. I think this works
without playing with the timestamp (thanks for your help, as it got me here).

 1.  On a scan I hit 
 2.  I end up looking up the pk
 3.  I compare the value in the row with the indexed value "mike" but I see the 
row with that pk has Sam not Mike
 4.  I now know I can discard this result as a false positive.  I also know my 
index has duplicates.
 5.  I kick off a job to scan the complete index now AND read in each pk row of 
the index comparing indexed value with the actual value in the row to fix the 
index.

I think that might work pretty well.
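In illustrative pseudo-Java (all helper names here are made up), that read
path is:

    // Sketch of steps 1-5: detect a stale index entry on read, discard it,
    // and kick off a background job that rebuilds the index.
    for (IndexEntry hit : scanIndex("name", "mike")) {  // 1) scan hits
        Row row = readRow(hit.pk());                    // 2) look up the pk
        if (!"mike".equals(row.get("name"))) {          // 3) compare values
            discard(hit);                               // 4) false positive
            scheduleIndexRepairJob("name");             // 5) async index fix-up
        }
    }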

Thanks,
Dean

From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, August 16, 2012 4:55 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: indexing question related to playOrm on github

 I am not sure synchronization fixes that… It would be kind of
nice if the column <65> would not actually be removed until after
all servers are eventually consistent...
Not sure thats possible.

You can either serialise updating your custom secondary index on the client 
site or resolve the inconsistency on read.

Not sure this fits with your workload but as an e.g. when you read from the 
index, if you detect multiple row PK's resolve the issue on the client and 
leave the data in cassandra as is. Then queue a job that will read the row and 
try to repair it's index entries. When repairing the index entry play with the 
timestamp so any deletions you make only apply to the column as it was when you 
saw the error.

Hope that helps.


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 12:47 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

Maybe this would be a special type of column family that could contain
these as my other tables definitely don't want the feature below by the
way.

Dean

On 8/16/12 6:29 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

Yes, the synch may work, and no, I do "not" want a transaction… I want a
different kind of eventual consistency.

That might work.
Let's say server 1 sends a mutation (65 is the pk)
Remove: <65>  Add <65>
Server 2 also sends a mutation (65 is the pk)
Remove: <65> Add <65>

What everyone does not want is to end up with a row that has <65>
and <65>.  With the wide row pattern, we would like to have ONE or
the other.  I am not sure synchronization fixes that… It would be kind of
nice if the column <65> would not actually be removed until after
all servers are eventually consistent AND would keep a reference to the
add that was happening so that when it goes to resolve eventually
consistent between the servers, it would see that <65> is newer and
it would decide to drop the first add completely.

Ie. In a full process it might look like this
Cassandra node 1 receives remove <65>, add <65> AND in the
remove column stores info about the add <65> until eventual
consistency is completed
Cassandra node 2 one ms later receives remove <65> and <65>
AND in the remove column stores info about the add <65> until
eventual consistency is completed
Eventual consistency starts comparing node 1 and node 2 and finds
<65> is being removed by different servers and finds add info
attached to that.  ONLY THE LAST add info is acknowledged and it makes
the row consistent across the cluster.

That makes everyone's wide row indexing pattern tend to get less corrupt
over time.

Thanks,
Dean


From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, August 15, 2012 8:26 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: indexing question related to playOrm on github

1.  Can playOrm be listed on cassandra's list of ORMs?  It supports a
JQL/HQL query on a trillion rows in under 100ms (partitioning is the
trick so you can JQL a partition)
No sure if we have an ORM specific page. If it's a client then feel free
to add it to http://wiki.apache.org/cassandra/ClientOptions

I was wondering if Cassandra has, or will ever support, eventual consistency
where it keeps both the REMOVE AND the ADD together until it is on
all 3 replicated nodes, and in resolving the consistency would end up with
an index that only has the very last one in the index.
Not sure I fully understand but it sounds like you want a transaction,
which is not going to happen.

Internally when Cassandra updates a secondary index it does the same
thing. But it synchronises updates around the same row so one thread 

Re: What is the ideal server-side technology stack to use with Cassandra?

2012-08-17 Thread Edward Capriolo
The best stack is the THC stack. :)

Tomcat Hadoop Cassandra :)

On Fri, Aug 17, 2012 at 6:09 AM, Andy Ballingall TF
 wrote:
> Hi,
>
> I've been running a number of tests with Cassandra using a couple of
> PHP drivers (namely PHPCassa (https://github.com/thobbs/phpcassa/) and
> PDO-cassandra (http://code.google.com/a/apache-extras.org/p/cassandra-pdo/)),
> and the experience hasn't been great, mainly because I can't try out
> CQL3.
>
> Aaron Morton (aa...@thelastpickle.com) advised:
>
> "If possible i would avoid using PHP. The PHP story with cassandra has
> not been great in the past. There is little love for it, so it takes a
> while for work changes to get in the client drivers.
>
> AFAIK it lacks server side states which makes connection pooling
> impossible. You should not pool cassandra connections in something
> like HAProxy."
>
> So my question is - if you were to build a new scalable project from
> scratch tomorrow sitting on top of Cassandra, which technologies would
> you select to serve HTTP requests to ensure you get:
>
> a) The best support from the cassandra community (e.g. timely updates
> of drivers, better stability)
> b) Optimal efficiency between webservers and cassandra cluster, in
> terms of the performance of individual requests and in the volumes of
> connections handled per second
> c) Ease of development and deployment.
>
> What worked for you, and why? What didn't work for you?
>
>
> Thanks,
> Andy
>
>
> --
> Andy Ballingall
> Senior Software Engineer
>
> The Foundry
> 6th Floor, The Communications Building,
> 48, Leicester Square,
> London, WC2H 7LT, UK
> Tel: +44 (0)20 7968 6828 - Fax: +44 (0)20 7930 8906
> Web: http://www.thefoundry.co.uk/
>
> The Foundry Visionmongers Ltd.
> Registered in England and Wales No: 4642027


Re: Understanding UnavailableException

2012-08-17 Thread Mohit Agarwal
Does this mean that the coordinator sends requests to all nodes even when it
knows, via gossip, that a sufficient number of nodes is not available?

On Fri, Aug 17, 2012 at 4:49 PM, Maciej Miklas  wrote:

> UnavailableException is a bit tricky. It means that not all replicas
> required by the CL received the update. You do not actually know whether the
> update was stored or not, or what exactly went wrong.
>
> This is why writing with CL.ALL can get problematic. It is enough for only
> one replica to be offline and you will get the exception. Remember also that
> CL.ALL means all replicas in all data centers - not only the local DC.
> Writing with LOCAL_QUORUM could be a better idea.
>
> There is only one CL where an exception guarantees that the data was really
> not stored: CL.ANY with hinted handoff enabled.
>
> One more thing: a write always goes to all replicas, independent of the
> provided CL. The client request blocks only until the required replicas
> respond - the remaining responses arrive asynchronously. This means that
> when you write with a lower CL, the replicas still get the data at the same
> speed; only your client does not wait for acknowledgment from all of them.
>
> Ciao,
> Maciej
>
>
>
> On Fri, Aug 17, 2012 at 11:07 AM, Mohit Agarwal wrote:
>
>> Hi guys,
>>
>> I am trying to understand what happens when an UnavailableException is
>> thrown.
>>
>> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
>> My understanding is that if one of the nodes is down and the coordinator
>> node is aware of that(through gossip), then it will respond to the request
>> with an UnavailableException. Is this correct?
>>
>> b) What happens if the coordinator isn't aware of a node being down and
>> sends the request to all the nodes and never hears back from one of the
>> nodes. Would this result in a TimedOutException or an UnavailableException?
>>
>> c) I am trying to understand the cases where the client receives an
>> error, but data could have been inserted into Cassandra. One such case is
>> the TimedOutException. Are there any other situations like these?
>>
>> Thanks,
>> Mohit
>>
>
>


Re: Opscenter 2.1 vs 1.3

2012-08-17 Thread Nick Bailey
Robin,

Are you talking about total writes to the cluster, writes to a
specific column family, or something else?

There have been some changes to OpsCenter's metric collection/storage
system, but nothing that should cause something like that. Also, it's
possible the number of writes to the OpsCenter keyspace itself would
have changed quite a bit between those versions; I'm assuming you
don't mean the column families in the OpsCenter keyspace though, right?

-Nick

On Thu, Aug 16, 2012 at 7:05 PM, aaron morton  wrote:
> You may have better luck on the Data Stax forums
> http://www.datastax.com/support-forums/
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 4:36 AM, Robin Verlangen  wrote:
>
> Hi there,
>
> I just upgraded to opscenter 2.1 (from 1.3). It appears that my writes have
> tripled. Is this a change in the display/measuring of opscenter?
>
>
> Best regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>


Re: Understanding UnavailableException

2012-08-17 Thread Nick Bailey
This blog post should help:

http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure

But to answer your question:

>> UnavailableException is a bit tricky. It means that not all replicas
>> required by the CL received the update. You do not actually know whether
>> the update was stored or not, or what exactly went wrong.
>>

This is actually incorrect. If you get an UnavailableException, the
write was rejected by the coordinator and was not written anywhere.


>>> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node cluster.
>>> My understanding is that if one of the nodes is down and the coordinator
>>> node is aware of that(through gossip), then it will respond to the request
>>> with an UnavailableException. Is this correct?

Correct

>>>
>>> b) What happens if the coordinator isn't aware of a node being down and
>>> sends the request to all the nodes and never hears back from one of the
>>> nodes. Would this result in a TimedOutException or an UnavailableException?
>>>

You will get a TimedOutException

>>> c) I am trying to understand the cases where the client receives an
>>> error, but data could have been inserted into Cassandra. One such case is
>>> the TimedOutException. Are there any other situations like these?
>>>

This should be the only case.
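The reason nothing is written on failure is that the check happens before any
replica is contacted - loosely (an illustrative sketch, not the actual code):

    // Illustrative sketch: if fewer live replicas exist than the CL
    // requires, fail up front - no mutation has been sent anywhere yet.
    class CoordinatorSketch {
        static class UnavailableException extends Exception {}

        static void assureSufficientLiveNodes(int required, int alive)
                throws UnavailableException {
            if (alive < required)
                throw new UnavailableException();
        }
    }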


Re: wild card on query

2012-08-17 Thread Swathi Vikas
Thank you very much Aaron. Information you provided is very helpful.
 
Have a great Weekend!!!
swat.vikas
 



From: aaron morton 
To: user@cassandra.apache.org 
Sent: Thursday, August 16, 2012 6:29 PM
Subject: Re: wild card on query

> I want to retrieve all the photos from all the users of a certain project.
> My SQL-like query would be "select projectid * photos from Users". How can I
> run this kind of row key predicate while executing a query on Cassandra?
You cannot / should not do that using the data model you have. (i.e. you could 
do it with a secondary index, but in this case you probably should not).

Try to de-normalise your data. 

Say a CF called ProjectPhotos

* row key is the project_id
* column name is 
* column value is image_url or JSON data about the image. 

You would then slice some columns from one row in the  ProjectPhotos CF. 

You then need to know what images a user has uploaded, with say the UserPhotos 
CF. 

* row key is user_id
* column name is timestamp
* column value is image_url or JSON data about the image. 

I did a twitter sample app at http://wdcnz.com a couple of weeks ago that shows 
denormalising data  https://github.com/amorton/wdcnz-2012-site and 
http://www.slideshare.net/aaronmorton/hellow-world-cassandra

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 12:39 AM, Swathi Vikas  wrote:

> Hi,
> 
> I am trying to run a query on a Cassandra cluster with a predicate on the row key.
> 
> I have a column family called "Users" and rows with row keys like
> "projectid_userid_photos". Each user within a project can have rows like
> projectid_userid_blog, projectid_userid_status, and so on.
> 
> I want to retrieve all the photos from all the users of a certain project.
> My SQL-like query would be "select projectid * photos from Users". How can I
> run this kind of row key predicate while executing a query on Cassandra?
> 
> Any suggestion will help.
> 
> Thank you,
> swat.vikas

Re: C++ Bulk loader and Result set streaming.

2012-08-17 Thread Swathi Vikas
1) For now I am using sstableloader. I think some time later I will write
some code using RPC.
 
2) Yes, I looked into many blogs and found that I need to use the last index
to retrieve the next 100 rows. I was trying to save some time in case someone
had already done this kind of streaming. I will write the code to do that.
 
Thank you very much,
Swat.vikas



From: aaron morton 
To: user@cassandra.apache.org; Swathi Vikas  
Sent: Thursday, August 16, 2012 7:06 PM
Subject: Re: C++ Bulk loader and Result set streaming.


> But i couldn't find any information on bulk loading using C++ client
> interface.
You cannot.
To bulk load data use the sstableloader, otherwise you need to use the RPC /
CQL API.

> 2) I want to retrieve all the result of the query (not just first 100 result
> set) using C++ client. Is there any C++ supporting code or information on
> streaming the result set into a file or something.
I've not looked at the C++ client, but normally you use the last column
returned as the start column for the next call.
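That paging pattern, as an illustrative Java sketch (fetchSlice, process, and
lastColumnName are made-up helpers; 100-column pages are assumed):

    ByteBuffer start = ByteBuffer.wrap(new byte[0]); // empty = start of row
    boolean firstPage = true;
    while (true) {
        List<ColumnOrSuperColumn> page = fetchSlice(rowKey, start, 101);
        process(page, /* skipFirst = */ !firstPage); // later pages repeat the
                                                     // previous last column
        if (page.size() < 101)
            break;                                   // short page: done
        start = lastColumnName(page);                // resume from last column
        firstPage = false;
    }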

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com/

On 17/08/2012, at 6:08 AM, Swathi Vikas  wrote:

Hi All,
>
>I am using C++ client libQtCassandra. I have two questions.
>
>1) I want to bulk load data into cassandra through C++ interface. It is 
>required by my group where i am doing internship. I could bulk load using 
>sstableloader as specified in Datastax 
>:http://www.datastax.com/dev/blog/bulk-loading. But i couldn't find any 
>information on bulk loading using C++ client interface. 
>
>2) I want to retrieve all the result of the query(not just first 100 result 
>set) using C++ client. Is there any C++ supporting code or information on 
>streaming the result set into a file or something.
>
>If anyone has any information please direct me where i can look into.
>
>Thank you very much,
>Swat.vikas

Re: nodetool repair uses insane amount of disk space

2012-08-17 Thread Jim Cistaro
We see similar issues with some of the repairs at Netflix.

Regarding the growth in payload… we see similar symptoms where nodes can double 
or triple size.  Part of this may be because the repair may deal in large 
chunks for comparisons.  This means that even if there is one byte of entropy, 
you copy over a large chunk.  Another reason for the large growth is that if 
node A is inconsistent with replicas on B and C, you will copy over multiple 
sets of large chunks (one from each of the replicas) - even more sets in a 
multi datacenter environment.  (We are still investigating/analyzing the causes 
of such occurrences in our clusters – the above explanation is a possible 
cause.)

Are you only seeing growth on one node in the system?  You might want to check
whether other nodes' logs show gossip issues with this node (and then check if
you are creating a lot of hints, and check your hint settings to make sure you
save and replay them) - that may be why you see this even on back-to-back
executions.

It is worth noting that we do major compactions (I am not suggesting you do 
this, just pointing it out for reference) and then see the payload shrink back 
down to normal.  So a lot of that payload increase appears to be redundant 
(most likely due to the chunking issue above).

Regarding processing time… Are you repairing each node serially? Are you
repairing with the primary range option?
AFAIK you most likely want to use -pr. Otherwise, the further you get into
the list of nodes, the more data has to go through the validation compaction
(because you increased the size of some of your nodes). Using -pr means you
only repair a range once when repairing the cluster. Without it, you repair
the range on each node/replica.
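For example (assuming the 1.x nodetool syntax; host and keyspace are
placeholders):

    nodetool -h <host> repair -pr <keyspace>

run in turn against every node, so each range is validated exactly once.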

Jim


From: aaron morton <aa...@thelastpickle.com>
Reply-To: <user@cassandra.apache.org>
Date: Fri, 17 Aug 2012 20:40:54 +1200
To: <user@cassandra.apache.org>
Subject: Re: nodetool repair uses insane amount of disk space

I would take a look at the replication: whats the RF per DC and what does 
nodetool ring say. It's hard (as in no recommended) to get NTS with rack 
allocation working correctly. Without know much more I would try to understand 
what the topology is and if it can be simplified.

Additionally, the repair process takes (what I feel is) an extremely long time 
to complete (36+ hours), and it always seems that nodes are streaming data to 
each other, even on back-to-back executions of the repair.
Run some metrics to clock the network IO during repair.
Also run an experiment to repair a single CF twice from the same node and look 
at the logs for the second run. This will give us an idea of how much data is 
being transferred.
Note that very wide rows can result in large repair transfers as the whole row 
is diff'd and transferred if needed.

Hope that helps.


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 11:14 AM, Michael Morris <michael.m.mor...@gmail.com> wrote:

Upgraded to 1.1.3 from 1.0.8 about 2 weeks ago.

On Thu, Aug 16, 2012 at 5:57 PM, aaron morton <aa...@thelastpickle.com> wrote:
What version are you using? There were issues with repair using lots-o-space
in 0.8.X; it's fixed in 1.X.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2012, at 2:56 AM, Michael Morris <michael.m.mor...@gmail.com> wrote:

Occasionally as I'm doing my regular anti-entropy repair I end up with a node 
that uses an exceptional amount of disk space (node should have about 5-6 GB of 
data on it, but ends up with 25+GB, and consumes the limited amount of disk 
space I have available)

How come a node would consume 5x its normal data size during the repair process?

My setup is kind of strange in that it's only about 80-100GB of data on a 35 
node cluster, with 2 data centers and 3 racks, however the rack assignments are 
unbalanced.  One data center has 8 nodes, and the other data center is split 
into 2 racks with one rack of 9 nodes, and the other with 18 nodes.  However, 
within each rack, the tokens are distributed equally. It's a long sad story 
about how we ended up this way, but it basically boils down to having to 
utilize existing resources to resolve a production issue.

Additionally, the repair process takes (what I feel is) an extremely long time 
to complete (36+ hours), and it always seems that nodes are streaming data to 
each other, even on back-to-back executions of the repair.

Any help on these issues is appreciated.

- Mike






Re: What is the ideal server-side technology stack to use with Cassandra?

2012-08-17 Thread Aaron Turner
My stack:

Java + JRuby + Rails + Torquebox

I'm using the Hector client (arguably the most mature out there) and
JRuby+RoR+Torquebox gives me a great development platform which really
scales (full native thread support for example) and is extremely
powerful.  Honestly, I expect all my future RoR apps will be built on
JRuby/Torquebox because I've been so happy with it even if I don't
have a specific need to utilize Java libraries from inside the app.

And the best part is that I've yet to have to write a single line of Java! :)



On Fri, Aug 17, 2012 at 6:53 AM, Edward Capriolo  wrote:
> The best stack is the THC stack. :)
>
> Tomcat Hadoop Cassandra :)
>
> On Fri, Aug 17, 2012 at 6:09 AM, Andy Ballingall TF
>  wrote:
>> Hi,
>>
>> I've been running a number of tests with Cassandra using a couple of
>> PHP drivers (namely PHPCassa (https://github.com/thobbs/phpcassa/) and
>> PDO-cassandra (http://code.google.com/a/apache-extras.org/p/cassandra-pdo/)),
>> and the experience hasn't been great, mainly because I can't try out
>> CQL3.
>>
>> Aaron Morton (aa...@thelastpickle.com) advised:
>>
>> "If possible i would avoid using PHP. The PHP story with cassandra has
>> not been great in the past. There is little love for it, so it takes a
>> while for work changes to get in the client drivers.
>>
>> AFAIK it lacks server side states which makes connection pooling
>> impossible. You should not pool cassandra connections in something
>> like HAProxy."
>>
>> So my question is - if you were to build a new scalable project from
>> scratch tomorrow sitting on top of Cassandra, which technologies would
>> you select to serve HTTP requests to ensure you get:
>>
>> a) The best support from the cassandra community (e.g. timely updates
>> of drivers, better stability)
>> b) Optimal efficiency between webservers and cassandra cluster, in
>> terms of the performance of individual requests and in the volumes of
>> connections handled per second
>> c) Ease of development and deployment.
>>
>> What worked for you, and why? What didn't work for you?

-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
"carpe diem quam minimum credula postero"


Re: Understanding UnavailableException

2012-08-17 Thread Mohit Agarwal
Thanks, Nick, for your answers. The blog post is very well written and was
much needed, I guess.

On Fri, Aug 17, 2012 at 8:30 PM, Nick Bailey  wrote:

> This blog post should help:
>
> http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
>
> But to answer your question:
>
> >> UnavailableException is a bit tricky. It means that not all replicas
> >> required by the CL received the update. You do not actually know whether
> >> the update was stored or not, or what exactly went wrong.
> >>
>
> This is actually incorrect. If you get an UnavailableException, the
> write was rejected by the coordinator and was not written anywhere.
>
>
> >>> a) Suppose we are doing a ConsistencyLevel.ALL write on a 3 node
> cluster.
> >>> My understanding is that if one of the nodes is down and the
> coordinator
> >>> node is aware of that(through gossip), then it will respond to the
> request
> >>> with an UnavailableException. Is this correct?
>
> Correct
>
> >>>
> >>> b) What happens if the coordinator isn't aware of a node being down and
> >>> sends the request to all the nodes and never hears back from one of the
> >>> nodes. Would this result in a TimedOutException or an
> >>> UnavailableException?
> >>>
>
> You will get a TimedOutException
>
> >>> c) I am trying to understand the cases where the client receives an
> >>> error, but data could have been inserted into Cassandra. One such case
> is
> >>> the TimedOutException. Are there any other situations like these?
> >>>
>
> This should be the only case.
>


Re: Understanding UnavailableException

2012-08-17 Thread Russell Haering
On Fri, Aug 17, 2012 at 8:00 AM, Nick Bailey  wrote:
> This is actually incorrect. If you get an UnavailableException, the
> write was rejected by the coordinator and was not written anywhere.

Last time I checked, this was not true for batch writes. The row
mutations were started sequentially (i.e., for each mutation check
availability, then kick off an asynchronous write), so it was possible
for the first to succeed, and the second to fail with an
UnavailableException.

We had this exact thing happen to us with a custom secondary indexing
system, where we wrote the index but not the data, which at the time
broke a few assumptions we had made.

I would support changing this so that availability is evaluated for
all rows in an initial pass, and once that pass has completed there
would be no circumstances under which an UnavailableException would be
thrown. But the whole thing is of limited value because you could
still get a TimedOutException; there's no way around needing to handle
the "I don't know what got written" scenario.


Re: Why the StageManager thread pools have 60 seconds keepalive time?

2012-08-17 Thread Guillermo Winkler
Aaron, thanks for your answer.

I'm actually tracking a problem where mutations get dropped while cfstats
shows no activity whatsoever: I have 100 threads for the mutation pool and no
running or pending tasks, but some mutations get dropped nonetheless.

I'm thinking about some scheduling problems but not really sure yet.

Have you ever seen a case of dropped mutations with the system under light
load?

Thanks,
Guille


On Thu, Aug 16, 2012 at 8:22 PM, aaron morton wrote:

> That's some pretty old code. I would guess it was done that way to
> conserve resources. And _i think_ thread creation is pretty light weight.
>
> Jonathan / Brandon / others - opinions ?
>
> Cheers
>
>
>   -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 8:09 AM, Guillermo Winkler 
> wrote:
>
> Hi, I have a cassandra cluster where I'm seeing a lot of thread thrashing
> from the mutation pool.
>
> MutationStage:72031
>
> where threads get created and disposed of in batches of 100 every few
> minutes. Since it's a 16-core server, concurrent_writes is set to 100 in the
> cassandra.yaml.
>
> concurrent_writes: 100
>
> I've seen in the StageManager class this pools get created with 60 seconds
> keepalive time.
>
> DebuggableThreadPoolExecutor -> allowCoreThreadTimeOut(true);
>
> StageManager-> public static final long KEEPALIVE = 60; // seconds to keep
> "extra" threads alive for when idle
>
> Is there a reason for it to be this way?
>
> Why not have a fixed size pool with Integer.MAX_VALUE as keepalive since
> corePoolSize and maxPoolSize are set at the same size?
>
> Thanks,
> Guille
>
>
>
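For reference, the configuration being discussed is reproducible with a plain
ThreadPoolExecutor (a minimal sketch of what the described code does, not the
actual StageManager source):

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // core == max == concurrent_writes; the 60s keepalive also applies to
    // core threads, so an idle pool tears all threads down and recreates
    // them in bursts - the thrashing described above.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
            100, 100,
            60, TimeUnit.SECONDS,
            new LinkedBlockingQueue<Runnable>());
    pool.allowCoreThreadTimeOut(true);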


Re: composite table with cassandra without using cql3?

2012-08-17 Thread Ben Frank
Hi Dean,
   I'm interested in this too, but I get a 404 with the link below; it looks
like I can't see your nosqlORM project.

-Ben

On Thu, Aug 2, 2012 at 9:04 AM, Hiller, Dean  wrote:

> For how to do it with astyanax, you can see here...
>
> Lines 310 and 335
>
> https://github.com/deanhiller/nosqlORM/blob/indexing/input/javasrc/com/alvazan/orm/layer3/spi/db/cassandra/CassandraSession.java
>
>
> For how to do with thrift, you could look at astyanax.
>
> I use it on that project for indexing for the ORM layer we use(which is
> not listed on the cassandra ORM's page as of yet ;) ).
>
> Later,
> Dean
>
>
> On 8/2/12 9:50 AM, "Greg Fausak"  wrote:
>
> >I've been using the cql3 to create a composite table.
> >Can I use the thrift interface to accomplish the
> >same thing?  In other words, do I have to use cql 3 to
> >get a composite table type? (The same behavior as
> >multiple PRIMARY key columns).
> >
> >Thanks,
> >---greg
>
>
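For context, the kind of table Greg describes looks something like this in
CQL 3 (an illustrative example, not taken from the thread):

    CREATE TABLE events (
        user_id text,
        event_time timestamp,
        payload text,
        PRIMARY KEY (user_id, event_time)
    );

Under the hood this is stored as a wide row keyed by user_id, with event_time
as the first component of each composite column name - which is the
representation the Thrift interface sees.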


Re: nodetool repair uses insane amount of disk space

2012-08-17 Thread Peter Schuller
> How come a node would consume 5x its normal data size during the repair
> process?

https://issues.apache.org/jira/browse/CASSANDRA-2699

It's likely a variation based on how out of synch you happen to be,
and whether you have a neighbor that's also been repaired and bloated
up already.

> My setup is kind of strange in that it's only about 80-100GB of data on a 35
> node cluster, with 2 data centers and 3 racks, however the rack assignments
> are unbalanced.  One data center has 8 nodes, and the other data center is
> split into 2 racks with one rack of 9 nodes, and the other with 18 nodes.
> However, within each rack, the tokens are distributed equally. It's a long
> sad story about how we ended up this way, but it basically boils down to
> having to utilize existing resources to resolve a production issue.

https://issues.apache.org/jira/browse/CASSANDRA-3810

In terms of DCs, different DCs are effectively independent of each
other in terms of replica placement. So there is no need or desire for
two DCs to be symmetrical.

The racks are important though if you are trying to take advantage of
racks being somewhat independent failure domains (for reasons outlined
in 3810 above).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)