repair and amount of transfers

2011-06-14 Thread Terje Marthinussen
Hi,

I have been testing repairs a bit in different ways on 0.8.0 and I am
curious on what to really expect in terms of data transferred.

I would expect my data to be fairly consistent in this case from the start.
More than a billion supercolumns, but there were no errors in the feed and we
have seen minimal amounts of read repair going on while doing a complete
scan of the data for consistency checking. As such, I would also expect
repair to finish reasonably fast.

On some nodes it finishes in a couple of hours, but on other nodes it is
taking more than 12 hours and I see some 65GB of data streamed to the node,
which surprises me as I am pretty sure that it is not that far out of sync.

Not sure how much the merkle trees are actually reducing what needs to be
streamed though.

What should we expect to see if this works?

Regards,
Terje


Re: odd logs after repair

2011-06-14 Thread Sasha Dolgy
Hi ...

Does anyone else see these types of INFO messages in their log files,
or is it just me?

INFO [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
21:28:39,877 AntiEntropyService.java (line 177) Excluding
/10.128.34.18 from repair because it is on version 0.7 or sooner. You
should consider updating this node before running repair again.
ERROR [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
21:28:39,877 AbstractCassandraDaemon.java (line 113) Fatal exception
in thread Thread[manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec,5,RMI
Runtime]
java.util.ConcurrentModificationException
   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
   at java.util.HashMap$KeyIterator.next(HashMap.java:828)
   at 
org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:173)
   at 
org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:776)

I'm at a loss as to why this is showing up in the logs.
-sd

On Mon, Jun 13, 2011 at 3:58 PM, Sasha Dolgy  wrote:
> hm.  that's not it.  we've been using a non-standard jmx port for some 
> time
>
> i've dropped the keyspace and recreated ...
>
> wonder if that'll help
>
> On Mon, Jun 13, 2011 at 3:57 PM, Tyler Hobbs  wrote:
>> On Mon, Jun 13, 2011 at 8:41 AM, Sasha Dolgy  wrote:
>>>
>>> I recall there being a discussion about a default port changing from
>>> 0.7.x to 0.8.x ...this was JMX, correct?  Or were there others.
>>
>> Yes, the default JMX port changed from 8080 to 7199.  I don't think there
>> were any others.


Re: odd logs after repair

2011-06-14 Thread Sylvain Lebresne
The exception itself is a bug (I've created
https://issues.apache.org/jira/browse/CASSANDRA-2767 to fix it).

However, the important message is the previous one (even if the
exception were not thrown, repair wouldn't be able to work correctly,
so the fact that the exception is thrown is not such a big deal).
Apparently, from the standpoint of whichever node this log is from,
the node 10.128.34.18 is still running 0.7. You should check whether that is
the case (restarting 10.128.34.18 and looking for something like
'Cassandra version: 0.8.0' is one way). If the node does run
0.8.0 and you still get this error, then it would point to a problem
with our detection of the node versions.
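
For what it's worth, this is the classic Java pattern behind a ConcurrentModificationException. A minimal standalone sketch (not the actual AntiEntropyService code or the CASSANDRA-2767 patch): mutating a HashMap while iterating its own key set usually trips the fail-fast iterator, while iterating a snapshot copy does not.

import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CmeSketch {
    public static void main(String[] args) {
        Map<String, String> endpointVersions = new HashMap<String, String>();
        endpointVersions.put("10.128.34.18", "0.7.6");
        endpointVersions.put("10.128.34.19", "0.8.0");
        endpointVersions.put("10.128.34.20", "0.8.0");

        // Removing entries from the map while iterating its own key set:
        // HashMap iterators are fail-fast, so this usually throws CME mid-loop.
        try {
            for (String ep : endpointVersions.keySet()) {
                if (endpointVersions.get(ep).startsWith("0.7")) {
                    endpointVersions.remove(ep);
                }
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("iterating the live map failed: " + e);
        }

        // One common fix: iterate a snapshot copy and mutate the real map.
        for (String ep : new ArrayList<String>(endpointVersions.keySet())) {
            if (endpointVersions.get(ep).startsWith("0.7")) {
                endpointVersions.remove(ep);
            }
        }
        System.out.println("remaining 0.8 nodes: " + endpointVersions.keySet());
    }
}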

--
Sylvain

On Tue, Jun 14, 2011 at 9:55 AM, Sasha Dolgy  wrote:
> Hi ...
>
> Does anyone else see these type of INFO messages in their log files,
> or is i just me..?
>
> INFO [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
> 21:28:39,877 AntiEntropyService.java (line 177) Excluding
> /10.128.34.18 from repair because it is on version 0.7 or sooner. You
> should consider updating this node before running repair again.
> ERROR [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
> 21:28:39,877 AbstractCassandraDaemon.java (line 113) Fatal exception
> in thread Thread[manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec,5,RMI
> Runtime]
> java.util.ConcurrentModificationException
>       at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>       at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>       at 
> org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:173)
>       at 
> org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:776)
>
> I'm at a loss as to why this is showing up in the logs.
> -sd
>
> On Mon, Jun 13, 2011 at 3:58 PM, Sasha Dolgy  wrote:
>> hm.  that's not it.  we've been using a non-standard jmx port for some 
>> time
>>
>> i've dropped the keyspace and recreated ...
>>
>> wonder if that'll help
>>
>> On Mon, Jun 13, 2011 at 3:57 PM, Tyler Hobbs  wrote:
>>> On Mon, Jun 13, 2011 at 8:41 AM, Sasha Dolgy  wrote:

 I recall there being a discussion about a default port changing from
 0.7.x to 0.8.x ...this was JMX, correct?  Or were there others.
>>>
>>> Yes, the default JMX port changed from 8080 to 7199.  I don't think there
>>> were any others.
>


Re: get_indexed_slices ~ simple map-reduce

2011-06-14 Thread Michal Augustýn
Thank you!

I have one more question ;-) If I use the regular "get" function then I
can be sure that it takes ~5ms. So I suppose that if I use the
"get_indexed_slices" function then the response time depends on how
many rows match the most selective equality predicate. Am I right?

Augi

2011/6/14 aaron morton :
> From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...
>
> Candidate rows are first read by applying the most selected equality 
> predicate.
>
> From those candidate rows...
>
> 1) If the SlicePredicate has a SliceRange the query execution will read all 
> columns for the candidate row  if the byte size of the largest tracked row is 
> less than column_index_size_in_kb config setting (defaults to 64K). Meaning 
> if no more than 1 column index page of columns is (probably) going to be 
> read, they will all be read.
>
> 2) Otherwise the query will read the columns specified by the SliceRange.
>
> 3) If the SlicePredicate uses a list of columns names, those columns and the 
> ones referenced in the IndexExpressions (except the one selected as the 
> primary pivot above) are read from disk.
>
> If additional columns are needed (in case 2 above) they are read in a 
> separate reads from the candidate row.
>
> Then when applying the SlicePredicate to produce the final projection into 
> the result set, all the columns required to satisfy the filter will be in 
> memory.
>
>
> So, yes, it reads just the columns from disk that you ask for, unless it thinks 
> it will take no more work to read more.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 13 Jun 2011, at 08:34, Michal Augustýn wrote:
>
>> Hi,
>>
>> as I wrote, I don't want to install Hadoop etc. - I want just to use
>> the Thrift API. The core of my question is how does get_indexed_slices
>> function work.
>>
>> I know that it must get all keys using equality expression firstly -
>> but what about additional expressions? Does Cassandra fetch whole
>> filtered rows, or just columns used in additional filtering
>> expression?
>>
>> Thanks!
>>
>> Augi
>>
>> 2011/6/12 aaron morton :
>>> Not exactly sure what you mean here, all data access is through the thrift
>>> API unless you code java and embed cassandra in your app.
>>> As well as Pig support there is also Hive support in brisk (which will also
>>> have Pig support soon) http://www.datastax.com/products/brisk
>>> Can you provide some more info on the use case ? Personally if you have a
>>> read query you know you need to support, I would consider supporting it in
>>> the data model without secondary indexes.
>>> Cheers
>>>
>>> -
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
>>>
>>> Hi all,
>>>
>>> I'm thinking of get_indexed_slices function as a simple map-reduce job
>>> (that just maps) - am I right?
>>>
>>> Well, I would like to be able to run simple queries on values but I
>>> don't want to install Hadoop, write map-reduce jobs in Java (the whole
>>> application is in C# and I don't want to introduce new development
>>> stack - maybe Pig would help) and have some second interface to
>>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>>> rescue for me. I would have just one indexed column that will have
>>> day-timestamp value (~100k items per day) and the equality expression
>>> for this column would be in each query (and I would add more ad-hoc
>>> expressions).
>>> Will this scenario work or is there some issue I could run in?
>>>
>>> Thanks!
>>>
>>> Augi
>>>
>>>
>
>


Re: odd logs after repair

2011-06-14 Thread Sasha Dolgy
Hi Sylvain,

I verified on all nodes with nodetool version that they are 0.8 and have
even restarted nodes.  The problem still persists.  The four nodes all report
similar errors about the other nodes.

When I upgraded to 0.8, maybe there were relics of the keyspace that say
it's from an earlier version?

I need to create a new keyspace to see if that fixes the error.
On Jun 14, 2011 10:08 AM, "Sylvain Lebresne"  wrote:
> The exception itself is a bug (I've created
> https://issues.apache.org/jira/browse/CASSANDRA-2767 to fix it).
>
> However, the important message is the previous one (Even if the
> exception was not thrown, repair wouldn't be able to work correctly,
> so the fact that the exception is thrown is not such a big deal).
> Apparently, from the standpoint of whomever node this logs is from,
> the node 10.128.34.18 is still running 0.7. You should check if it is
> the case (restarting 10.128.34.18 and look for something like
> 'Cassandra version: 0.8.0' is one solution). If the does does run
> 0.8.0 and you still get this error, then it would point to a problem
> with our detection of the nodes.
>
> --
> Sylvain
>
> On Tue, Jun 14, 2011 at 9:55 AM, Sasha Dolgy  wrote:
>> Hi ...
>>
>> Does anyone else see these type of INFO messages in their log files,
>> or is i just me..?
>>
>> INFO [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
>> 21:28:39,877 AntiEntropyService.java (line 177) Excluding
>> /10.128.34.18 from repair because it is on version 0.7 or sooner. You
>> should consider updating this node before running repair again.
>> ERROR [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
>> 21:28:39,877 AbstractCassandraDaemon.java (line 113) Fatal exception
>> in thread Thread[manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec,5,RMI
>> Runtime]
>> java.util.ConcurrentModificationException
>>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>>   at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>>   at
org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:173)
>>   at
org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:776)
>>
>> I'm at a loss as to why this is showing up in the logs.
>> -sd
>>
>> On Mon, Jun 13, 2011 at 3:58 PM, Sasha Dolgy  wrote:
>>> hm.  that's not it.  we've been using a non-standard jmx port for some
time
>>>
>>> i've dropped the keyspace and recreated ...
>>>
>>> wonder if that'll help
>>>
>>> On Mon, Jun 13, 2011 at 3:57 PM, Tyler Hobbs  wrote:
 On Mon, Jun 13, 2011 at 8:41 AM, Sasha Dolgy  wrote:
>
> I recall there being a discussion about a default port changing from
> 0.7.x to 0.8.x ...this was JMX, correct?  Or were there others.

 Yes, the default JMX port changed from 8080 to 7199.  I don't think
there
 were any others.
>>


Re: odd logs after repair

2011-06-14 Thread Sylvain Lebresne
Could you open a ticket then, please?

--
Sylvain

On Tue, Jun 14, 2011 at 10:25 AM, Sasha Dolgy  wrote:
> Hi Sylvain,
>
> I verified on all nodes with nodetool version that they are 0.8 and have
> even restarted nodes.  Still persists.  The four nodes all report similar
> errors about the other nodes.
>
> When i upgraded to 0.8 maybe there were relics about the keyspace that say
> it's from an earlier version?
>
> I need to create a new keyspace to see if that fixes the error
>
> On Jun 14, 2011 10:08 AM, "Sylvain Lebresne"  wrote:
>> The exception itself is a bug (I've created
>> https://issues.apache.org/jira/browse/CASSANDRA-2767 to fix it).
>>
>> However, the important message is the previous one (Even if the
>> exception was not thrown, repair wouldn't be able to work correctly,
>> so the fact that the exception is thrown is not such a big deal).
>> Apparently, from the standpoint of whomever node this logs is from,
>> the node 10.128.34.18 is still running 0.7. You should check if it is
>> the case (restarting 10.128.34.18 and look for something like
>> 'Cassandra version: 0.8.0' is one solution). If the does does run
>> 0.8.0 and you still get this error, then it would point to a problem
>> with our detection of the nodes.
>>
>> --
>> Sylvain
>>
>> On Tue, Jun 14, 2011 at 9:55 AM, Sasha Dolgy  wrote:
>>> Hi ...
>>>
>>> Does anyone else see these type of INFO messages in their log files,
>>> or is i just me..?
>>>
>>> INFO [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
>>> 21:28:39,877 AntiEntropyService.java (line 177) Excluding
>>> /10.128.34.18 from repair because it is on version 0.7 or sooner. You
>>> should consider updating this node before running repair again.
>>> ERROR [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
>>> 21:28:39,877 AbstractCassandraDaemon.java (line 113) Fatal exception
>>> in thread Thread[manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec,5,RMI
>>> Runtime]
>>> java.util.ConcurrentModificationException
>>>       at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>>>       at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>>>       at
>>> org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:173)
>>>       at
>>> org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:776)
>>>
>>> I'm at a loss as to why this is showing up in the logs.
>>> -sd
>>>
>>> On Mon, Jun 13, 2011 at 3:58 PM, Sasha Dolgy  wrote:
 hm.  that's not it.  we've been using a non-standard jmx port for some
 time

 i've dropped the keyspace and recreated ...

 wonder if that'll help

 On Mon, Jun 13, 2011 at 3:57 PM, Tyler Hobbs  wrote:
> On Mon, Jun 13, 2011 at 8:41 AM, Sasha Dolgy  wrote:
>>
>> I recall there being a discussion about a default port changing from
>> 0.7.x to 0.8.x ...this was JMX, correct?  Or were there others.
>
> Yes, the default JMX port changed from 8080 to 7199.  I don't think
> there
> were any others.
>>>
>


Re: odd logs after repair

2011-06-14 Thread Sasha Dolgy
https://issues.apache.org/jira/browse/CASSANDRA-2768

On Tue, Jun 14, 2011 at 10:55 AM, Sylvain Lebresne  wrote:
> Could you open a ticket then please ?
>
> --
> Sylvain
>
> On Tue, Jun 14, 2011 at 10:25 AM, Sasha Dolgy  wrote:
>> Hi Sylvain,
>>
>> I verified on all nodes with nodetool version that they are 0.8 and have
>> even restarted nodes.  Still persists.  The four nodes all report similar
>> errors about the other nodes.
>>
>> When i upgraded to 0.8 maybe there were relics about the keyspace that say
>> it's from an earlier version?
>>
>> I need to create a new keyspace to see if that fixes the error
>>
>> On Jun 14, 2011 10:08 AM, "Sylvain Lebresne"  wrote:
>>> The exception itself is a bug (I've created
>>> https://issues.apache.org/jira/browse/CASSANDRA-2767 to fix it).
>>>
>>> However, the important message is the previous one (Even if the
>>> exception was not thrown, repair wouldn't be able to work correctly,
>>> so the fact that the exception is thrown is not such a big deal).
>>> Apparently, from the standpoint of whomever node this logs is from,
>>> the node 10.128.34.18 is still running 0.7. You should check if it is
>>> the case (restarting 10.128.34.18 and look for something like
>>> 'Cassandra version: 0.8.0' is one solution). If the does does run
>>> 0.8.0 and you still get this error, then it would point to a problem
>>> with our detection of the nodes.
>>>
>>> --
>>> Sylvain
>>>
>>> On Tue, Jun 14, 2011 at 9:55 AM, Sasha Dolgy  wrote:
 Hi ...

 Does anyone else see these type of INFO messages in their log files,
 or is i just me..?

 INFO [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
 21:28:39,877 AntiEntropyService.java (line 177) Excluding
 /10.128.34.18 from repair because it is on version 0.7 or sooner. You
 should consider updating this node before running repair again.
 ERROR [manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec] 2011-06-13
 21:28:39,877 AbstractCassandraDaemon.java (line 113) Fatal exception
 in thread Thread[manual-repair-1c6b33bc-ef14-4ec8-94f6-f1464ec8bdec,5,RMI
 Runtime]
 java.util.ConcurrentModificationException
       at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
       at java.util.HashMap$KeyIterator.next(HashMap.java:828)
       at
 org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:173)
       at
 org.apache.cassandra.service.AntiEntropyService$RepairSession.run(AntiEntropyService.java:776)

 I'm at a loss as to why this is showing up in the logs.
 -sd

 On Mon, Jun 13, 2011 at 3:58 PM, Sasha Dolgy  wrote:
> hm.  that's not it.  we've been using a non-standard jmx port for some
> time
>
> i've dropped the keyspace and recreated ...
>
> wonder if that'll help
>
> On Mon, Jun 13, 2011 at 3:57 PM, Tyler Hobbs  wrote:
>> On Mon, Jun 13, 2011 at 8:41 AM, Sasha Dolgy  wrote:
>>>
>>> I recall there being a discussion about a default port changing from
>>> 0.7.x to 0.8.x ...this was JMX, correct?  Or were there others.
>>
>> Yes, the default JMX port changed from 8080 to 7199.  I don't think
>> there
>> were any others.

>>
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: repair and amount of transfers

2011-06-14 Thread Terje Marthinussen
Ah..

I just found CASSANDRA-2698 (I thought I had seen something about this)...

I guess that means I have to see if I can find time to investigate whether I
have a reproducible case?

Terje

On Tue, Jun 14, 2011 at 4:21 PM, Terje Marthinussen  wrote:

> Hi,
>
> I have been testing repairs a bit in different ways on 0.8.0 and I am
> curious on what to really expect in terms of data transferred.
>
> I would expect my data to be fairly consistent in this case from the start.
> More than a billion supercolumns, but there was no errors in feed and we
> have seen minimal amounts of read repair going on while doing a complete
> scan of the data for consistency checking. As such, I would also expect
> repair to finish reasonably fast.
>
> On some nodes, it finishes in a couple of hours, but other nodes it is
> taking more than 12 hours and I see some 65GB of data streamed to the node
> which surprises me as I am pretty sure that it is not that out of sync.
>
> Not sure how much the merkle trees are actually reducing what needs to be
> streamed though.
>
> What should we expect to see if this works?
>
> Regards,
> Terje
>


RE: Are data migration tools for Cassandra exist?

2011-06-14 Thread Artem Orobets
Thank you for your answer.
We have investigated the Cassandra architecture, and we are interested in 
approaches for solving this problem.

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Sunday, June 12, 2011 5:51 AM
To: user@cassandra.apache.org
Subject: Re: Are data migration tools for Cassandra exist?

Depends on your scale, you can either code something yourself through the API 
or take advantage of the Hadoop integration and run jobs that read and write 
the data back. In either case you can change the code first to write to the new 
column as well as the old, then update all existing data, then change the code 
to read from the new field.
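
As a rough illustration of that dual-write step, here is a sketch against the raw Thrift client. The keyspace is assumed to be already set on the connection, the column family and column names ("Users", "city_old", "city") are hypothetical, and the bean setters are as in the 0.8-era generated Thrift code, so they may differ slightly in other versions.

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;

public class DualWriteSketch {
    // Write the same value under both the old and the new column name, so readers
    // can be switched over once the backfill of existing rows has finished.
    public static void writeBoth(Cassandra.Client client, String rowKey, String value)
            throws Exception {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes("UTF-8"));
        ColumnParent parent = new ColumnParent("Users");      // hypothetical CF
        long ts = System.currentTimeMillis() * 1000;          // microsecond timestamps

        Column oldCol = new Column();
        oldCol.setName("city_old".getBytes("UTF-8"));          // legacy column
        oldCol.setValue(value.getBytes("UTF-8"));
        oldCol.setTimestamp(ts);
        client.insert(key, parent, oldCol, ConsistencyLevel.QUORUM);

        Column newCol = new Column();
        newCol.setName("city".getBytes("UTF-8"));              // new column
        newCol.setValue(value.getBytes("UTF-8"));
        newCol.setTimestamp(ts);
        client.insert(key, parent, newCol, ConsistencyLevel.QUORUM);
    }
}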

When you change the comparator the existing data is not migrated. Some changes 
are backwards compatible, e.g. moving from BytesType to any other type, 
AsciiType to UTF8Type, moving from LongType to IntegerType.

What sort of change did you want to do?

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Jun 2011, at 21:21, Artem Orobets wrote:


If my application works in production and I change the structure of my data (e.g. 
the type of a column name),
I will need to process all my stored data.
As a variant, I can create a new column family and import the legacy data.

I think that is a typical task, so a tool for doing this should exist, but I can't 
find anything.
Are there any tools or proven approaches to solve this task?



Re: Is this the proper use of OPP?

2011-06-14 Thread Eric tamme
I would point you to this article, it does a good job describing OPP
and pretty much answers the specific questions you asked.

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

-Eric


On Mon, Jun 13, 2011 at 5:06 PM, AJ  wrote:
> I'm just becoming aware of the restrictions of using an OPP as compared to
> Random.  Please let me know if I understand this correctly.
>
> First off, if using the OPP only for an increased performance of range
> queries, then it will probably be very hard to predict if you will end up
> with hotspots or not and thus where and even how the data may be clustered
> together in a particular node.  This is because all the various keys of the
> various CFs may or may not have any correlation with one another.  So, in
> effect, you just have a big mess of keys of various ranges and formats, but
> they all are partitioned according to one global set of tokens that apply to
> ALL CFs of ALL keyspaces.
>
> [main reason for post below...]
> OTOH, if you want to use OPP to purposely cluster certain data together on
> specific nodes, such as for geographic partitioning, then you have to choose
> a prefix for all of the keys of ALL CFs and ALL keyspaces!  This is because
> they will all be partitioned based on the tokens assigned to the nodes.
>  IOW, if I had two datacenters, one in the US and another in Europe, then
> for all rows in all KSs and in all CFs, I would need to prepend a prefix to
> the keys, such as "US:" and "EU:".  The problem is I may not want ALL of my
> CFs to be partitioned this way; only specific ones.  Also, it may be very
> difficult if not impossible for all keys of all keyspaces and CFs to use
> keys of this form.  I'm not sure if Cass is designed for this.
>
> However, if using the random partitioner, then there is no problem.  You can
> use any key of any type you want (UTF8, Long, etc.) since they are all
> hashed before deciding which node gets the key/row.
>
> Do I understand things correctly or am I missing something?  Is Cass
> designed to use OPP this way or am I hacking it?  If so, is there an
> acceptable way to do geographic partitioning?
>
> Also, what is OPP really good for?
>
> Thanks!
>


Re: repair and amount of transfers

2011-06-14 Thread Peter Schuller
> I just found Cassandra-2698 (I thought I had seen something about this)...

There is also the other bug that causes repair to transfer data from
all CFs rather than just the one being repaired. This could be
affecting you if you're doing repair of individual CFs rather than
everything at the same time.

-- 
/ Peter Schuller


Re: Is this the proper use of OPP?

2011-06-14 Thread AJ
Thanks.  I found that article later.  I was definitely off-base with 
respect to OPP.  Random partitioning is pretty much the way to go and 
datastax has a good article on geographic distribution: 
http://www.datastax.com/docs/0.8/operations/datacenter


Sorry for the long pointless post previously.  But, FWIW, I don't see 
much use for OPP other than the corner case of a cluster consisting of one 
keyspace and one CF, such as an index.  I will have to read Dominic's post on 
having multiple Cassandra clusters running on the same nodes.


On 6/14/2011 4:46 AM, Eric tamme wrote:

I would point you to this article, it does a good job describing OPP
and pretty much answers the specific questions you asked.

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

-Eric


On Mon, Jun 13, 2011 at 5:06 PM, AJ  wrote:

I'm just becoming aware of the restrictions of using an OPP as compared to
Random.  Please let me know if I understand this correctly.

First off, if using the OPP only for an increased performance of range
queries, then it will probably be very hard to predict if you will end up
with hotspots or not and thus where and even how the data may be clustered
together in a particular node.  This is because all the various keys of the
various CFs may or may not have any correlation with one another.  So, in
effect, you just have a big mess of keys of various ranges and formats, but
they all are partitioned according to one global set of tokens that apply to
ALL CFs of ALL keyspaces.

[main reason for post below...]
OTOH, if you want to use OPP to purposely cluster certain data together on
specific nodes, such as for geographic partitioning, then you have to choose
a prefix for all of the keys of ALL CFs and ALL keyspaces!  This is because
they will all be partitioned based on the tokens assigned to the nodes.
  IOW, if I had two datacenters, one in the US and another in Europe, then
for all rows in all KSs and in all CFs, I would need to prepend a prefix to
the keys, such as "US:" and "EU:".  The problem is I may not want ALL of my
CFs to be partitioned this way; only specific ones.  Also, it may be very
difficult if not impossible for all keys of all keyspaces and CFs to use
keys of this form.  I'm not sure if Cass is designed for this.

However, if using the random partitioner, then there is no problem.  You can
use any key of any type you want (UTF8, Long, etc.) since they are all
hashed before deciding which node gets the key/row.

Do I understand things correctly or am I missing something?  Is Cass
designed to use OPP this way or am I hacking it?  If so, is there an
acceptable way to do geographic partitioning?

Also, what is OPP really good for?

Thanks!





cql/secondary indexes - select in

2011-06-14 Thread Bill
I was wondering if there are plans for (or any interest in) an IN 
operator for CQL/Secondary Indexes?


I have a use case to pull back N keys on an index and, rather than 
perform N selects, would like to do this:


SELECT ... WHERE KEY = keyname AND colname IN [val1,,..]

Bill




Re: repair and amount of transfers

2011-06-14 Thread Jonathan Ellis
that one's done for 0.8.1: https://issues.apache.org/jira/browse/CASSANDRA-2280

On Tue, Jun 14, 2011 at 5:56 AM, Peter Schuller
 wrote:
>> I just found Cassandra-2698 (I thought I had seen something about this)...
>
> There is also the other bug that causes repair to transfer data from
> all CF:s rather than just the one being repaired. This could be
> affecting you if you're doing repair of individual CF:s rather than
> everything at the same time.
>
> --
> / Peter Schuller
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: cql/secondary indexes - select in

2011-06-14 Thread Jonathan Ellis
We gave this a try in
https://issues.apache.org/jira/browse/CASSANDRA-2591 -- it turns out
it's not a good fit for the CQL QueryProcessor. We really need to be
able to push more complex queries to the index nodes
(https://issues.apache.org/jira/browse/CASSANDRA-1598).

So, we would still like to do this eventually but probably not soon
because of the difficulty involved.
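
Until then, the workaround is the N separate selects already mentioned, issued client-side and merged. A rough Thrift sketch, with a hypothetical column family "Events" and indexed column "colname", and no paging of results:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class InWorkaroundSketch {
    // Emulate "colname IN (v1, v2, ...)" by running one indexed query per value
    // and concatenating the rows that come back.
    public static List<KeySlice> selectIn(Cassandra.Client client, List<byte[]> values)
            throws Exception {
        ColumnParent parent = new ColumnParent("Events");          // hypothetical CF
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));

        List<KeySlice> rows = new ArrayList<KeySlice>();
        for (byte[] value : values) {
            IndexExpression eq = new IndexExpression();
            eq.setColumn_name("colname".getBytes("UTF-8"));         // the indexed column
            eq.setOp(IndexOperator.EQ);
            eq.setValue(value);

            IndexClause clause = new IndexClause();
            clause.addToExpressions(eq);
            clause.setStart_key(new byte[0]);
            clause.setCount(1000);

            rows.addAll(client.get_indexed_slices(parent, clause, predicate,
                                                  ConsistencyLevel.ONE));
        }
        return rows;
    }
}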

On Tue, Jun 14, 2011 at 7:08 AM, Bill  wrote:
> I was wondering if there are plans for (or any interest in) an IN operator
> for CQL/Secondary Indexes?
>
> I have a use case to pull back N keys on an index and rather than perform N
> selects would like to do this
>
> SELECT ... WHERE KEY = keyname AND colname IN [val1,,..]
>
> Bill
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


New web client & future API

2011-06-14 Thread Markus Wiesenbacher | Codefreun.de

Hi,

what is the future API for Cassandra? Thrift, Avro, CQL?

I just released an early version of my web client 
(http://www.codefreun.de/apollo) which is Thrift-based, and therefore I would 
like to know what the future is ...

Many thanks
MW


Re: New web client & future API

2011-06-14 Thread Sasha Dolgy
Your application is built with the thrift bindings and not with a
higher level client like Hector?

On Tue, Jun 14, 2011 at 3:42 PM, Markus Wiesenbacher | Codefreun.de
 wrote:
>
> Hi,
>
> what is the future API for Cassandra? Thrift, Avro, CQL?
>
> I just released an early version of my web client
> (http://www.codefreun.de/apollo) which is Thrift-based, and therefore I
> would like to know what the future is ...
>
> Many thanks
> MW
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: New web client & future API

2011-06-14 Thread Victor Kabdebon
Hello Markus,

Actually from what I understood (please correct me if I am wrong) CQL is
based on Thrift / Avro.

Victor Kabdebon

2011/6/14 Markus Wiesenbacher | Codefreun.de 

>
> Hi,
>
> what is the future API for Cassandra? Thrift, Avro, CQL?
>
> I just released an early version of my web client 
> (
> http://www.codefreun.de/apollo) which is Thrift-based, and therefore I
> would like to know what the future is ...
>
> Many thanks
> MW
>


Cassandra Statistics and Metrics

2011-06-14 Thread Marcos Ortiz

Regards to all.
My team and I here at the University are working on a generic solution 
for monitoring and capacity planning for open source databases, and one 
of the NoSQL databases that we chose to support is Cassandra.
Where can I find all the metrics and statistics of Cassandra? I'm 
thinking, for example, of:

- Available space
- Number of CFs
and all kinds of metrics.

We are using for this development: Python + Django + Twisted + Orbited + 
jQuery. The idea behind is to build a Comet-based web application on top 
of these technologies.

Any advice is welcome

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: New web client & future API

2011-06-14 Thread Markus Wiesenbacher | Codefreun.de
Yes, I wanted to start from the base ...


Am 14.06.2011 um 15:48 schrieb Sasha Dolgy :

> Your application is built with the thrift bindings and not with a
> higher level client like Hector?
> 
> On Tue, Jun 14, 2011 at 3:42 PM, Markus Wiesenbacher | Codefreun.de
>  wrote:
>> 
>> Hi,
>> 
>> what is the future API for Cassandra? Thrift, Avro, CQL?
>> 
>> I just released an early version of my web client
>> (http://www.codefreun.de/apollo) which is Thrift-based, and therefore I
>> would like to know what the future is ...
>> 
>> Many thanks
>> MW
>> 
> 
> 
> 
> -- 
> Sasha Dolgy
> sasha.do...@gmail.com


Re: Cassandra Statistics and Metrics

2011-06-14 Thread Viktor Jevdokimov
We're using the open source monitoring solution Zabbix from
http://www.zabbix.com/ with zapcat - not only for Cassandra but for the
whole system.

As the MX4J tools plugin is supported by Cassandra, support for zapcat in
Cassandra by default would be welcome - currently we have to use a wrapper to
start the zapcat agent.
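
If you only need the raw numbers, everything nodetool reports is available over JMX, so any stack with a JMX bridge can poll it directly. A minimal Java sketch, assuming the 0.8 default JMX port of 7199 and the org.apache.cassandra.db:type=StorageService MBean; the attribute names are assumptions best confirmed with jconsole.

import java.util.List;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPollSketch {
    public static void main(String[] args) throws Exception {
        // Default Cassandra 0.8 JMX port; adjust host/port for your cluster.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed MBean and attribute names; browse them with jconsole to confirm.
            ObjectName storage = new ObjectName("org.apache.cassandra.db:type=StorageService");
            String load = (String) mbs.getAttribute(storage, "LoadString");
            List<?> live = (List<?>) mbs.getAttribute(storage, "LiveNodes");
            Map<?, ?> loadMap = (Map<?, ?>) mbs.getAttribute(storage, "LoadMap");
            System.out.println("node load: " + load);
            System.out.println("live nodes: " + live);
            System.out.println("per-node load: " + loadMap);
        } finally {
            connector.close();
        }
    }
}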

2011/6/14 Marcos Ortiz 

> Regards to all.
> My team and me here on the University are working on a generic solution for
> Monitoring and Capacity Planning for Open Sources Databases, and one of the
> NoSQL db that we choosed to give it support is Cassandra.
> Where I can find all the metrics and statistics of Cassandra? I'm thinking
> for example:
> - Available space
> - Number of CF
> and all kind of metrics
>
> We are using for this development: Python + Django + Twisted + Orbited +
> jQuery. The idea behind is to build a Comet-based web application on top of
> these technologies.
> Any advice is welcome
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>
>


Re: Cassandra Statistics and Metrics

2011-06-14 Thread Marcos Ortiz

Where can I find the source code?

On 6/14/2011 10:13 AM, Viktor Jevdokimov wrote:
We're using open source monitoring solution Zabbix from 
http://www.zabbix.com/ using zapcat - not only for Cassandra but for 
the whole system.


As MX4J tools plugin is supported by Cassandra, support of zapcat in 
Cassandra by default is welcome - we have to use a wrapper to start 
zapcat agent.


2011/6/14 Marcos Ortiz <mlor...@uci.cu>:

Regards to all.
My team and me here on the University are working on a generic
solution for Monitoring and Capacity Planning for Open Sources
Databases, and one of the NoSQL db that we choosed to give it
support is Cassandra.
Where I can find all the metrics and statistics of Cassandra? I'm
thinking for example:
- Available space
- Number of CF
and all kind of metrics

We are using for this development: Python + Django + Twisted +
Orbited + jQuery. The idea behind is to build a Comet-based web
application on top of these technologies.
Any advice is welcome

-- 
Marcos Luís Ortíz Valmaseda

 Software Engineer (UCI)
http://marcosluis2186.posterous.com
http://twitter.com/marcosluis2186




--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: Cassandra Statistics and Metrics

2011-06-14 Thread Dan Kuebrich
Here's what people usually monitor from munin (and how they get at it):
https://github.com/jbellis/cassandra-munin-plugins .

Sounds a lot like what these guys are doing (even the stack?):
http://datadoghq.com/

On Tue, Jun 14, 2011 at 10:13 AM, Viktor Jevdokimov
wrote:

> We're using open source monitoring solution Zabbix from
> http://www.zabbix.com/ using zapcat - not only for Cassandra but for the
> whole system.
>
> As MX4J tools plugin is supported by Cassandra, support of zapcat in
> Cassandra by default is welcome - we have to use a wrapper to start zapcat
> agent.
>
>
> 2011/6/14 Marcos Ortiz 
>
>> Regards to all.
>> My team and me here on the University are working on a generic solution
>> for Monitoring and Capacity Planning for Open Sources Databases, and one of
>> the NoSQL db that we choosed to give it support is Cassandra.
>> Where I can find all the metrics and statistics of Cassandra? I'm thinking
>> for example:
>> - Available space
>> - Number of CF
>> and all kind of metrics
>>
>> We are using for this development: Python + Django + Twisted + Orbited +
>> jQuery. The idea behind is to build a Comet-based web application on top of
>> these technologies.
>> Any advice is welcome
>>
>> --
>> Marcos Luís Ortíz Valmaseda
>>  Software Engineer (UCI)
>>  http://marcosluis2186.posterous.com
>>  http://twitter.com/marcosluis2186
>>
>>
>
>


Re: Cassandra Statistics and Metrics

2011-06-14 Thread Marcos Ortiz
We are thinking of a Web 2.0 application, and Munin was not built with these 
thoughts in mind.

I will be reviewing the datadoghq site.
Regards

On 6/14/2011 10:23 AM, Dan Kuebrich wrote:
Here's what people usually monitor from munin (and how they get at 
it): https://github.com/jbellis/cassandra-munin-plugins .


Sounds a lot like what these guys are doing (even the stack?): 
http://datadoghq.com/


On Tue, Jun 14, 2011 at 10:13 AM, Viktor Jevdokimov <vjevdoki...@gmail.com> wrote:


We're using open source monitoring solution Zabbix from
http://www.zabbix.com/ using zapcat - not only for Cassandra but
for the whole system.

As MX4J tools plugin is supported by Cassandra, support of zapcat
in Cassandra by default is welcome - we have to use a wrapper to
start zapcat agent.


2011/6/14 Marcos Ortiz <mlor...@uci.cu>:

Regards to all.
My team and me here on the University are working on a generic
solution for Monitoring and Capacity Planning for Open Sources
Databases, and one of the NoSQL db that we choosed to give it
support is Cassandra.
Where I can find all the metrics and statistics of Cassandra?
I'm thinking for example:
- Available space
- Number of CF
and all kind of metrics

We are using for this development: Python + Django + Twisted +
Orbited + jQuery. The idea behind is to build a Comet-based
web application on top of these technologies.
Any advice is welcome

-- 
Marcos Luís Ortíz Valmaseda

 Software Engineer (UCI)
http://marcosluis2186.posterous.com
http://twitter.com/marcosluis2186





--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Cassandra scaling problem in virtualized environment

2011-06-14 Thread Schuilenga, Jan Taeke
Hi All, 

We are having issues testing Cassandra in a virtualized environment
(Vmware ESX). 
Our challenge is to combine a  high number of concurrent users with a
very low maximum response time. 
Immediately we ran into a problem with scalability, where our performance
(Trx per sec) unexpectedly degrades after adding nodes, without
overcommitting host CPU resources too much as far as we can tell.
Therefore we are looking for best practices, or anybody with experience
running Cassandra in a similar environment, to help us.
So far we have only found the following article, which hasn't helped so far:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Node-added-no-performance-boost-are-the-tokens-correct-td6228872.html

Our current test setup using the java version of the cassandra load tool
consists of:
Hardware: Vmware ESX cluster with IBM 3850 (4x dual core cpu) Hosts on
Compellent FC SAN for storage 
Cassandra: 3-6 node 2vCpu Centos guest boxes (RF=2) 

Jan-Taeke Schuilenga


RE: Docs: "Why do deleted keys show up during range scans?"

2011-06-14 Thread Jeremiah Jordan
I am pretty sure how Cassandra works will make sense to you if you think
of it this way: rows do not get deleted, columns get deleted.
While you can delete a row, if I understand correctly, what happens is that a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.  A row key will not be
forgotten/deleted until there are no columns or tombstones which
reference it.  Until there are no references to that row key in any
SSTable, you can still get that key back from the API.
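
To see this from the client side, here is a hedged sketch against the raw Thrift API (the column family name "Users" is hypothetical): a range scan can hand back the key of a fully deleted row with an empty column list until the tombstones are gone, so callers typically skip those.

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class RangeScanSketch {
    public static void printLiveRows(Cassandra.Client client) throws Exception {
        ColumnParent parent = new ColumnParent("Users");     // hypothetical CF
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 10));

        KeyRange range = new KeyRange();
        range.setStart_key(new byte[0]);
        range.setEnd_key(new byte[0]);
        range.setCount(100);

        List<KeySlice> slices =
                client.get_range_slices(parent, predicate, range, ConsistencyLevel.QUORUM);
        for (KeySlice slice : slices) {
            // Rows whose columns were all deleted can still show up here with
            // zero columns until gc_grace_seconds has passed and compaction ran.
            if (slice.getColumns().isEmpty()) {
                continue;
            }
            System.out.println("live row key: " + new String(slice.getKey(), "UTF-8"));
        }
    }
}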

-Jeremiah

-Original Message-
From: AJ [mailto:a...@dude.podzone.net] 
Sent: Monday, June 13, 2011 12:11 PM
To: user@cassandra.apache.org
Subject: Re: Docs: "Why do deleted keys show up during range scans?"

On 6/13/2011 10:14 AM, Stephen Connolly wrote:
>
> store the query inverted.
>
> that way empty ->  deleted
>
I don't know what that means... get the other columns?  Can you
elaborate?  Is there docs for this or is this a hack/workaround?

> the tombstones are stored for each column that had data IIRC... but at

> this point my grok of C* is lacking
I suspected this, but wasn't sure.  It sounds like when a row is
deleted, a tombstone is not "attached" to the row, but to each column???
So, if all columns are deleted then the row is considered deleted?
Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)


RE: Docs: "Why do deleted keys show up during range scans?"

2011-06-14 Thread Jeremiah Jordan
Also, tombstones are not "attached" anywhere.  A tombstone is just a
column with a special value which says "I was deleted".  And I am pretty
sure they go into SSTables etc. the exact same way regular columns do.

-Original Message-
From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com] 
Sent: Tuesday, June 14, 2011 11:22 AM
To: user@cassandra.apache.org
Subject: RE: Docs: "Why do deleted keys show up during range scans?"

I am pretty sure how Cassandra works will make sense to you if you think
of it that way, that rows do not get deleted, columns get deleted.
While you can delete a row, if I understand correctly, what happens is a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.  A row key will not be
forgotten/deleted until there are no columns or tombstones which
reference it.  Until there are no references to that row key in any
SSTables you can still get that key back from the API.

-Jeremiah

-Original Message-
From: AJ [mailto:a...@dude.podzone.net]
Sent: Monday, June 13, 2011 12:11 PM
To: user@cassandra.apache.org
Subject: Re: Docs: "Why do deleted keys show up during range scans?"

On 6/13/2011 10:14 AM, Stephen Connolly wrote:
>
> store the query inverted.
>
> that way empty ->  deleted
>
I don't know what that means... get the other columns?  Can you
elaborate?  Is there docs for this or is this a hack/workaround?

> the tombstones are stored for each column that had data IIRC... but at

> this point my grok of C* is lacking
I suspected this, but wasn't sure.  It sounds like when a row is
deleted, a tombstone is not "attached" to the row, but to each column???
So, if all columns are deleted then the row is considered deleted?
Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)


Re: Cassandra scaling problem in virtualized environment

2011-06-14 Thread Ryan King
On Tue, Jun 14, 2011 at 8:16 AM, Schuilenga, Jan Taeke
 wrote:
> Hi All,
>
> We are having issues testing Cassandra in a virtualized environment (Vmware
> ESX).
> Our challenge is to combine a  high number of concurrent users with a very
> low maximum response time.
> Immediately we ran into a problem with scalability where our performance
> (Trx per sec) unexpectedly degrades after adding nodes without
> overcommitting host cpu resources too much as far as we can tell.
>
> Therefore we are looking for bestpractices or anybody with experiences with
> cassandra in a similar environment to help us.
>
> So far we only found the following article which hasn’t helped so far:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Node-added-no-performance-boost-are-the-tokens-correct-td6228872.html
>
> Our current test setup using the java version of the cassandra load tool
> consists of:
> Hardware: Vmware ESX cluster with IBM 3850 (4x dual core cpu) Hosts on
> Compellent FC SAN for storage
> Cassandra: 3-6 node 2vCpu Centos guest boxes (RF=2)

This hardware profile isn't ideal for cassandra. You'll likely see
much better performance for your money on commodity hardware.

-ryan


Re: Migration question

2011-06-14 Thread Eric Czech
Thanks Aaron.  I'll make sure to copy the system tables.

Another thing -- do you have any suggestions on raid configurations for main
data drives?  We're looking at RAID5 and 10 and I can't seem to find a
convincing argument one way or the other.

Thanks again for your help.

On Mon, Jun 6, 2011 at 5:45 AM, aaron morton wrote:

> Sounds like you are OK to turn off the existing cluster first.
>
> Assuming so, deliver any hints using JMX then do a nodetool flush to write
> out all the memtables and checkpoint the commit logs. You can then copy the
> data directories.
>
> The System data directory contains the nodes token and the schema, you will
> want to copy this directory. You may also want to copy the cassandra.yaml or
> create new ones with the correct initial tokens.
>
> The nodes will sort themselves out when they start up and get new IP's, the
> important thing to them is the token.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6 Jun 2011, at 23:25, Eric Czech wrote:
>
> > Hi, I have a quick question about migrating a cluster.
> >
> > We have a cassandra cluster with 10 nodes that we'd like to move to a new
> DC and what I was hoping to do is just copy the SSTables for each node to a
> corresponding node in the new DC (the new cluster will also have 10 nodes).
>  Is there any reason that a straight file copy like this wouldn't work?  Do
> any system tables need to be moved as well or is there anything else that
> needs to be done?
> >
> > Thanks!
>
>


Re: one way to make counter delete work better

2011-06-14 Thread Sylvain Lebresne
Who assigns those epoch numbers?
You need all nodes to agree on the epoch number somehow to have this work,
but then how do you maintain those in a partition-tolerant distributed system?

I may have missed some parts of your proposal, but let me consider a scenario
that we have to be able to handle: consider two nodes A and B (RF=2), each in
one data center (DCA and DCB), and a counter c. Suppose you do a +2 increment
on c that both nodes get. Now let's say you have a network split and the
connection between your two data centers fails. In DCA you delete c; only A
gets it. In DCB, you do more increments on c (say +3); only B gets it. The
partition can last for hours.
For deletion to work, we would need that whenever the network partition is
resolved, both nodes eventually agree on the value 3 (i.e., only the second
increment). I don't see how you could assign epoch numbers or anything else
to fix that.

--
Sylvain

On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
> ok, I think it's better to understand it this way, then it is really simple
> and intuitive:
> my proposed way of counter update can be simply seen as a combination of
> regular columns + current counter columns:
> regular column :  [ value: "wipes out every bucket to nil"   , clock: epoch
> number]
> then within each epoch, counter updates work as currently implemented
>
>
> On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
>>
>> I think this approach also works for your scenario:
>> I thought that the issue is only concerned with merging within the same
>> leader; but you pointed out
>> that a similar merging happens between leaders too, now I see that the
>> same rules on epoch number
>> also applies to inter-leader data merging, specifically in your case:
>>
>> everyone starts with epoch of 0, ( they should be same, if not, it also
>> works, we just consider them to be representing diffferent time snapshots of
>> the same counter state)
>> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
>> node A      delete    clock:  0.200
>> node B     add 2     clock:  0.300
>> node A    gets B's state:  add 2 clock 0.300, but rejects it because A has
>> already produced a delete, with epoch of 0, so A considers epoch 0 already
>> ended, it won't accept any replicated state with epoch < 1.
>> node B    gets A's delete  0.200,  it zeros its own count of "2", and
>> updates its future expected epoch to 1.
>> at this time, the state of system is:
>> node A     expected epoch =1  [A:nil] [B:nil]
>> same for node B
>>
>>
>> let's say we have following further writes:
>> node B  add 3  clock  1.400
>> node A adds 4  clock 1.500
>> node B receives A's add 4,   node B updates its copy of A
>> node A receives B's add 3,    updates its copy of B
>>
>> then state is:
>> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
>> node B same
>>
>>
>> generally I think it should be complete if we add the following rule for
>> inter-leader replication:
>> each leader keeps a var in memory (and also persist to sstable when
>> flushing)  expected_epoch , initially set to 0
>> node P does:
>> on receiving updates from  node Q
>>         if Q.expected_epoch > P.expected_epoch
>>               /** an epoch bump inherently means a previous delete, which
>> we probably missed , so we need to apply the delete
>>                   a delete is global to all leaders, so apply it on all my
>> replicas **/
>>              for all leaders in my vector
>>                   count = nil
>>
>>              P.expected_epoch =  Q.expected_epoch
>>         if Q.expected_epoch == P.expected_epoch
>>              update P's copy of Q according to standard rules
>>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is less
>> up to date than us, just ignore
>>
>> replicate_on_write(to Q):
>>       if  P.operation == delete
>>             P.expected_epoch ++
>>             set all my copies of all leaders to nil
>>       send to Q ( P.total , P.expected_epoch)
>>
>>
>>
>> overall I don't think delete being not commutative is a fundamental
>> blocker : regular columns are also not commutative, yet we achieve stable
>> result no matter what order they are applied, because of the ordering rule
>> used in reconciliation; here we just need to find a similar ordering rule.
>> the epoch thing could be a step on this direction.
>>
>> Thanks
>> Yang
>>
>>
>>
>> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis  wrote:
>>>
>>> I don't think that's bulletproof either.  For instance, what if the
>>> two adds go to replica 1 but the delete to replica 2?
>>>
>>> Bottom line (and this was discussed on the original
>>> delete-for-counters ticket,
>>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
>>> are not fully commutative which makes them fragile.
>>>
>>> On Mon, Jun 13, 2011 at 10:54 AM, Yang  wrote:
>>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>>> > indicates, the problem with counter delete is  in scenarios like the
>>> > following:
>>

Re: possible 'coming back to life' bug with counters

2011-06-14 Thread Sylvain Lebresne
As listed here: http://wiki.apache.org/cassandra/Counters, counter deletion is
provided as a convenience for permanent deletion of counters but, because
of the design of counters, it is never safe to issue an increment on a counter
that has been deleted (that is, you will sometimes experience back-to-life
behavior in that case).
More precisely, you'd have to wait long enough after a deletion before starting
to increment the counter again. But in the worst cases, "long enough" is
something like gc_grace_seconds + a major compaction.

This is *not* something that is likely to change anytime soon (I don't
think this is
fixable with the current design for counters).

--
Sylvain

On Sat, Jun 11, 2011 at 3:54 AM, David Hawthorne  wrote:
> Please take a look at this thread over in the hector-users mailing list:
> http://groups.google.com/group/hector-users/browse_thread/thread/99835159b9ea1766
> It looks as if the deleted columns are coming back to life when they
> shouldn't be.
> I don't want to open a bug on something if it's already got one that I just
> couldn't find when I scanned the list of open bugs.
> I'm using hector 0.8 against cassandra 0.8 release.  I can give you whatever
> logs or files you'd like.


Re: one way to make counter delete work better

2011-06-14 Thread Milind Parikh
If I understand this correctly, then the epoch integer would be generated by
each node. Since time always flows forward, the assumption would be, I
suppose, that the epochs would be tagged with the node that generated them
and additionally the counter would carry as much history as necessary (and
presumably not all history at all times).

Milind


On Tue, Jun 14, 2011 at 2:21 PM, Sylvain Lebresne wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time snapshots of
> >> the same counter state)
> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
> >> node A      delete    clock:  0.200
> >> node B     add 2     clock:  0.300
> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A has
> >> already produced a delete, with epoch of 0, so A considers epoch 0 already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A     expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,    updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >> if Q.expected_epoch > P.expected_epoch
> >>   /** an epoch bump inherently means a previous delete,
> which
> >> we probably missed , so we need to apply the delete
> >>   a delete is global to all leaders, so apply it on all
> my
> >> replicas **/
> >>  for all leaders in my vector
> >>   count = nil
> >>
> >>  P.expected_epoch =  Q.expected_epoch
> >> if Q.expected_epoch == P.expected_epoch
> >>  update P's copy of Q according to standard rules
> >> /** if Q.expected_epoch < P.expected_epoch  , that means Q is
> less
> >> up to date than us, just ignore
> >>
> >> replicate_on_write(to Q):
> >>   if  P.operation == delete
> >> P.expected_epoch ++
> >> set all my copies of all leaders to nil
> >>   send to Q ( P.total , P.expected_epoch)
> >>
> >>
> >>
> >> overall I don't think delete being not commutative is a fundamental
> >> blocker : regular columns are also not commutative, yet we achieve
> stable
> >> result no matter what order they are applied, because of the ordering
> rule
> >> used in reconciliation; here we just need to find a similar ordering
> rule.
> >> the epoch thing could be a step on this direction.
> >>
> >> Thanks
> >> 

bring out your rpms...

2011-06-14 Thread Colin
Does anyone know where an rpm for 0.7.6-2 might be? (rhel)

 

I checked the datastax site and only see up to 0.7.6-1



Where is my data?

2011-06-14 Thread AJ
Is there an official deterministic formula to compute the various 
subsets of a given cluster that comprise a complete set of data 
(redundant rows OK)?  IOW, if multiple nodes become unavailable, one at a 
time, at what point can I say <100% of my data is available?


Obviously, the method would have to take into consideration the ring 
layout along with the partitioner type, the number of nodes, 
replication_factor, replication strategy, etc.


Thanks!


Re: get_indexed_slices ~ simple map-reduce

2011-06-14 Thread aaron morton
yes, just like a SELECT in SQL. With a better index match there is less data 
read off disk, fewer filter loops, and a faster query.

btw, the read path in cassandra is generally non-deterministic. It varies with 
respect to how many mutations the key has received over time, and how efficient 
the compaction process has been. Generally older rows will have more 
predictable performance.  Something I wrote once about the read and write path: 
http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/
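
To make that concrete, a hedged Thrift sketch of the kind of query under discussion, with a hypothetical "Events" column family and column names: one equality expression on the indexed column drives the scan, the second expression is applied as a filter over the candidate rows, and naming the columns in the SlicePredicate keeps the per-row read close to just those columns plus the ones referenced in the expressions.

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;

public class IndexedQuerySketch {
    public static List<KeySlice> todaysBigEvents(Cassandra.Client client) throws Exception {
        ColumnParent parent = new ColumnParent("Events");            // hypothetical CF

        // Equality expression on the indexed column; how many rows match this
        // is what drives the query cost, per the discussion above.
        IndexExpression byDay = new IndexExpression();
        byDay.setColumn_name("day".getBytes("UTF-8"));
        byDay.setOp(IndexOperator.EQ);
        byDay.setValue("20110614".getBytes("UTF-8"));

        // Additional expression, applied as a filter over the candidate rows.
        IndexExpression bySize = new IndexExpression();
        bySize.setColumn_name("size".getBytes("UTF-8"));
        bySize.setOp(IndexOperator.GT);
        bySize.setValue(ByteBuffer.allocate(8).putLong(1000L).array());

        IndexClause clause = new IndexClause();
        clause.setExpressions(Arrays.asList(byDay, bySize));
        clause.setStart_key(new byte[0]);
        clause.setCount(500);

        // Naming the columns keeps the read close to just these plus the
        // columns referenced in the expressions.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setColumn_names(Arrays.asList(
                ByteBuffer.wrap("day".getBytes("UTF-8")),
                ByteBuffer.wrap("size".getBytes("UTF-8")),
                ByteBuffer.wrap("payload".getBytes("UTF-8"))));

        return client.get_indexed_slices(parent, clause, predicate, ConsistencyLevel.ONE);
    }
}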

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 14 Jun 2011, at 20:25, Michal Augustýn wrote:

> Thank you!
> 
> I have one more question ;-) If I use regular "get" function then I
> can be sure that it takes ~5ms. So I suppose that if I use
> "get_indexed_slices" function then the response time depends on how
> many rows match the most selected equality predicate. Am I right?
> 
> Augi
> 
> 2011/6/14 aaron morton :
>> From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...
>> 
>> Candidate rows are first read by applying the most selected equality 
>> predicate.
>> 
>> From those candidate rows...
>> 
>> 1) If the SlicePredicate has a SliceRange the query execution will read all 
>> columns for the candidate row  if the byte size of the largest tracked row 
>> is less than column_index_size_in_kb config setting (defaults to 64K). 
>> Meaning if no more than 1 column index page of columns is (probably) going 
>> to be read, they will all be read.
>> 
>> 2) Otherwise if the query will read the columns specified by the SliceRange.
>> 
>> 3) If the SlicePredicate uses a list of columns names, those columns and the 
>> ones referenced in the IndexExpressions (except the one selected as the 
>> primary pivot above) are read from disk.
>> 
>> If additional columns are needed (in case 2 above) they are read in a 
>> separate reads from the candidate row.
>> 
>> Then when applying the SlicePredicate to produce the final projection into 
>> the result set, all the columns required to satisfy the filter will be in 
>> memory.
>> 
>> 
>> So, yes it reads just the columns from disk that you ask for, unless it 
>> thinks it will take no more work to read more.
>> 
>> Hope that helps.
>> 
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 13 Jun 2011, at 08:34, Michal Augustýn wrote:
>> 
>>> Hi,
>>> 
>>> as I wrote, I don't want to install Hadoop etc. - I want just to use
>>> the Thrift API. The core of my question is how does get_indexed_slices
>>> function work.
>>> 
>>> I know that it must get all keys using equality expression firstly -
>>> but what about additional expressions? Does Cassandra fetch whole
>>> filtered rows, or just columns used in additional filtering
>>> expression?
>>> 
>>> Thanks!
>>> 
>>> Augi
>>> 
>>> 2011/6/12 aaron morton :
 Not exactly sure what you mean here, all data access is through the thrift
 API unless you code java and embed cassandra in your app.
 As well as Pig support there is also Hive support in brisk (which will also
 have Pig support soon) http://www.datastax.com/products/brisk
 Can you provide some more info on the use case ? Personally if you have a
 read query you know you need to support, I would consider supporting it in
 the data model without secondary indexes.
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
 
 Hi all,
 
 I'm thinking of get_indexed_slices function as a simple map-reduce job
 (that just maps) - am I right?
 
 Well, I would like to be able to run simple queries on values but I
 don't want to install Hadoop, write map-reduce jobs in Java (the whole
 application is in C# and I don't want to introduce new development
 stack - maybe Pig would help) and have some second interface to
 Cassandra (in addition to Thrift). So secondary indexes seem to be
 rescue for me. I would have just one indexed column that will have
 day-timestamp value (~100k items per day) and the equality expression
 for this column would be in each query (and I would add more ad-hoc
 expressions).
 Will this scenario work or is there some issue I could run in?
 
 Thanks!
 
 Augi
 
 
>> 
>> 



Re: bring out your rpms...

2011-06-14 Thread Nate McCall
The 0.7.6-2 release was made over *-1 specifically to correct an issue
with debian packaging.

This keeps coming up though, so I'll probably just go ahead and roll a
0.7.6-2 for rpm.datastax.com so as not to confuse folks.


On Tue, Jun 14, 2011 at 4:19 PM, Colin  wrote:
> Does anyone know where an rpm for 0.7.6-2 might be? (rhel)
>
>
>
> I checked the datastax site and only see up to 0.7.6-1


Re: bring out your rpms...

2011-06-14 Thread Konstantin Naryshkin
You could try to roll your own. I managed to create a custom 0.8 RPM using the 
spec file from the redhat directory. First check out the source. Then edit the 
spec file with the following changes: 
Set the Version and Release variables appropriately. 
At the end of %install, add the following 2 lines: 
cp -p build/apache-cassandra-cql-*.jar %{buildroot}/usr/share/%{username}/lib 
cp -p build/apache-cassandra-thrift-*.jar 
%{buildroot}/usr/share/%{username}/lib 

To build the RPM (assuming that your system has appropriate RPM creation tools 
installed and configured): 
cd [ROOT OF THE CASSANDRA SOURCE] 
ant 
ant release 
[at this point you may want to go get a cup of coffee, since it takes me about 
10-15 minutes to build the release] 
copy apache-cassandra-*-src.tar.gz into your rpm SOURCE directory 
if the above tar contained SNAPSHOT in its name, rename it to remove the phrase 
'-SNAPSHOT' from both the name of the tar and the name of the directory inside 
the tar 
cd [ROOT OF THE CASSANDRA SOURCE]/redhat 
rpmbuild -ba apache-cassandra.spec 

Your rpm will be in your rpm RPMS directory. 

- Original Message -
From: "Colin"  
To: user@cassandra.apache.org 
Sent: Tuesday, June 14, 2011 9:19:17 PM 
Subject: bring out your rpms... 




Does anyone know where an rpm for 0.7.6-2 might be? (rhel) 



I checked the datastax site and only see up to 0.7.6-1 

Re: New web client & future API

2011-06-14 Thread aaron morton
AFAIK...

Avro is dead. 

Thrift is the current API and currently the only full featured API. 

CQL is a possible future API; given community support and development time it 
may become the only API. The initial release is not feature-complete (e.g. it is 
missing some DDL statements) and still uses Thrift as the wire protocol. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jun 2011, at 02:01, Markus Wiesenbacher | Codefreun.de wrote:

> Yes, I wanted to start from the base ...
> 
> 
> Am 14.06.2011 um 15:48 schrieb Sasha Dolgy :
> 
>> Your application is built with the thrift bindings and not with a
>> higher level client like Hector?
>> 
>> On Tue, Jun 14, 2011 at 3:42 PM, Markus Wiesenbacher | Codefreun.de
>>  wrote:
>>> 
>>> Hi,
>>> 
>>> what is the future API for Cassandra? Thrift, Avro, CQL?
>>> 
>>> I just released an early version of my web client
>>> (http://www.codefreun.de/apollo) which is Thrift-based, and therefore I
>>> would like to know what the future is ...
>>> 
>>> Many thanks
>>> MW
>>> 
>> 
>> 
>> 
>> -- 
>> Sasha Dolgy
>> sasha.do...@gmail.com



RE: bring out your rpms...

2011-06-14 Thread Colin
Thanks Nate.  I appreciate it.

-Original Message-
From: Nate McCall [mailto:n...@datastax.com] 
Sent: Tuesday, June 14, 2011 4:52 PM
To: user@cassandra.apache.org
Subject: Re: bring out your rpms...

The 0.7.6-2 release was made over *-1 specifically to correct an issue with
debian packaging.

This keeps coming up though, so I'll probably just go ahead and roll a
0.7.6-2 for rpm.datastax.com so as not to confuse folks.


On Tue, Jun 14, 2011 at 4:19 PM, Colin  wrote:
> Does anyone know where an rpm for 0.7.6-2 might be? (rhel)
>
>
>
> I checked the datastax site and only see up to 0.7.6-1



Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-14 Thread aaron morton
> While you can delete a row, if I understand correctly, what happens is a
> tombstone is created which matches every column, so in effect it is
> deleting the columns, not the whole row. 

A tombstone is created at the level of the delete, rather than for every 
column. Otherwise imagine deleting a row with 1 million columns.

Tombstones are created at the Column, Super Column and Row level. Deleting at 
the row level writes a row level tombstone. All these different tombstones are 
resolved during the read process. 
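
For example (a pycassa sketch with hypothetical keyspace/CF names), the level 
of the tombstone simply follows the level of the delete call:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')

users.remove('bob', columns=['last_login'])  # column-level tombstone only
users.remove('bob')                          # row-level tombstone for the whole row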

My understanding of "So to special case leaving out result entries for 
deletions, we would have to check the entire rest of the row to make sure there 
is no undeleted data anywhere else either (in which case leaving the key out 
would be an error)." is...

Resolving the predicate to determine if a row contains the specified columns is 
a (somewhat) bounded operation. Determining if a row has ANY non-deleted columns 
is a potentially unbounded operation that could involve lots-o-IO.  Imagine a 
row with 1 million columns where the first 100,000 have been deleted. 

For each row in the result set you can say either :
 
1) It has 1 or more of the columns I requested.
2) It has none of the columns I requested. 
3) it has no columns, but cassandra decided it was too much work to 
conclusively prove that. Because after all I asked if it had some specific 
columns not if it had any columns.  

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jun 2011, at 04:25, Jeremiah Jordan wrote:

> Also, tombstone's are not "attached" anywhere.  A tombstone is just a
> column with special value which says "I was deleted".  And I am pretty
> sure they go into SSTables etc the exact same way regular columns do.
> 
> -Original Message-
> From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com] 
> Sent: Tuesday, June 14, 2011 11:22 AM
> To: user@cassandra.apache.org
> Subject: RE: Docs: "Why do deleted keys show up during range scans?"
> 
> I am pretty sure how Cassandra works will make sense to you if you think
> of it that way, that rows do not get deleted, columns get deleted.
> While you can delete a row, if I understand correctly, what happens is a
> tombstone is created which matches every column, so in effect it is
> deleting the columns, not the whole row.  A row key will not be
> forgotten/deleted until there are no columns or tombstones which
> reference it.  Until there are no references to that row key in any
> SSTables you can still get that key back from the API.
> 
> -Jeremiah
> 
> -Original Message-
> From: AJ [mailto:a...@dude.podzone.net]
> Sent: Monday, June 13, 2011 12:11 PM
> To: user@cassandra.apache.org
> Subject: Re: Docs: "Why do deleted keys show up during range scans?"
> 
> On 6/13/2011 10:14 AM, Stephen Connolly wrote:
>> 
>> store the query inverted.
>> 
>> that way empty ->  deleted
>> 
> I don't know what that means... get the other columns?  Can you
> elaborate?  Is there docs for this or is this a hack/workaround?
> 
>> the tombstones are stored for each column that had data IIRC... but at
> 
>> this point my grok of C* is lacking
> I suspected this, but wasn't sure.  It sounds like when a row is
> deleted, a tombstone is not "attached" to the row, but to each column???
> So, if all columns are deleted then the row is considered deleted?
> Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)



Docs: Token Selection

2011-06-14 Thread AJ

This http://wiki.apache.org/cassandra/Operations#Token_selection  says:

"With NetworkTopologyStrategy, you should calculate the tokens the nodes 
in each DC independantly."


and gives the example:

DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865


So, according to the above, the token ranges would be (abbreviated nums):

DC1
node 1 = 0  Range: (8..4, 16], (0, 0]
node 2 = 8..4   Range: (0, 8..4]

DC2
node 3 = 1  Range: (8..5, 16], (0, 1]
node 4 = 8..5   Range: (1, 8..5]


If the above is correct, then I would be surprised, as this paragraph is 
the only place where one would discover this, and it may be easy to miss... 
unless there's a doc buried somewhere in plain view that I missed.


So, have I interpreted this paragraph correctly?  Was this designed to 
help keep data somewhat localized if that was important, such as for a 
geographically dispersed DC?


Thanks!


Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-14 Thread AJ

Thanks, but right now I'm thinking, RTFC ;o)

On 6/14/2011 4:37 PM, aaron morton wrote:

While you can delete a row, if I understand correctly, what happens is a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.

A tombstone is created at the level of the delete, rather than for every 
column. Otherwise imagine deleting a row with 1 million columns.

Tombstones are created at the Column, Super Column and Row level. Deleting at 
the row level writes a row level tombstone. All these different tombstones are 
resolved during the read process.

My understanding of "So to special case leaving out result entries for deletions, we 
would have to check the entire rest of the row to make sure there is no undeleted data 
anywhere else either (in which case leaving the key out would be an error)." is...

Resolving the predicate to determine if a row contains the specified columns is 
a (somewhat) bound operation. Determining if a row as ANY non deleted columns 
is a potentially unbound operation that could involve lots-o-io .  Imagine a 
row with 1 million columns, and the first 100,000 have been deleted.

For each row in the result set you can say either :

1) It has 1 or more of the columns I requested.
2) It has none of the columns I requested.
3) it has no columns, but cassandra decided it was too much work to 
conclusively prove that. Because after all I asked if it had some specific 
columns not if it had any columns.

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jun 2011, at 04:25, Jeremiah Jordan wrote:


Also, tombstone's are not "attached" anywhere.  A tombstone is just a
column with special value which says "I was deleted".  And I am pretty
sure they go into SSTables etc the exact same way regular columns do.

-Original Message-
From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com]
Sent: Tuesday, June 14, 2011 11:22 AM
To: user@cassandra.apache.org
Subject: RE: Docs: "Why do deleted keys show up during range scans?"

I am pretty sure how Cassandra works will make sense to you if you think
of it that way, that rows do not get deleted, columns get deleted.
While you can delete a row, if I understand correctly, what happens is a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.  A row key will not be
forgotten/deleted until there are no columns or tombstones which
reference it.  Until there are no references to that row key in any
SSTables you can still get that key back from the API.

-Jeremiah

-Original Message-
From: AJ [mailto:a...@dude.podzone.net]
Sent: Monday, June 13, 2011 12:11 PM
To: user@cassandra.apache.org
Subject: Re: Docs: "Why do deleted keys show up during range scans?"

On 6/13/2011 10:14 AM, Stephen Connolly wrote:

store the query inverted.

that way empty ->   deleted


I don't know what that means... get the other columns?  Can you
elaborate?  Is there docs for this or is this a hack/workaround?


the tombstones are stored for each column that had data IIRC... but at
this point my grok of C* is lacking

I suspected this, but wasn't sure.  It sounds like when a row is
deleted, a tombstone is not "attached" to the row, but to each column???
So, if all columns are deleted then the row is considered deleted?
Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)






Re: Docs: Token Selection

2011-06-14 Thread Vijay
Yes... That's right...  If you are trying to say the below...

DC1
Node1 Owns 50%

(Ranges 8..4 -> 8..5 & 8..5 -> 0)

Node2 Owns 50%

(Ranges 0 -> 1 & 1 -> 8..4)


DC2
Node1 Owns 50%

(Ranges 8..5 -> 0 & 0 -> 1)

Node2 Owns 50%

(Ranges 1 -> 8..4 & 8..4 -> 8..5)
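
For completeness, a small sketch of how those per-DC tokens can be computed 
(plain Python; assumes RandomPartitioner's 0..2**127 token space, with each 
subsequent DC offset by 1 as in the wiki example):

# Compute tokens independently for each data center, then offset the second
# DC by +1 so no two nodes end up with exactly the same token.
RING = 2 ** 127

def dc_tokens(node_count, dc_offset):
    return [(i * RING // node_count + dc_offset) % RING for i in range(node_count)]

print(dc_tokens(2, 0))   # DC1: [0, 85070591730234615865843651857942052864]
print(dc_tokens(2, 1))   # DC2: [1, 85070591730234615865843651857942052865]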


Regards,




On Tue, Jun 14, 2011 at 3:47 PM, AJ  wrote:

> This http://wiki.apache.org/cassandra/Operations#Token_selection  says:
>
> "With NetworkTopologyStrategy, you should calculate the tokens the nodes in
> each DC independantly."
>
> and gives the example:
>
> DC1
> node 1 = 0
> node 2 = 85070591730234615865843651857942052864
>
> DC2
> node 3 = 1
> node 4 = 85070591730234615865843651857942052865
>
>
> So, according to the above, the token ranges would be (abbreviated nums):
>
> DC1
> node 1 = 0  Range: (8..4, 16], (0, 0]
> node 2 = 8..4   Range: (0, 8..4]
>
> DC2
> node 3 = 1  Range: (8..5, 16], (0, 1]
> node 4 = 8..5   Range: (1, 8..5]
>
>
> If the above is correct, then I would be surprised as this paragraph is the
> only place were one would discover this and may be easy to miss... unless
> there's a doc buried somewhere in plain view that I missed.
>
> So, have I interpreted this paragraph correctly?  Was this design to help
> keep data somewhat localized if that was important, such as a geographically
> dispersed DC?
>
> Thanks!
>


When does it make sense to use TimeUUID?

2011-06-14 Thread Sameer Farooqui
I would like to store some timestamped user info in a Column Family with the
usernames as the row key and different timestamps as column names. Each user
might have a thousand timestamped entries.

I understand that for version 1 UUIDs, Cassandra combines the MAC address of
the computer generating the UUID with the number of 100-nanosecond intervals
since the beginning of the Gregorian calendar.

So, if user1 had data stored for an event at Jan 30, 2011/2:15pm and user2
had an event at the exact same time, the data could potentially be stored in
different column names? So, I would have to know the MAC of the generating
computer in order to do a column slice, right?

When does it make sense to use TimeUUID vs just a time string like
20110130141500 and comparator type UTF8?

- Sameer


RE: When does it make sense to use TimeUUID?

2011-06-14 Thread Kevin
TimeUUIDs should be used for data that is time-based and requires
uniqueness.

 

TimeUUID comparisons compare the time-based portion of the UUID. So no, you
do not need to know the MAC addresses. In fact, for languages that cannot
get to that low of a level to access a MAC address (like Java), the timeUUID
tools generate random data for that part of the UUID.

 

I don't understand your "user1"/"user2" scenario. The timeUUIDs in that
scenario wouldn't even come in to question because the columns would be in
two different rows since they pertain to two different users (unless they
are referencing some other Column Family where those same TimeUUIDs are rows
or columns in the same row).

 

A good example of something that timeUUIDs would be great for would be
friend requests. Regular time-based strings would not be sufficient in this
case since it's possible that two requests can be sent from two different
computers at the same time. Thus, you can store the requests as columns (or
super columns) each named by a timeUUID. Later (assuming you've chosen to
let Cassandra sort the columns or supercolumns by timeUUID), you can fetch
all the requests for a given user in either most-recent, or least-recent
order.  
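
As a small illustration of the ordering property (standard-library Python 
only; nothing here is Cassandra-specific):

import uuid

# Two version-1 (time-based) UUIDs generated back to back: they are distinct
# even when created in the same clock tick (the library bumps the timestamp /
# clock sequence), and their embedded 60-bit timestamps still order them.
a = uuid.uuid1()
b = uuid.uuid1()

print(a != b)            # True: unique, unlike a bare "20110130141500" string
print(a.time <= b.time)  # True: the time component still sorts them correctly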

From: Sameer Farooqui [mailto:cassandral...@gmail.com] 
Sent: Tuesday, June 14, 2011 8:16 PM
To: user@cassandra.apache.org
Subject: When does it make sense to use TimeUUID?

 

I would like to store some timestamped user info in a Column Family with the
usernames as the row key and different timestamps as column names. Each user
might have a thousand timestamped data.

 

I understand that the ver 1 UUIDs that Cassandra combines the MAC address of
the computer generating the UUID with the number of 100-nanosecond intervals
since the beginning of the Gregorian calendar.

 

So, if user1 had data stored for an event at Jan 30, 2011/2:15pm and user2
had an event at the exact same time, the data could potentially be stored in
different column names? So, I would have to know the MAC of the generating
computer in order to do a column slice, right? 

 

When does it make sense to use TimeUUID vs just a time string like
20110130141500 and comparator type UTF8?

 

- Sameer



RE: When does it make sense to use TimeUUID?

2011-06-14 Thread Kevin
Correction. TimeUUID comparisons FIRST compare the  time-based portion, then
go on to the other portion.

 

From: Sameer Farooqui [mailto:cassandral...@gmail.com] 
Sent: Tuesday, June 14, 2011 8:16 PM
To: user@cassandra.apache.org
Subject: When does it make sense to use TimeUUID?

 

I would like to store some timestamped user info in a Column Family with the
usernames as the row key and different timestamps as column names. Each user
might have a thousand timestamped data.

 

I understand that the ver 1 UUIDs that Cassandra combines the MAC address of
the computer generating the UUID with the number of 100-nanosecond intervals
since the beginning of the Gregorian calendar.

 

So, if user1 had data stored for an event at Jan 30, 2011/2:15pm and user2
had an event at the exact same time, the data could potentially be stored in
different column names? So, I would have to know the MAC of the generating
computer in order to do a column slice, right? 

 

When does it make sense to use TimeUUID vs just a time string like
20110130141500 and comparator type UTF8?

 

- Sameer



Re: When does it make sense to use TimeUUID?

2011-06-14 Thread Sameer Farooqui
Cool, thanks for the Clarification, Kevin.


On Tue, Jun 14, 2011 at 5:43 PM, Kevin  wrote:

> Correction. TimeUUID comparisons FIRST compare the  time-based portion,
> then go on to the other portion.
>
>
>

On Tue, Jun 14, 2011 at 5:41 PM, Kevin  wrote:

> TimeUUIDs should be used for data that is time-based and requires
> uniqueness.
>
>
>
> TimeUUID comparisons compare the time-based portion of the UUID. So no, you
> do not need to know the MAC addresses. In fact, for languages that cannot
> get to that low of a level to access a MAC address (like Java), the timeUUID
> tools generate random data for that part of the UUID.
>
>
>
> I don’t understand your “user1”/”user2” scenario. The timeUUIDs in that
> scenario wouldn’t even come in to question because the columns would be in
> two different rows since they pertain to two different users (unless they
> are referencing some other Column Family where those same TimeUUIDs are rows
> or columns in the same row).
>
>
>
> A good example of something that timeUUIDs would be great for would be
> friend requests. Regular time-based strings would not be sufficient in this
> case since it’s possible that two requests can be sent from two different
> computers at the same time. Thus, you can store the requests as columns (or
> super columns) each named by a timeUUID. Later (assuming you’ve chosen to
> let Cassandra sort the columns or supercolumns by timeUUID), you can fetch
> all the requests for a given user in either most-recent, or least-recent
> order.
>
> *From:* Sameer Farooqui [mailto:cassandral...@gmail.com]
> *Sent:* Tuesday, June 14, 2011 8:16 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* When does it make sense to use TimeUUID?
>
>
>
> I would like to store some timestamped user info in a Column Family with
> the usernames as the row key and different timestamps as column names. Each
> user might have a thousand timestamped data.
>
>
>
> I understand that the ver 1 UUIDs that Cassandra combines the MAC address
> of the computer generating the UUID with the number of 100-nanosecond
> intervals since the beginning of the Gregorian calendar.
>
>
>
> So, if user1 had data stored for an event at Jan 30, 2011/2:15pm and user2
> had an event at the exact same time, the data could potentially be stored in
> different column names? So, I would have to know the MAC of the generating
> computer in order to do a column slice, right?
>
>
>
> When does it make sense to use TimeUUID vs just a time string like
> 20110130141500 and comparator type UTF8?
>
>
>
> - Sameer
>


Multi data center configuration - A question on read correction

2011-06-14 Thread Selva Kumar
I have set up a multi-data-center configuration in Cassandra. My primary 
intention is to minimize the network traffic between DC1 and DC2. I want DC1 read 
requests to be served without reaching DC2 nodes. After going through the 
documentation, I felt the following setup would do. 



Replica Placement Strategy: NetworkTopologyStrategy 
Replication Factor: 3 
strategy_options: 
DC1 : 2 
DC2 : 1 
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch 
Read Consistency Level: LOCAL_QUORUM 
Write Consistency Level: LOCAL_QUORUM 

File: cassandra-topology.properties 
# Cassandra Node IP=Data Center:Rack 
10.10.10.149=DC1:RAC1 
10.10.10.150=DC1:RAC1 
10.10.10.151=DC1:RAC1 

10.20.10.153=DC2:RAC1 
10.20.10.154=DC2:RAC1 
# default for unknown nodes 
default=DC1:RAC1 

Question I have: 
1. I created a Java program to test. It was querying with consistency level 
LOCAL_QUORUM on a DC1 node. The read count (through cfstats) on the DC2 node showed 
that reads happened there too. Is this because of read correction? Is there a way to 
avoid read correction on DC2 nodes when we query DC1 nodes? 


Thanks 
Selva 

Re: Multi data center configuration - A question on read correction

2011-06-14 Thread Jonathan Ellis
That's just read repair sending MD5s of the data for comparison.  So
net traffic is light.

You can turn off RR but the downsides can be large.  Turning it down
to, say, 10% can be reasonable though.

But again, if network traffic is your concern you should be fine.
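
For example, read repair is a per-column-family setting; a sketch of lowering 
it to 10% with pycassa's SystemManager (the keyspace and CF names are 
placeholders, and this assumes SystemManager passes read_repair_chance straight 
through as a CfDef attribute):

from pycassa.system_manager import SystemManager

# Hypothetical names; lowers how often reads trigger digest comparisons
# against the other replicas (including the remote DC) to 10%.
sys_mgr = SystemManager('10.10.10.149:9160')
sys_mgr.alter_column_family('MyKeyspace', 'MyColumnFamily', read_repair_chance=0.1)
sys_mgr.close()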

On Tue, Jun 14, 2011 at 8:44 PM, Selva Kumar  wrote:
> I have setup a multiple data center configuration in Cassandra. My primary
> intention is to minimize the network traffic between DC1 and DC2. Want DC1
> read requests be served with out reaching DC2 nodes. After going through
> documentation, i felt following setup would do.
>
>
> Replica Placement Strategy: NetworkTopologyStrategy
> Replication Factor: 3
> strategy_options:
> DC1 : 2
> DC2 : 1
> endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
> Read Consistency Level: LOCAL_QUORUM
> Write Consistency Level: LOCAL_QUORUM
>
> File: cassandra-topology.properties
> # Cassandra Node IP=Data Center:Rack
> 10.10.10.149=DC1:RAC1
> 10.10.10.150=DC1:RAC1
> 10.10.10.151=DC1:RAC1
>
> 10.20.10.153=DC2:RAC1
> 10.20.10.154=DC2:RAC1
> # default for unknown nodes
> default=DC1:RAC1
>
> Question I have:
> 1. Created a java program to test. It was querying with consistency level
> LOCAL_QUORUM on a DC1 node. Read count(Through cfstats) on the DC2 node
> showed read happened there too. Is it because of read correction?. Is there
> way to avoid doing read correction in DC2 nodes, when we query DC1 nodes.
>
> Thanks
> Selva



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: one way to make counter delete work better

2011-06-14 Thread Yang
I almost got the code done, should release in a bit.



Your scenario is not a problem with the implementation, but really
with the definition of "same time". Remember that in a distributed system, there
is no absolute physical time; time is just another way of saying
"before or after". In your scenario, since DCA and DCB are cut off, and
there are no messages between them, you can NOT determine logically whether
you should say the delete is before the +3 or after it. You may say "hey, the
timestamp I gave the +3 is higher", but DCA may say: "your timestamp is just
drifted, actually my delete happened later".

In fact, here is a stronger reason that you have to let go of the +3: it
might already be the merge of a +1, which happened in physical time
earlier than our DCA delete, and a +2, which happened after the DCA delete.
Now what would you say about whether the +3 is before or after our DCA
delete? The only correct way to order them is to say: "sorry DCB, you missed
the delete; all your later +2 operations were just a snapshot earlier in
time, and the eventual result is the delete. In other words, it is futile
to update on a dead epoch while others have started a new one." This is the
same dilemma that you face during sstable merging.

Overall, I think it's easier to understand it if we realize that once you
delete, all further edits on the counter are futile; bumping the epoch is another way of
saying we create a completely new counter, and the counter name we are using is
just a kind of alias.


yang


On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time
> snapshots of
> >> the same counter state)
> >> node A  add 1clock:  0.100  (epoch = 0, clock number = 100)
> >> node A  deleteclock:  0.200
> >> node B add 2 clock:  0.300
> >> node Agets B's state:  add 2 clock 0.300, but rejects it because A
> has
> >> already produced a delete, with epoch of 0, so A considers epoch 0
> already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node Bgets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1[A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >> if Q.expected_epoch > P.expected_epoch
> >>   /** an epoch bump inherently means a previous del

Re: one way to make counter delete work better

2011-06-14 Thread Yang
in "stronger reason", I mean the +3 is already merged up in memtable of node
B, you can't find +1 and +2 any more



On Tue, Jun 14, 2011 at 7:02 PM, Yang  wrote:

> I almost got the code done, should release in a bit.
>
>
>
> your scenario is not a problem concerned with implementation, but really
> with definition of "same time". remember that in a distributed system, there
> is no absolute physical time concept, time is just another way of saying
> "before or after". in your scenario, since DCA and DCB are cut off, and
> there are no messages between them, you can NOT determine logically whether
> you should say the delete is before +3 or after it. you may say "hey, the
> timestamp I gave +3 is higher", but DCA may say:" your timestamp is just
> drifted, actually my delete happened later"
>
> in fact here is a stronger reason that you have to let go of the +3,
> because it might have already been merged up by +1 , which happened in
> physical time earlier than our DCA delete, and a +2 which happened after the
> DCA delete, now what would you say about whether the +3 is before or after
> our DCA delete? the only correct way to order them is to say:" sorry DCB:
> you missed the delete, all your latter +2 operations were just a snapshot
> earlier in time, the eventual result is the delete.  in other words, it
> is futile to update on a dead epoch while others have started a new one".
> this is the same dilemma that you face during sstable merging
>
> overall, I think it's easier to understand it if we realize that once you
> delete, all further edits on the counter is futile, epoch is another way of
> saying creating a completely new counter, the counter name we are using is
> just kind of an alias.
>
>
> yang
>
>
> On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne 
> wrote:
>
>> Who assigns those epoch numbers ?
>> You need all nodes to agree on the epoch number somehow to have this work,
>> but then how do you maintain those in a partition tolerant distributed
>> system ?
>>
>> I may have missed some parts of your proposal but let me consider a
>> scenario
>> that we have to be able to handle: consider two nodes A and B (RF=2) each
>> in
>> one data center (DCA and DCB) and a counter c. Suppose you do a +2
>> increment
>> on c that both nodes get. Now let say you have a network split and the
>> connection
>> between your 2 data center fails. In DCA you delete c, only A gets it.
>> In DCB, you
>> do more increments on c (say +3), only B gets it. The partition can
>> last for hours.
>> For deletion to work, we would need that whenever the network
>> partition is resolved,
>> both node eventually agree on the value 3 (i.e, only the second
>> increment).
>> I don't see how you could assign epoch numbers or anything to fix that.
>>
>> --
>> Sylvain
>>
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
>> > ok, I think it's better to understand it this way, then it is really
>> simple
>> > and intuitive:
>> > my proposed way of counter update can be simply seen as a combination of
>> > regular columns + current counter columns:
>> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
>> epoch
>> > number]
>> > then within each epoch, counter updates work as currently implemented
>> >
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
>> >>
>> >> I think this approach also works for your scenario:
>> >> I thought that the issue is only concerned with merging within the same
>> >> leader; but you pointed out
>> >> that a similar merging happens between leaders too, now I see that the
>> >> same rules on epoch number
>> >> also applies to inter-leader data merging, specifically in your case:
>> >>
>> >> everyone starts with epoch of 0, ( they should be same, if not, it also
>> >> works, we just consider them to be representing diffferent time
>> snapshots of
>> >> the same counter state)
>> >> node A  add 1clock:  0.100  (epoch = 0, clock number = 100)
>> >> node A  deleteclock:  0.200
>> >> node B add 2 clock:  0.300
>> >> node Agets B's state:  add 2 clock 0.300, but rejects it because A
>> has
>> >> already produced a delete, with epoch of 0, so A considers epoch 0
>> already
>> >> ended, it won't accept any replicated state with epoch < 1.
>> >> node Bgets A's delete  0.200,  it zeros its own count of "2", and
>> >> updates its future expected epoch to 1.
>> >> at this time, the state of system is:
>> >> node A expected epoch =1  [A:nil] [B:nil]
>> >> same for node B
>> >>
>> >>
>> >> let's say we have following further writes:
>> >> node B  add 3  clock  1.400
>> >> node A adds 4  clock 1.500
>> >> node B receives A's add 4,   node B updates its copy of A
>> >> node A receives B's add 3,updates its copy of B
>> >>
>> >> then state is:
>> >> node A  , expected epoch == 1[A:4  clock=400] [B:3   clock=500]
>> >> node B same
>> >>
>> >>
>> >> generally I think it should be complete if we add the following rule
>> for
>> >> inter-leader replic

Re: one way to make counter delete work better

2011-06-14 Thread Yang
Yes, the epoch is generated by each node in the replica set, upon a delete
operation.

The epoch is **global** to the replica set, for one counter, in contrast to the
clock, which is local to a partition.
Different counters have different epoch numbers, because different counters
can be seen as completely different state machines; you can view
the nodes in the RF as acting as a separate node for each counter, i.e.
there are effectively millions of replica sets, one per counter.

In fact we already have the epoch concept here, in the
timestampOfLastDelete, but the latter is used in the wrong way: it should
never be compared to timestamp().
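
To make the rule concrete, here is a toy sketch in plain Python (not the
actual patch; the class and method names are made up purely for illustration):

class CounterState(object):
    """One counter as seen by one replica: an epoch plus per-leader partial counts."""

    def __init__(self):
        self.epoch = 0
        self.counts = {}                 # leader id -> partial count

    def local_delete(self):
        self.epoch += 1                  # a delete starts a brand-new epoch
        self.counts = {}                 # wipe every leader's bucket to nil

    def local_add(self, leader, delta):
        self.counts[leader] = self.counts.get(leader, 0) + delta

    def merge(self, remote_epoch, leader, count):
        if remote_epoch > self.epoch:    # we missed a delete: apply it first
            self.epoch = remote_epoch
            self.counts = {}
        if remote_epoch == self.epoch:   # same epoch: reconcile that leader's partial count
            # simplified here as max; the real code would keep the higher-clock value
            self.counts[leader] = max(self.counts.get(leader, 0), count)
        # remote_epoch < self.epoch: stale pre-delete state, ignore it

    def value(self):
        return sum(self.counts.values())

The point is just that an epoch bump carries the delete to replicas that
missed it, after which increments reconcile as they do today.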




On Tue, Jun 14, 2011 at 12:26 PM, Milind Parikh wrote:

> If I understand this correctly, then the epoch integer would be
> generated by each node. Since time always flows forward, the assumption
> would be, I suppose, that the epochs would be tagged with the node that
> generated them and additionally the counter would carry as much history as
> necessary (and presumably not all history at all times).
>
> Milind
>
>
> On Tue, Jun 14, 2011 at 2:21 PM, Sylvain Lebresne wrote:
>
>> Who assigns those epoch numbers ?
>> You need all nodes to agree on the epoch number somehow to have this work,
>> but then how do you maintain those in a partition tolerant distributed
>> system ?
>>
>> I may have missed some parts of your proposal but let me consider a
>> scenario
>> that we have to be able to handle: consider two nodes A and B (RF=2) each
>> in
>> one data center (DCA and DCB) and a counter c. Suppose you do a +2
>> increment
>> on c that both nodes get. Now let say you have a network split and the
>> connection
>> between your 2 data center fails. In DCA you delete c, only A gets it.
>> In DCB, you
>> do more increments on c (say +3), only B gets it. The partition can
>> last for hours.
>> For deletion to work, we would need that whenever the network
>> partition is resolved,
>> both node eventually agree on the value 3 (i.e, only the second
>> increment).
>> I don't see how you could assign epoch numbers or anything to fix that.
>>
>> --
>> Sylvain
>>
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
>> > ok, I think it's better to understand it this way, then it is really
>> simple
>> > and intuitive:
>> > my proposed way of counter update can be simply seen as a combination of
>> > regular columns + current counter columns:
>> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
>> epoch
>> > number]
>> > then within each epoch, counter updates work as currently implemented
>> >
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
>> >>
>> >> I think this approach also works for your scenario:
>> >> I thought that the issue is only concerned with merging within the same
>> >> leader; but you pointed out
>> >> that a similar merging happens between leaders too, now I see that the
>> >> same rules on epoch number
>> >> also applies to inter-leader data merging, specifically in your case:
>> >>
>> >> everyone starts with epoch of 0, ( they should be same, if not, it also
>> >> works, we just consider them to be representing diffferent time
>> snapshots of
>> >> the same counter state)
>> >> node A  add 1clock:  0.100  (epoch = 0, clock number = 100)
>> >> node A  deleteclock:  0.200
>> >> node B add 2 clock:  0.300
>> >> node Agets B's state:  add 2 clock 0.300, but rejects it because A
>> has
>> >> already produced a delete, with epoch of 0, so A considers epoch 0
>> already
>> >> ended, it won't accept any replicated state with epoch < 1.
>> >> node Bgets A's delete  0.200,  it zeros its own count of "2", and
>> >> updates its future expected epoch to 1.
>> >> at this time, the state of system is:
>> >> node A expected epoch =1  [A:nil] [B:nil]
>> >> same for node B
>> >>
>> >>
>> >> let's say we have following further writes:
>> >> node B  add 3  clock  1.400
>> >> node A adds 4  clock 1.500
>> >> node B receives A's add 4,   node B updates its copy of A
>> >> node A receives B's add 3,updates its copy of B
>> >>
>> >> then state is:
>> >> node A  , expected epoch == 1[A:4  clock=400] [B:3   clock=500]
>> >> node B same
>> >>
>> >>
>> >> generally I think it should be complete if we add the following rule
>> for
>> >> inter-leader replication:
>> >> each leader keeps a var in memory (and also persist to sstable when
>> >> flushing)  expected_epoch , initially set to 0
>> >> node P does:
>> >> on receiving updates from  node Q
>> >> if Q.expected_epoch > P.expected_epoch
>> >>   /** an epoch bump inherently means a previous delete,
>> which
>> >> we probably missed , so we need to apply the delete
>> >>   a delete is global to all leaders, so apply it on all
>> my
>> >> replicas **/
>> >>  for all leaders in my vector
>> >>   count = nil
>> >>
>> >>  P.expected_epoch =  Q.expected_epoch
>> >> if Q.expected_epoch == P.expected_epo

Re: Docs: Token Selection

2011-06-14 Thread AJ

Yes, which means that the ranges overlap each other.

Is this just a convention, or is it technically required when using 
NetworkTopologyStrategy?  Would it be acceptable to split the ranges 
into quarters by ignoring the data centers, such as:


DC1
node 1 = 0  Range: (12, 16], (0, 0]
node 2 = 4  Range: (0, 4]

DC2
node 3 = 8  Range: (4, 8]
node 4 = 12   Range: (8, 12]

If this is OK, are there any drawbacks to this?



On 6/14/2011 6:10 PM, Vijay wrote:

Yes... Thats right...  If you are trying to say the below...

DC1
Node1 Owns 50%

(Ranges 8..4 -> 8..5 & 8..5 -> 0)

Node2 Owns 50%

(Ranges 0 -> 1 & 1 -> 8..4)


DC2
Node1 Owns 50%

(Ranges 8..5 -> 0 & 0 -> 1)

Node2 Owns 50%

(Ranges 1 -> 8..4 & 8..4 -> 8..5)


Regards,




On Tue, Jun 14, 2011 at 3:47 PM, AJ > wrote:


This http://wiki.apache.org/cassandra/Operations#Token_selection
 says:

"With NetworkTopologyStrategy, you should calculate the tokens the
nodes in each DC independantly."

and gives the example:

DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865


So, according to the above, the token ranges would be (abbreviated
nums):

DC1
node 1 = 0  Range: (8..4, 16], (0, 0]
node 2 = 8..4   Range: (0, 8..4]

DC2
node 3 = 1  Range: (8..5, 16], (0, 1]
node 4 = 8..5   Range: (1, 8..5]


If the above is correct, then I would be surprised as this
paragraph is the only place were one would discover this and may
be easy to miss... unless there's a doc buried somewhere in plain
view that I missed.

So, have I interpreted this paragraph correctly?  Was this design
to help keep data somewhat localized if that was important, such
as a geographically dispersed DC?

Thanks!






Re: Migration question

2011-06-14 Thread Marcos Ortiz



On 6/14/2011 1:43 PM, Eric Czech wrote:

Thanks Aaron.  I'll make sure to copy the system tables.

Another thing -- do you have any suggestions on raid configurations 
for main data drives?  We're looking at RAID5 and 10 and I can't seem 
to find a convincing argument one way or the other.
Well, I learned from administering other databases (like PostgreSQL and 
Oracle) that RAID 10 is the best solution for data. With RAID 5, the 
disks suffer a lot from the excessive I/O, and it can lead to data loss. 
You can search for the "RAID 5 write hole" to read more about this.



Thanks again for your help.

On Mon, Jun 6, 2011 at 5:45 AM, aaron morton > wrote:


Sounds like you are OK to turn off the existing cluster first.

Assuming so, deliver any hints using JMX then do a nodetool flush
to write out all the memtables and checkpoint the commit logs. You
can then copy the data directories.

The System data directory contains the nodes token and the schema,
you will want to copy this directory. You may also want to copy
the cassandra.yaml or create new ones with the correct initial tokens.

The nodes will sort themselves out when they start up and get new
IP's, the important thing to them is the token.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 6 Jun 2011, at 23:25, Eric Czech wrote:

> Hi, I have a quick question about migrating a cluster.
>
> We have a cassandra cluster with 10 nodes that we'd like to move
to a new DC and what I was hoping to do is just copy the SSTables
for each node to a corresponding node in the new DC (the new
cluster will also have 10 nodes).  Is there any reason that a
straight file copy like this wouldn't work?  Do any system tables
need to be moved as well or is there anything else that needs to
be done?
>
> Thanks!




--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: New web client & future API

2011-06-14 Thread MW | Codefreun.de
Ok, many thanks.

I can remember a post (I think it was Jonathan) where they wanted to get
away from Thrift because of its weak development.

Markus ;)



-Original Message-
From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Wednesday, 15 June 2011 00:05
To: user@cassandra.apache.org
Subject: Re: New web client & future API

AFAIK...

Avro is dead. 

Thrift is the current API and currently the only full featured API. 

CQL is a possible future API, given community support and development time
it may become the only API. The initial release is not feature complete
(e.g. missing some DDL  statements) and still uses thrift as the wire
protocol. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jun 2011, at 02:01, Markus Wiesenbacher | Codefreun.de wrote:

> Yes, I wanted to start from the base ...
> 
> 
> Am 14.06.2011 um 15:48 schrieb Sasha Dolgy :
> 
>> Your application is built with the thrift bindings and not with a 
>> higher level client like Hector?
>> 
>> On Tue, Jun 14, 2011 at 3:42 PM, Markus Wiesenbacher | Codefreun.de 
>>  wrote:
>>> 
>>> Hi,
>>> 
>>> what is the future API for Cassandra? Thrift, Avro, CQL?
>>> 
>>> I just released an early version of my web client
>>> (http://www.codefreun.de/apollo) which is Thrift-based, and 
>>> therefore I would like to know what the future is ...
>>> 
>>> Many thanks
>>> MW
>>> 
>> 
>> 
>> 
>> --
>> Sasha Dolgy
>> sasha.do...@gmail.com



Re: Cassandra Statistics and Metrics

2011-06-14 Thread Viktor Jevdokimov
http://www.kjkoster.org/zapcat/Zapcat_JMX_Zabbix_Bridge.html

2011/6/14 Marcos Ortiz 

>  Where I can find the source code?
>
> El 6/14/2011 10:13 AM, Viktor Jevdokimov escribió:
>
> We're using open source monitoring solution Zabbix from
> http://www.zabbix.com/ using zapcat - not only for Cassandra but for the
> whole system.
>
>  As MX4J tools plugin is supported by Cassandra, support of zapcat in
> Cassandra by default is welcome - we have to use a wrapper to start zapcat
> agent.
>
> 2011/6/14 Marcos Ortiz 
>
>> Regards to all.
>> My team and me here on the University are working on a generic solution
>> for Monitoring and Capacity Planning for Open Sources Databases, and one of
>> the NoSQL db that we choosed to give it support is Cassandra.
>> Where I can find all the metrics and statistics of Cassandra? I'm thinking
>> for example:
>> - Available space
>> - Number of CF
>> and all kind of metrics
>>
>> We are using for this development: Python + Django + Twisted + Orbited +
>> jQuery. The idea behind is to build a Comet-based web application on top of
>> these technologies.
>> Any advice is welcome
>>
>> --
>> Marcos Luís Ortíz Valmaseda
>>  Software Engineer (UCI)
>>  http://marcosluis2186.posterous.com
>>  http://twitter.com/marcosluis2186
>>
>>
>
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>
>


Re: possible 'coming back to life' bug with counters

2011-06-14 Thread Viktor Jevdokimov
What if it is OK for our case and we need counters with TTL?
For us, counters and TTL are both important. After a column has expired it is
not important what value the counter will have.
Scanning millions of rows just to delete expired ones is not a solution.

2011/6/14 Sylvain Lebresne 

> As listed here: http://wiki.apache.org/cassandra/Counters, counter
> deletion is
> provided as a convenience for permanent deletion of counters but, because
> of the design of counters, it is never safe to issue an increment on a
> counter that
> has been deleted (that is, you will experience back to life behavior
> sometimes in
> that case).
> More precisely, you'd have to wait long enough after a deletion to start
> incrementing the counter again. But in the worst cases, long enough is
> something
> like gc_grace_seconds + major compaction.
>
> This is *not* something that is likely to change anytime soon (I don't
> think this is
> fixable with the current design for counters).
>
> --
> Sylvain
>
> On Sat, Jun 11, 2011 at 3:54 AM, David Hawthorne 
> wrote:
> > Please take a look at this thread over in the hector-users mailing list:
> >
> http://groups.google.com/group/hector-users/browse_thread/thread/99835159b9ea1766
> > It looks as if the deleted columns are coming back to life when they
> > shouldn't be.
> > I don't want to open a bug on something if it's already got one that I
> just
> > couldn't find when I scanned the list of open bugs.
> > I'm using hector 0.8 against cassandra 0.8 release.  I can give you
> whatever
> > logs or files you'd like.
>


Re: one way to make counter delete work better

2011-06-14 Thread Yang
patch in https://issues.apache.org/jira/browse/CASSANDRA-2774

Some of the code is messy and intended for demonstration only; we could refine
it after we agree this is a feasible way to go.


Thanks
Yang

On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang  wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang  wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time
> snapshots of
> >> the same counter state)
> >> node A  add 1clock:  0.100  (epoch = 0, clock number = 100)
> >> node A  deleteclock:  0.200
> >> node B add 2 clock:  0.300
> >> node Agets B's state:  add 2 clock 0.300, but rejects it because A
> has
> >> already produced a delete, with epoch of 0, so A considers epoch 0
> already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node Bgets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1[A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >> if Q.expected_epoch > P.expected_epoch
> >>   /** an epoch bump inherently means a previous delete,
> which
> >> we probably missed , so we need to apply the delete
> >>   a delete is global to all leaders, so apply it on all
> my
> >> replicas **/
> >>  for all leaders in my vector
> >>   count = nil
> >>
> >>  P.expected_epoch =  Q.expected_epoch
> >> if Q.expected_epoch == P.expected_epoch
> >>  update P's copy of Q according to standard rules
> >> /** if Q.expected_epoch < P.expected_epoch  , that means Q is
> less
> >> up to date than us, just ignore
> >>
> >> replicate_on_write(to Q):
> >>   if  P.operation == delete
> >> P.expected_epoch ++
> >> set all my copies of all leaders to nil
> >>   send to Q ( P.total , P.expected_epoch)
> >>
> >>
> >>
> >> overall I don't think delete being not commutative is a fundamental
> >> blocker : regular columns are also not commutative, yet we achieve
> stable
> >> result no matter what order they are applied, because of the ordering
> rule
> >> used in reconciliation; here we just need to find a similar ordering
> rule.
> >> the epoch thing could be a step on this direction.
> >>
> >> Thanks
> >> Yang
> >>
> >>
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis 
> wrote:
> >>>
>