Re: Multi-tenancy, and authentication and authorization

2011-01-20 Thread David Boxenhorn
As far as I can tell, if Cassandra supports three levels of configuration
(server, keyspace, column family) we can support multi-tenancy. It is
trivial to give each tenant their own keyspace (e.g. just use the tenant's
id as the keyspace name) and let them go wild. (Any out-of-bounds behavior
on the CF level will be stopped at the keyspace and server level before
doing any damage.)

I don't think Cassandra needs to know about end-users. From Cassandra's
point of view the tenant is the user.
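The scheme above can be sketched in a few lines. Everything below is illustrative: the limit names, numbers, and functions are assumptions for the sketch, not Cassandra APIs.

```python
# Sketch of the keyspace-per-tenant idea plus three-level limit resolution.
# All names and numbers are illustrative; none of this is Cassandra API.

SERVER_LIMITS = {"max_columns_per_row": 1_000_000}   # hardware ceiling (top priority)
KEYSPACE_LIMITS = {                                  # paid-for quota (second priority)
    "tenant_42": {"max_columns_per_row": 100_000},
}

def keyspace_for(tenant_id: str) -> str:
    """Trivially map a tenant id onto its own keyspace name."""
    return f"tenant_{tenant_id}"

def effective_limit(tenant_id: str, cf_requested: int,
                    key: str = "max_columns_per_row") -> int:
    """CF-level settings are clamped by the keyspace quota, which is in
    turn clamped by the server limit."""
    ks_limit = KEYSPACE_LIMITS.get(keyspace_for(tenant_id), {}) \
                              .get(key, SERVER_LIMITS[key])
    return min(SERVER_LIMITS[key], ks_limit, cf_requested)

print(keyspace_for("42"))              # tenant_42
print(effective_limit("42", 500_000))  # 100000: the keyspace quota wins
```

The point of the min() is exactly the "out-of-bounds behavior stopped at the keyspace and server level" argument: a tenant can request whatever it likes at the CF level without doing damage.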

On Thu, Jan 20, 2011 at 7:00 AM, indika kumara wrote:

> +1. Are there JIRAs for these requirements? I would like to contribute
> within my capacity.
>
> As I understand it, to support some multi-tenant models, keyspace names,
> CF names, etc. need to be qualified with the tenant namespace (or id). The
> easiest way to do this would be to modify the corresponding constructs
> transparently. I thought of a stage (optional and configurable) prior to
> authorization. Are there any better solutions? I appreciate the
> community's suggestions.
>
> Moreover, it is necessary to send the tenant namespace (id) with the user
> credentials (a user belongs to this tenant/organization). For that purpose, I
> thought of using the user credentials in the AuthenticationRequest. Is there
> any better solution?
>
> I would like to have multi-tenancy (MT) support at the Cassandra level that
> is optional and configurable.
>
> Thanks,
>
> Indika
>
>
> On Wed, Jan 19, 2011 at 7:40 PM, David Boxenhorn wrote:
>
>> Yes, the way I see it - and it becomes even more necessary for a
>> multi-tenant configuration - there should be completely separate
>> configurations for applications and for servers.
>>
>> - Application configuration is based on data and usage characteristics of
>> your application.
>> - Server configuration is based on the specific hardware limitations of
>> the server.
>>
>> Obviously, server limitations take priority over application
>> configuration.
>>
>> Assuming that each tenant in a multi-tenant environment gets one keyspace,
>> you would also want to enforce limitations per keyspace (which correspond
>> to the parameters that the tenant paid for).
>>
>> So now we have three levels:
>>
>> 1. Server configuration (top priority)
>> 2. Keyspace configuration (paid-for service - second priority)
>> 3. Column family configuration (configuration provided by tenant - third
>> priority)
>>
>>
>> On Wed, Jan 19, 2011 at 3:15 PM, indika kumara wrote:
>>
>>> As the actual problem is mostly related to the number of CFs in the
>>> system (and perhaps the number of columns), I still believe that exposing
>>> Cassandra ‘as-is’ to a tenant is doable and suitable, though it needs some
>>> fixes. That multi-tenancy model allows a tenant to use Cassandra's
>>> programming model ‘as-is’, enabling seamless migration of an application
>>> that uses Cassandra into the cloud. Moreover, in order to support the
>>> different SLA requirements of different tenants, the configurability of
>>> keyspaces, CFs, etc., per tenant may be critical. However, there are
>>> trade-offs among usability, memory consumption, and performance. I believe
>>> it is important to consider the SLA requirements of different tenants when
>>> deciding on strategies for controlling resource consumption.
>>>
>>> I like the idea of system-wide parameters for controlling resource
>>> usage, and I believe tenant-specific parameters are equally important.
>>> There are resources, and each tenant can claim a portion of them based on
>>> an SLA. For instance, if there is a threshold on the number of columns per
>>> node, it should be possible to decide how many columns a particular tenant
>>> can have. That allows selecting a suitable Cassandra cluster for a tenant
>>> based on his or her SLA. I believe the capability to configure
>>> resource-controlling parameters per keyspace would be important to support
>>> a keyspace-per-tenant model. Furthermore, in order to maximize resource
>>> sharing among tenants, a threshold (on a resource) per keyspace should not
>>> be a hard limit. Rather, it should oscillate between a hard minimum and a
>>> maximum. For example, if a particular tenant needs more resources at a
>>> given time, he or she should be able to borrow from the others up to the
>>> maximum. The threshold is only considered when a tenant is assigned to a
>>> cluster - the remaining resources of the cluster should be equal to or
>>> higher than the resource limit of the tenant. It may be necessary to
>>> spread a single keyspace across multiple clusters, especially when there
>>> are not enough resources in a single cluster.
>>>
>>> I believe it would be better to have the flexibility to switch seamlessly
>>> among multi-tenancy implementation models such as Cassandra ‘as-is’, a
>>> keyspace per tenant, a keyspace for all tenants, and so on. Based on what
>>> I have learnt, each model requires adding the tenant id (namespace) to a
>>> keyspace’s name or CF’s name
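The elastic per-keyspace quota described above - a hard guaranteed minimum with borrowing up to a maximum - could look roughly like this. The cluster capacity, tenant names, and grant function are all hypothetical; Cassandra has no such mechanism built in.

```python
# Sketch of elastic per-tenant quotas: each tenant is guaranteed a hard
# minimum and may borrow idle capacity up to its maximum. Hypothetical
# model only; not a Cassandra feature.

CLUSTER_CAPACITY = 1000

tenants = {
    "a": {"min": 300, "max": 600, "used": 0},
    "b": {"min": 300, "max": 600, "used": 0},
}

def grant(tenant: str, amount: int) -> int:
    """Grant up to `amount` units, honoring this tenant's maximum and the
    other tenants' guaranteed minimums. Returns what was actually granted."""
    t = tenants[tenant]
    # Capacity that must stay reserved for the other tenants' minimums.
    reserved = sum(max(o["min"] - o["used"], 0)
                   for name, o in tenants.items() if name != tenant)
    free = CLUSTER_CAPACITY - reserved - sum(o["used"] for o in tenants.values())
    granted = max(min(amount, t["max"] - t["used"], free), 0)
    t["used"] += granted
    return granted

print(grant("a", 500))  # 500: within a's max, b's minimum stays reserved
print(grant("b", 600))  # 500: capped by the capacity a has already taken
```

This also shows why the per-keyspace threshold only matters at assignment time: the guaranteed minimums are what determine whether a cluster can accept another tenant.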

Re: Upgrading from 0.6 to 0.7.0

2011-01-20 Thread Daniel Josefsson
In our case our replication factor is more than half the number of nodes in
the cluster.

Would it be possible to do the following:

   - Upgrade half of them
   - Change the Thrift port and the inter-server port (is this the storage_port?)
   - Start them up
   - Upgrade clients one by one
   - Upgrade the rest of the servers

Or might we get some kind of data collision when still writing to the old
cluster as the new storage is being used?

/Daniel


Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
Hi,

I am using Cassandra for a Grails application and in that I start the
embedded server when the Spring application context gets built.

When I run my Grails app test suite, it first runs the integration and then
the functional test suite, and it builds the application context individually
for each phase.

When it brings up the embedded Cassandra server in the 2nd phase (for
functional tests), it fails saying "*Attempt to assign id to existing column
family.*"

Is anyone familiar with this error? Is it because both test phases are
executed in the same JVM instance and some Cassandra metadata from the
phase-1 server start is affecting the server startup in the 2nd phase?

Is there any way I can cleanly start the server twice in my case? Any other
suggestions? Thanks.

-- 
Roshan
Blog: http://roshandawrani.wordpress.com/
Twitter: @roshandawrani 
Skype: roshandawrani


Re: Multi-tenancy, and authentication and authorization

2011-01-20 Thread indika kumara
Thanks, David. We decided to do it on our client side as the initial
implementation. I will investigate approaches for supporting fine-grained
control of the resources consumed by a server, tenant, and CF.

Thanks,

Indika


Re: Multi-tenancy, and authentication and authorization

2011-01-20 Thread David Boxenhorn
I have added my comments to this issue:

https://issues.apache.org/jira/browse/CASSANDRA-2006

Good luck!


Use Cassandra to store 2 million records of persons

2011-01-20 Thread Surender Singh
Hi All

I want to use Apache Cassandra to store information (like first name, last
name, gender, address) about 2 million people, and then need to perform
analytics and reporting on that data.
Do I need to store the information about 2 million people in MySQL first and
then transfer it into Cassandra?

Please help me, as I am new to Apache Cassandra.

If you have a use case like that, please share.

Thanks and regards
Surender Singh


Re: Multi-tenancy, and authentication and authorization

2011-01-20 Thread Mimi Aluminium
Hi,

I have a question that somewhat related to the above.
Is there a tool that predicts the resource consumption (i.e, memory, disk,
CPU)  in an offline mode? Means it is given with the storage conf
parameters, ks, CFs and data model, and then application parameters such
read/write average rates. It should output the required sizes for memory,
disk etc.

I need to estimate costs for various configurations we might have and
thus  I am working on building "simple" excel  for my own data model  - but
then it came to my mind to ask wether something like that already exists.

BTW, I think such tool can also help for the issues that were discussed
before even though it will be built on averages which probably are no so
fine-grained but it can provide worse cases numbers to the application
that uses Cassandra
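A back-of-envelope estimator in the spirit of the spreadsheet described above might look like the following. The overhead factors are rough assumptions to be calibrated against a real cluster, not measured Cassandra constants.

```python
# Rough offline capacity estimator in the spirit of the "simple" Excel
# sheet described above. All overhead factors are guesses to calibrate.

def estimate_disk_gb(rows: int, cols_per_row: int, avg_col_bytes: int,
                     replication_factor: int = 3,
                     storage_overhead: float = 3.0,    # per-column metadata, indexes
                     compaction_headroom: float = 2.0) -> float:
    """Disk needed, with headroom for compaction's temporary copies."""
    raw = rows * cols_per_row * avg_col_bytes
    return raw * storage_overhead * replication_factor * compaction_headroom / 1e9

def estimate_memtable_mb(writes_per_sec: float, avg_mutation_bytes: int,
                         flush_interval_sec: int = 300) -> float:
    """Memory held by memtables between flushes, ignoring JVM overhead."""
    return writes_per_sec * avg_mutation_bytes * flush_interval_sec / 1e6

print(estimate_disk_gb(2_000_000, 60, 100))   # 216.0 GB for ~12 GB raw
print(estimate_memtable_mb(1000, 2000))       # 600.0 MB
```

As Miriam notes, such averages give worst-case envelopes rather than fine-grained predictions, but that is usually what cost estimation needs.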

Thanks,
Miriam


==
Miriam Allalouf


Under expectation response time for reads

2011-01-20 Thread George Ciubotaru
Hello,

We are in the process of evaluating Cassandra to be used with our product. I've
started with some performance tests, but unfortunately I'm getting very bad
results for read operations (around 200 ms per read request, which is much
more than what I've read Cassandra can deliver).

- I'm using the latest stable Cassandra binaries (Cassandra 0.7) on Windows
- My cluster has 3 nodes (on 3 separate machines), only one seed node and 
replication factor of 1
- I've used batch_mutate to insert around 50,000 keys with an average of 60
columns per key (no super-columns)
- I'm using C# client
- The read operation I've tested was: for a random key get all its columns 
(using get_slice)

I get 2 types of results:
- as expected (very fast, around 1 ms per read request) when the client is
running on one of the 3 machines and is connected to the local machine
- under expectation (200 ms per request) when the client is running on one of
the 3 machines but is connected to one of the other 2 machines (i.e., not the
local machine).

It might be a configuration issue, but I cannot figure it out.

Any suggestion?

Thank you,
George

Re: Use Cassandra to store 2 million records of persons

2011-01-20 Thread David Boxenhorn
Cassandra is not a good solution for data-mining-type problems, since it
doesn't support ad-hoc queries. Cassandra is designed to maximize throughput,
which is not usually the problem in data mining.



Compression in Cassandra

2011-01-20 Thread akshatbakli...@gmail.com
Hi all,

I am experiencing an unusual situation. I loaded some data into Cassandra.
My data was about 40 GB, but when loaded into Cassandra the data directory
size is almost 170 GB.

This means the **data got inflated**.

Is this just me, or is someone else also seeing this inflation, or is it the
general behavior of Cassandra?

I am using Cassandra 0.6.8. on Ubuntu 10.10

-- 
Akshat Bakliwal
Search Information and Extraction Lab
IIIT-Hyderabad
09963885762
WebPage



Re: Compression in Cassandra

2011-01-20 Thread Javier Canillas
How do you calculate your 40 GB of data? When you insert it into Cassandra,
you need to convert the data into a byte[]; maybe your problem is there.



Re: Compression in Cassandra

2011-01-20 Thread akshatbakli...@gmail.com
I just did a du -h on DataDump, which showed 40G,
and a du -h on CassandraDataDump, which showed 170G.

Am I doing something wrong?
Have you observed any compression in it?



-- 
Akshat Bakliwal
Search Information and Extraction Lab
IIIT-Hyderabad
09963885762
WebPage



Re: Use Cassandra to store 2 million records of persons

2011-01-20 Thread Surender Singh
David

Please suggest a solution for this.

Thanks and regards
Surender Singh



How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Patrick de Torcy
Hi,

I've read many, many docs (yes,
http://wiki.apache.org/cassandra/Operations too...) but I still can't see how
bootstrapping works...

I started with one node and put my data in it (16 GB). It's ok.

I added a second node with AutoBootstrap=true (as explained in the doc, I
didn't add this node to the seeds list). On the two nodes, I didn't specify
the initial token.

From the doc: "If you explicitly specify an InitialToken in the
configuration, the new node will bootstrap to that position on the ring.
Otherwise, it will pick a Token that will give it half the keys from the
node with the most disk space used, that does not already have another node
bootstrapping into its Range."
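That token-picking rule can be checked by hand. A sketch, assuming RandomPartitioner's 0..2^127 token space (the function is illustrative, not Cassandra's actual code):

```python
# Where an auto-bootstrapping node lands under RandomPartitioner: roughly
# the midpoint of the range owned by the most-loaded node, so it takes
# about half of that node's keys.

RING_SIZE = 2 ** 127  # RandomPartitioner token space

def midpoint(predecessor_token: int, node_token: int) -> int:
    """Midpoint of the range (predecessor, node], wrapping around the ring."""
    span = (node_token - predecessor_token) % RING_SIZE
    if span == 0:          # a single node owns the whole ring
        span = RING_SIZE
    return (predecessor_token + span // 2) % RING_SIZE

# With one existing node, the bootstrap target is the opposite side of
# the ring, i.e. half the keys.
print(midpoint(0, 0) == RING_SIZE // 2)  # True
```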

So the new node should be in charge of about half the keys. When I start the
second node, nothing happens: no data migrates to the new node (its data
folder is empty...)

Is it supposed to work that way, or have I missed something?

When I look at the ring, I can see my two nodes (the first one with 90%, the
second one with 10%).

I then tried to set initialToken values for both nodes (stopping and
restarting the servers), but it didn't change anything: I have the same token
values...

Please help, I'm going mad...

thanks,

Patrick


Re: Compression in Cassandra

2011-01-20 Thread Terje Marthinussen
Perfectly normal with a 3-7x increase in data size, depending on your data schema.

Regards,
Terje
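Terje's 3-7x figure comes largely from per-column overhead: in the (pre-compression) on-disk format, every column is stored with its full name, a timestamp, and length framing. The field sizes below are rough approximations to show the effect, not the exact layout:

```python
# Approximate per-column cost in the (pre-compression) on-disk format:
# every column carries its full name, a timestamp, and length framing.
# Field sizes here are rough approximations, not the exact layout.

def stored_bytes(value_len: int, name_len: int = 20) -> int:
    name = 2 + name_len    # length-prefixed column name
    value = 4 + value_len  # length-prefixed value
    timestamp = 8          # per-column timestamp
    flags = 1              # deletion/expiration flag
    return name + value + timestamp + flags

raw = 10                                  # a 10-byte value in the source data
print(stored_bytes(raw))                  # 45 bytes stored for 10 raw bytes
print(round(stored_bytes(raw) / raw, 1))  # 4.5x inflation, before replication
```

Small values with long column names inflate the most, which is why the factor depends so heavily on the schema.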



memory size and disk size prediction tool

2011-01-20 Thread Mimi Aluminium
Hi,

We are implementing a 'middlewear' layer to an underneath storage and
need to estimate costs for various system configurations.
Specifically, I want to estimate the resources (memory, disk) for our
data model.

Is there a tool that  given certain storage configuration parameters,
column family fields number and sizes and other details, and then
workload-dependant  parameters such as read/write average rates etc. can
predict the
resource consumption (i.e, memory, disk)  in an offline mode?

Thanks,
Miriam


Re: Distributed counters

2011-01-20 Thread Nate McCall
On the Hector side, we will be adding this to trunk (and thus moving
Hector trunk to Cassandra 0.8.x) in the next week or two.

On Wed, Jan 19, 2011 at 6:12 PM, Rustam Aliyev  wrote:
> Hi,
>
> Does anyone use the CASSANDRA-1072 counters patch with the 0.7 stable branch?
> I need this functionality but can't wait until 0.8.
>
> Also, does the Hector trunk version have any support for these counters?
> (This question is probably for the hector-users group, but most of us are
> here anyway.)
>
> Many thanks,
> Rustam Aliyev.
>
>


Re: Under expectation response time for reads

2011-01-20 Thread Miguel Verde
Disable Nagle's algorithm and you should see much better performance. It is
not used on loopback, which is why your local reads are fast.
http://markmail.org/message/rgauuflglwemm24o
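Concretely, disabling Nagle means setting TCP_NODELAY on the client's socket; how to do that depends on the client library (in C# it would be the socket's NoDelay property). A raw-socket sketch in Python, with the default Thrift port 9160 assumed:

```python
import socket

# Nagle's algorithm delays small writes while waiting for ACKs, which can
# add large per-request latency on small request/response round trips over
# a real network. TCP_NODELAY turns it off on the client socket.
def connect_nodelay(host: str, port: int = 9160) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    s.connect((host, port))
    return s

# The option can be verified without a live Cassandra node:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
s.close()
```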

On Thu, Jan 20, 2011 at 6:24 AM, George Ciubotaru <
george.ciubot...@weedle.com> wrote:

> Hello,
>
> We are in the process of evaluating Cassandra to be used with our product;
> I've started with some performance tests but unfortunately I'm getting very
> bad results for read operations (around 200 ms per read request which is
> much much more than what I'm reading that Cassandra can deliver).
>
> - I'm using the latest stable Cassandra binaries (Cassandra 0.7) on Windows
> - My cluster has 3 nodes (on 3 separate machines), only one seed node and
> replication factor of 1
> - I've use batch_mutate to insert around 50,000 keys with an average of 60
> columns per key (no super-column)
> - I'm using C# client
> - The read operation I've tested was: for a random key get all its columns
> (using get_slice)
>
> I have 2 types of results:
> - as expected (very fast, around 1 ms per read request) when the client is
> running on one of the 3 machines and is connected with the local machine
> - under expectation (200 ms per request) when the client is running on one
> of the 3 machines but is connected to one of the other 2 machines (except
> local machine).
>
> It might be a configuration issue, but I cannot figure it out.
>
> Any suggestion?
>
> Thank you,
> George


Cassandra automatic startup script on ubuntu

2011-01-20 Thread Sébastien Druon
Hello!

I am using cassandra on a ubuntu machine and installed it from the binary
found on the cassandra home page.
However, I did not find any scripts to start it up at boot time.

Where can I find this kind of script?

Thanks a lot in advance

Sebastien


Lost MUTATIONS on several Cassandra nodes - no impact on the client

2011-01-20 Thread Oleg Proudnikov
Hi All,

Could you please help me understand the impact of this behaviour?

I am running a 6-node 0.7-rc4 Cassandra cluster with RF=2.
6 Hector clients (one per node), running on the same servers, are performing
single-threaded batch loads at CL=ONE.

Client performs one simple small query and an insert batch mutation. Each
mutation inserts several dozen columns into 7 column families. Total amount of
data is 10-20KB. It appears that this load is a little bit heavy for the cluster
to handle. I do get an occasional single node OOM.

ISSUE. I see periodic lost mutations on some nodes as shown below. The client
does not receive an exception and the nodes do not go down.

xxx.xxx.xxx.140 grep MUTA log/cassandra.log
xxx.xxx.xxx.141 grep MUTA log/cassandra.log
 WARN [ScheduledTasks:1] 2011-01-18 13:19:03,918 MessagingService.java (line
545) Dropped 227 MUTATION messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-01-18 13:19:08,924 MessagingService.java (line
545) Dropped 958 MUTATION messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-01-18 13:52:37,616 MessagingService.java (line
545) Dropped 542 MUTATION messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-01-18 16:02:27,787 MessagingService.java (line
545) Dropped 273 MUTATION messages in the last 5000ms
xxx.xxx.xxx.142 grep MUTA log/cassandra.log
 WARN [ScheduledTasks:1] 2011-01-17 19:19:06,825 MessagingService.java (line
545) Dropped 699 MUTATION messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-01-17 19:19:06,860 MessagingService.java (line
545) Dropped 10 READ messages in the last 5000ms
 WARN [ScheduledTasks:1] 2011-01-18 04:01:05,464 MessagingService.java (line
545) Dropped 89 MUTATION messages in the last 5000ms
xxx.xxx.xxx.143 grep MUTA log/cassandra.log
xxx.xxx.xxx.144 grep MUTA log/cassandra.log
xxx.xxx.xxx.145 grep MUTA log/cassandra.log

Q1. Is it possible that Cassandra will drop both replicas for a given column
during these losses? Or does it guarantee that one replica is still written? 

Q2. What does the lack of client exception mean? Does it tell me that at least
one replica is written?

Q3. If I were to use CL=ALL, would I get an exception(s) on the client(s) for
those losses?

Q4. Considering that I did not get an exception, I will assume that one replica
is retained. Now, if the nodes stay up and the load on the cluster goes down,
will Cassandra attempt to create the 2nd replica? Or will the 2nd replica be
created on a read? Is there a way to recreate lost replicas in batch mode?

Thank you very much,
Oleg




Re: Multi-tenancy, and authentication and authorization

2011-01-20 Thread indika kumara
I do not have deep knowledge of Cassandra internals, but as far as I know
there is no such tool. I believe such a tool would be worthwhile.

Thanks,

Indika

On Thu, Jan 20, 2011 at 6:15 PM, Mimi Aluminium wrote:

> Hi,
>
> I have a question somewhat related to the above.
> Is there a tool that predicts the resource consumption (i.e., memory, disk,
> CPU) in an offline mode? That is, given the storage configuration
> parameters, keyspaces, CFs and data model, plus application parameters such
> as average read/write rates, it should output the required sizes for
> memory, disk, etc.
>
> I need to estimate costs for the various configurations we might have, and
> thus I am building a "simple" Excel sheet for my own data model - but
> then it came to my mind to ask whether something like that already exists.
>
> BTW, I think such a tool could also help with the issues that were discussed
> before; even though it will be built on averages, which are probably not so
> fine-grained, it can provide worst-case numbers to the application
> that uses Cassandra.
>
> Thanks,
> Miriam
>
>
> ==
> Miriam Allalouf
>
> On Thu, Jan 20, 2011 at 1:53 PM, indika kumara wrote:
>
>> Thanks, David. We decided to do it on our client side as the initial
>> implementation. I will investigate approaches for supporting fine-grained
>> control of the resources consumed by a server, tenant, and CF.
>>
>> Thanks,
>>
>> Indika
>>
>> On Thu, Jan 20, 2011 at 3:20 PM, David Boxenhorn wrote:
>>
>>> As far as I can tell, if Cassandra supports three levels of configuration
>>> (server, keyspace, column family) we can support multi-tenancy. It is
>>> trivial to give each tenant their own keyspace (e.g. just use the tenant's
>>> id as the keyspace name) and let them go wild. (Any out-of-bounds behavior
>>> on the CF level will be stopped at the keyspace and server level before
>>> doing any damage.)
>>>
>>> I don't think Cassandra needs to know about end-users. From Cassandra's
>>> point of view the tenant is the user.
>>>
>>> On Thu, Jan 20, 2011 at 7:00 AM, indika kumara wrote:
>>>
 +1   Are there JIRAs for these requirements? I would like to contribute
 from my capacity.

 As per my understanding, to support some multi-tenant models, it is
 necessary to qualify keyspaces' names, CFs' names, etc. with the tenant
 namespace (or id). The easiest way to do this would be to modify the
 corresponding constructs transparently. I thought of a stage (optional and
 configurable) prior to authorization. Are there any better solutions? I
 appreciate the community's suggestions.

 Moreover, it is necessary to send the tenant NS (id) with the user
 credentials (a user belongs to this tenant (org.)). For that purpose, I
 thought of using the user credentials in the AuthenticationRequest. Is there
 any better solution?

 I would like to have a MT support at the Cassandra level which is
 optional and configurable.

 Thanks,

 Indika


 On Wed, Jan 19, 2011 at 7:40 PM, David Boxenhorn wrote:

> Yes, the way I see it - and it becomes even more necessary for a
> multi-tenant configuration - there should be completely separate
> configurations for applications and for servers.
>
> - Application configuration is based on data and usage characteristics
> of your application.
> - Server configuration is based on the specific hardware limitations of
> the server.
>
> Obviously, server limitations take priority over application
> configuration.
>
> Assuming that each tenant in a multi-tenant environment gets one
> keyspace, you would also want to enforce limitations based on keyspace
> (which correspond to parameters that the tenant paid for).
>
> So now we have three levels:
>
> 1. Server configuration (top priority)
> 2. Keyspace configuration (paid-for service - second priority)
> 3. Column family configuration (configuration provided by tenant -
> third priority)
>
>
> On Wed, Jan 19, 2011 at 3:15 PM, indika kumara 
> wrote:
>
>> As the actual problem is mostly related to the number of CFs in the
>> system (and maybe the number of columns), I still believe that exposing
>> Cassandra 'as-is' to a tenant is doable and suitable, though it needs
>> some fixes. That multi-tenancy model allows a tenant to use Cassandra's
>> programming model 'as-is', enabling seamless migration of an application
>> that uses Cassandra into the cloud. Moreover, in order to support the
>> different SLA requirements of different tenants, the configurability of
>> keyspaces, CFs, etc., per tenant may be critical. However, there are
>> trade-offs among usability, memory consumption, and performance. I
>> believe it is important to consider the SLA requirements of different
>> tenants when decid

Re: Use Cassandra to store 2 million records of persons

2011-01-20 Thread David G. Boney
I don't think the below statement accurately describes data mining or using 
Cassandra for data mining. All the techniques I am familiar with for either 
data mining or machine learning, of which data mining is a subset, make one or 
more sequential scans through the data to abstract statistics or build models. 
The question is how well does Cassandra perform with sequential scans through 
the data? The Hadoop model works very well for many machine learning problems 
because it is oriented toward sequential scans through the data. The speed of 
the Hadoop interface to Cassandra would have a lot of bearing on the 
application of Cassandra to data mining or machine learning problems.

-
Sincerely,
David G. Boney
dbon...@semanticartifacts.com
http://www.semanticartifacts.com




On Jan 20, 2011, at 6:35 AM, David Boxenhorn wrote:

> Cassandra is not a good solution for data mining type problems, since it 
> doesn't have ad-hoc queries. Cassandra is designed to maximize throughput, 
> which is not usually a problem for data mining. 
> 
> On Thu, Jan 20, 2011 at 2:07 PM, Surender Singh  wrote:
> Hi All
> 
> I want to use Apache Cassandra to store information (like first name, last
> name, gender, address)  about 2 million people.  Then need to perform
> analytic and reporting on that data.
> Do I need to store information about 2 million people in MySQL and then
> transfer that information into Cassandra?
> 
> Please help me as I am new to Apache Cassandra.
> 
> if you have any use case like that, please share.
> 
> Thanks and regards
> Surender Singh
> 
> 



Re: Cassandra automatic startup script on ubuntu

2011-01-20 Thread Donal Zang

On 20/01/2011 17:51, Sébastien Druon wrote:

Hello!

I am using cassandra on a ubuntu machine and installed it from the 
binary found on the cassandra home page.

However, I did not find any scripts to start it up at boot time.

Where can I find this kind of script?

Thanks a lot in advance

Sebastien

Hi, this is what I do; you can add the watchdog to rc.local:

#!/bin/bash
#
# Check every $INTERVAL seconds whether cassandra is working well,
# and restart it if necessary.
# by donal 2010-01-11
#
PORT=9160
INTERVAL=2
CASSANDRA=/opt/cassandra
check() {
    # Is anything listening on the given port?
    netstat -tln | grep LISTEN | grep -q ":$1"
    if [ $? != 0 ]; then
        echo "restarting cassandra"
        $CASSANDRA/bin/stop-server
        sleep 1
        $CASSANDRA/bin/start-server
    fi
}
while true; do
    check $PORT
    sleep $INTERVAL
done



Configurability of the implementation of the Cassandra.Iface

2011-01-20 Thread indika kumara
Hi all,

Would it be worthwhile to make the implementation of Cassandra.Iface
configurable? I need to intercept requests to the Cassandra server
without modifying the existing code (CassandraServer.java), so the
server-side implementation of Cassandra.Iface (CassandraServer) needs to
be replaceable by a custom implementation. Suggestions are welcome!

Thanks,

Indika
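The interception Indika describes is essentially the delegation pattern: put a thin facade in front of the real server object, run a hook, then forward the call unchanged. The sketch below illustrates the idea in Python with invented names; Cassandra's actual Cassandra.Iface is a Java/Thrift interface, so a real implementation would wrap CassandraServer the same way (or use a java.lang.reflect dynamic proxy).

```python
class InterceptingFacade:
    """Forwards every method call to a delegate after running an
    interceptor hook (e.g. tenant-namespace rewriting or auditing).

    This mirrors standing a custom Cassandra.Iface implementation in
    front of CassandraServer; all names here are illustrative, not
    Cassandra's actual API.
    """

    def __init__(self, delegate, interceptor):
        self._delegate = delegate
        self._interceptor = interceptor

    def __getattr__(self, name):
        target = getattr(self._delegate, name)

        def wrapped(*args, **kwargs):
            self._interceptor(name, args, kwargs)  # pre-call hook
            return target(*args, **kwargs)         # forward unchanged
        return wrapped
```

Because the facade exposes the same surface as the delegate, callers need no changes — which is the point of making the Iface implementation pluggable rather than patching CassandraServer itself.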


Re: Compression in Cassandra

2011-01-20 Thread Stu Hood
Also note that an improved and compressible file format has been in the
works for a while now.

https://issues.apache.org/jira/browse/CASSANDRA-674

I am endlessly optimistic that it will make it into the 'next' version; in
particular, the current hope is 0.8
 On Jan 20, 2011 6:34 AM, "Terje Marthinussen" 
wrote:
> Perfectly normal with a 3-7x increase in data size, depending on your data
schema.
>
> Regards,
> Terje
>
> On 20 Jan 2011, at 23:17, "akshatbakli...@gmail.com" <
akshatbakli...@gmail.com> wrote:
>
>> I just did a du -h DataDump which showed 40G
>> and du -h CassandraDataDump which showed 170G
>>
>> Am I doing something wrong?
>> Have you observed any compression in it?
>>
>> On Thu, Jan 20, 2011 at 6:57 PM, Javier Canillas <
javier.canil...@gmail.com> wrote:
>> How do you calculate your 40g data? When you insert it into Cassandra,
you need to convert the data into a Byte[], maybe your problem is there.
>>
>>
>> On Thu, Jan 20, 2011 at 10:02 AM, akshatbakli...@gmail.com <
akshatbakli...@gmail.com> wrote:
>> Hi all,
>>
>> I am experiencing a unique situation. I loaded some data onto Cassandra.
>> my data was about 40 GB but when loaded to Cassandra the data directory
size is almost 170GB.
>>
>> This means the **data got inflated**.
>>
>> Is it the case just with me, or is someone else also facing this inflation,
or is it the general behavior of Cassandra?
>>
>> I am using Cassandra 0.6.8. on Ubuntu 10.10
>>
>> --
>> Akshat Bakliwal
>> Search Information and Extraction Lab
>> IIIT-Hyderabad
>> 09963885762
>> WebPage
>>
>>
>>
>>
>>
>> --
>> Akshat Bakliwal
>> Search Information and Extraction Lab
>> IIIT-Hyderabad
>> 09963885762
>> WebPage
>>


Re: Cassandra automatic startup script on ubuntu

2011-01-20 Thread Dave Viner
You can also use the apt-get repository version, which installs the startup
script.  On http://wiki.apache.org/cassandra/CloudConfig, see the Cassandra
Basic Setup section.  It applies to any debian based machine, not just cloud
instances.

HTH
Dave Viner

On Thu, Jan 20, 2011 at 9:11 AM, Donal Zang  wrote:

>  On 20/01/2011 17:51, Sébastien Druon wrote:
>
> Hello!
>
>  I am using cassandra on a ubuntu machine and installed it from the binary
> found on the cassandra home page.
> However, I did not find any scripts to start it up at boot time.
>
>  Where can I find this kind of script?
>
>  Thanks a lot in advance
>
>  Sebastien
>
> Hi, this is what I do, you can add the watchdog to rc.local
> #!/bin/bash
> #
> # This script is to check every $INTERVAL seconds to see
> # whether cassandra is working well
> # and restart it if necessary
> # by donal 2010-01-11
> #
> PORT=9160
> INTERVAL=2
> CASSANDRA=/opt/cassandra
> check() {
> netstat -tln|grep LISTEN|grep :$1
> if [ $? != 0 ]; then
> echo "restarting cassandra"
> $CASSANDRA/bin/stop-server
> sleep 1
> $CASSANDRA/bin/start-server
> fi
> }
> while true
>   do check $PORT
>   sleep $INTERVAL
> done
>
>


Re: Lost MUTATIONS on several Cassandra nodes - no impact on the client

2011-01-20 Thread Jonathan Ellis
On Thu, Jan 20, 2011 at 10:47 AM, Oleg Proudnikov  wrote:
> Q1. Is it possible that Cassandra will drop both replicas for a given column
> during these losses? Or does it guarantee that one replica is still written?

It guarantees that if the requested ConsistencyLevel is not achieved,
client will get a TimedOutException, which is a signal you need to add
capacity to handle what you are throwing at the cluster.

> Q2. What does the lack of client exception mean? Does it tell me that at least
> one replica is written?

As above.

> Q3. If I were to use CL=ALL, would I get an exception(s) on the client(s) for
> those losses?

Yes.

> Q4. Considering that I did not get an exception I will assume that one replica
> is retained. Now, if the nodes stay up and the load on the cluster goes down,
> will Cassandra attempt to create 2nd replica? Or will the 2nd replica be 
> created
> on a read? Is there a way to recreate lost replicas in batch mode?

http://wiki.apache.org/cassandra/Operations#Repairing_missing_or_inconsistent_data

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


RE: Under expectation response time for reads

2011-01-20 Thread George Ciubotaru
Hi Miguel,



This indeed solved the problem. The response times are now under 1 ms, which is
great.



Thank you once again,

George

From: Miguel Verde [mailto:miguelitov...@gmail.com]
Sent: 20 January 2011 16:46
To: user@cassandra.apache.org
Subject: Re: Under expectation response time for reads

Disable Nagle's algorithm and you should see much better performance. It does 
not come into play on the loopback interface, which is why your local reads are fast.
http://markmail.org/message/rgauuflglwemm24o
On Thu, Jan 20, 2011 at 6:24 AM, George Ciubotaru 
mailto:george.ciubot...@weedle.com>> wrote:
Hello,

We are in the process of evaluating Cassandra to be used with our product; I've 
started with some performance tests but unfortunately I'm getting very bad 
results for read operations (around 200 ms per read request which is much much 
more than what I'm reading that Cassandra can deliver).

- I'm using the latest stable Cassandra binaries (Cassandra 0.7) on Windows
- My cluster has 3 nodes (on 3 separate machines), only one seed node and 
replication factor of 1
- I've use batch_mutate to insert around 50,000 keys with an average of 60 
columns per key (no super-column)
- I'm using C# client
- The read operation I've tested was: for a random key get all its columns 
(using get_slice)

I have 2 types of results:
- as expected (very fast, around 1 ms per read request) when the client is 
running on one of the 3 machines and is connected with the local machine
- under expectation (200 ms per request) when the client is running on one of 
the 3 machines but is connected to one of the other 2 machines (except local 
machine).

It might be configuration issue but I cannot figure it out.

Any suggestion?

Thank you,
George



Re: Document Mapper for Ruby?

2011-01-20 Thread Ryan King
Not sure what you mean by document mapper, but CassandraObject might
fit the bill: https://github.com/nzkoz/cassandra_object

-ryan

On Wed, Jan 19, 2011 at 11:03 PM, Joshua Partogi  wrote:
> Hi all,
>
> Is anyone aware of a document mapper for Ruby similar to MongoMapper?
>
> Thanks heaps for your help.
>
> Kind regards,
> Joshua.
> --
> http://twitter.com/jpartogi
>


Re: Cassandra automatic startup script on ubuntu

2011-01-20 Thread Clint Byrum
On Thu, 2011-01-20 at 17:51 +0100, Sébastien Druon wrote:
> Hello!
> 
> 
> I am using cassandra on a ubuntu machine and installed it from the
> binary found on the cassandra home page.
> However, I did not find any scripts to start it up at boot time.
> 
> 
> Where can I find this kind of script?
> 

The debs produced by Eric Evans, and the others that I've been building,
have an init.d script in them.

You can find Eric's debs here:

http://wiki.apache.org/cassandra/DebianPackaging

Or the ones I build and test on Ubuntu releases here:

https://launchpad.net/~cassandra-ubuntu/+archive/stable

I just uploaded 0.7.0, so it will still have 0.6.8 in it while the
builds finish and the debs are published.





Cassandra Ubuntu PPA stable release updated to 0.7.0

2011-01-20 Thread Clint Byrum
For anybody using the cassandra ubuntu stable release PPA, it is being
updated right now to 0.7.0.

This is just a heads up. I'd expect anybody using it to still use all
best practices from the cassandra documentation for upgrades, and not
just blindly apt-get upgrade. But either way, this is a big change and
worth noting.

The 0.7.0 packages are, IMO, of much higher quality than the 0.6
packages. In addition to using system versions of java libraries where
possible (allowing for updates/security fixes from the ubuntu teams to
be applied), it runs the unit tests on the build, so there is at least a
good chance of everything working.

Recently Launchpad has started publishing stats for PPAs, and just as an
FYI, there seem to be about 25 machines regularly subscribing to the
PPA. I'd be interested in hearing any feedback from people who are using
it, or who have chosen not to. Thanks!



Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Peter Schuller
> Is it supposed to work that way, or have I missed something ?

I don't see that you did anything wrong based on your description and
based on my understanding how it works in 0.7 (not sure about 0.6),
but hopefully someone else can address that part. What I can think of
- did you inspect the log on the new node? Does it say anything about
bootstraping or streaming data from other nodes? Does 'nodetool ring'
indicate it considers itself completely up and in the cluster already?

I am trying to determine whether the node in fact considers itself done
bootstrapping and joined to the ring, yet contains no data.

> I tried then to put values for initialToken for both nodes (stopping and
> restartings the servers), but it didn't change anything : I have the same
> token values...

This is expected. Once the node has bootstrapped into the cluster and
saved its token, it will no longer try to acquire a new one. Any
initial token in the configuration is ignored; it is only the
*initial* token, quite literally. Changing the token would require a
'nodetool move' command.

-- 
/ Peter Schuller


Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-20 Thread Jean-Yves LEBLEU
Java + Pelops
Cassandra 0.6.8


Re: Distributed counters

2011-01-20 Thread Kelvin Kakugawa
Hi Rustam,

All of our large production clusters are still on 0.6.6.

However, we have an 0.7 branch, here:
https://github.com/kakugawa/cassandra/tree/twttr-cassandra-0.7-counts

that is our migration target.  It passes our internal distributed tests and
will be in production soon.

-Kelvin

On Thu, Jan 20, 2011 at 8:24 AM, Nate McCall  wrote:

> On the Hector side, we will be adding this to trunk (and thus moving
> Hector trunk to Cassandra 0.8.x) in the next week or two.
>
> On Wed, Jan 19, 2011 at 6:12 PM, Rustam Aliyev  wrote:
> > Hi,
> >
> > Does anyone use CASSANDRA-1072 counters patch with 0.7 stable branch? I
> need
> > this functionality but can't wait until 0.8.
> >
> > Also, does Hector trunk version has any support for these counters? (this
> > question is probably for hector-users group, but most of us anyway here).
> >
> > Many thanks,
> > Rustam Aliyev.
> >
> >
>


Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Eric Gilmore
Patrick, if you try adding capacity again from the beginning, I'd be curious
to hear whether the DataStax/Riptano docs are helpful or not.

Also, in the Getting Started page, we note that it may be best to set
initial_token to 0 on the very first node that you start.

Regards,

Eric Gilmore

On Thu, Jan 20, 2011 at 11:05 AM, Peter Schuller <
peter.schul...@infidyne.com> wrote:

> > Is it supposed to work that way, or have I missed something ?
>
> I don't see that you did anything wrong based on your description and
> based on my understanding how it works in 0.7 (not sure about 0.6),
> but hopefully someone else can address that part. What I can think of
> - did you inspect the log on the new node? Does it say anything about
> bootstraping or streaming data from other nodes? Does 'nodetool ring'
> indicate it considers itself completely up and in the cluster already?
>
> I am trying to determine whether the node in fact considers itself done
> bootstrapping and joined to the ring, yet contains no data.
>
> > I tried then to put values for initialToken for both nodes (stopping and
> > restartings the servers), but it didn't change anything : I have the same
> > token values...
>
> This is expected. Once the node has bootstrapped into the cluster and
> saved its token, it will no longer try to acquire a new one. Any
> initial token in the configuration is ignored; it is only the
> *initial* token, quite literally. Changing the token would require a
> 'nodetool move' command.
>
> --
> / Peter Schuller
>



-- 
*Eric Gilmore
*
Consulting Technical Writer
Riptano, Inc.
Ph: 510 684 9786  (cell)


Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Robert Coli
On Thu, Jan 20, 2011 at 11:55 AM, Eric Gilmore  wrote:
> Also, in the Getting Started page, we note that it may be best to set
> initial_token to 0 on the very first node that you start.

Could you expand a bit on the reasons for and implications of this,
for our collective elucidation? :)

=Rob


UnserializableColumnFamilyException: Couldn't find cfId

2011-01-20 Thread Oleg Proudnikov
Hi All,

Could you please help me understand the impact on my data?

I am running a 6 node 0.7-rc4 Cassandra cluster with RF=2. Schema was defined
when the cluster was created and did not change. I am doing batch load with
CL=ONE. The cluster is under some stress in memory and I/O. Each node has 1G
heap. CPU is around 10% but the latency is high. 

I saw this exception on 2 out of 6 nodes in a relatively short window of time. 
Hector clients received no exception and the nodes continued running. The
exception has not happened since even though the load is continuing. 
I do get an occasional OOM and I am adjusting thresholds and other 
settings as I go. I also doubled RAM to 2G since the exception.

Here is the exception - the same stack trace in all cases.
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
 at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
 at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:385)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:395)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:353)
 at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
 at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)


It refers to two cfIds - cfId=1004 and cfId=1013. Mutation stages are always
different even for the exceptions appearing within the same millisecond.
As you can see below, cfId=1004 appears on both nodes several times, but at
different times, while cfId=1013 appears only once on one node.

It happened as a group within one second on one node and in 5 groups spread
across 45 minutes on another node. I left the first log entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log


xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004


xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1013



Q. What does this mean for the consistency? Am I still within my guarantee of
CL=ONE? 



NOTE: I experienced similar exceptions in 0.7-rc2 but at that time cfIds looked
corrupted. They were random/negative and these exceptions 
were followed by an OOM with an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg





Cassandra on iSCSI?

2011-01-20 Thread Mick Semb Wever
Does anyone have any experiences with Cassandra on iSCSI?

I'm currently testing a (soon-to-be) production server using both local
raid-5 and iSCSI disks. Our hosting provider is pushing us hard towards
the iSCSI disks because they are easier for them to run (and to meet our
needs for increasing disk capacity over time).

I'm worried that iSCSI is a non-scalable solution for an otherwise
scalable application (all cassandra nodes will have separate partitions
to the one iSCSI).

To go with raid-5 disks our hosting provider requires proof that iSCSI
won't work. I tried various things (eg `nodetool cleanup` on 12Gb load
giving 5k IOPS) but iSCSI seems to keep up to the performance of the
local raid-5 disks...

Should I be worried about using iSCSI?
Are there better tests I should be running?

~mck

-- 
"The turtle only makes progress when it's neck is stuck out" Rollo May 
| http://semb.wever.org | http://sesat.no
| http://finn.no   | Java XSS Filter




Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-20 Thread Jonathan Shook
clients:
 Java and MVEL + Hector
 Perl + thrift

Usage: high-traffic monitoring harness with dynamic mapping and
loading of handlers
Cassandra was part of the "do more with less hardware" approach to
designing this system.


On Fri, Jan 14, 2011 at 11:24 AM, Ertio Lew  wrote:
> Hey,
>
> If you have a site in production environment or considering so, what
> is the client that you use to interact with Cassandra. I know that
> there are several clients available out there according to the
> language you use but I would love to know what clients are being used
> widely in production environments and are best to work with(support
> most required features for performance).
>
> Also preferably tell about the technology stack for your applications.
>
> Any suggestions, comments appreciated ?
>
> Thanks
> Ertio
>


Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Brandon Williams
On Thu, Jan 20, 2011 at 2:14 PM, Robert Coli  wrote:

> On Thu, Jan 20, 2011 at 11:55 AM, Eric Gilmore  wrote:
> > Also, in the Getting Started page, we note that it may be best to set
> > initial_token to 0 on the very first node that you start.
>
> Could you expand a bit on the reasons for and implications of this,
> for our collective elucidation? :)
>

Because then the node never has to move.  Same would be true of 2**127, but
zero is mnemonically easier. :)

-Brandon
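For context, RandomPartitioner tokens in 0.6/0.7 live in the range 0 to 2**127, and a balanced cluster spaces initial tokens evenly through that range; picking 0 for the first node is just the mnemonic anchor Brandon describes. A minimal sketch of the usual spacing arithmetic:

```python
RING = 2 ** 127  # RandomPartitioner token space in Cassandra 0.6/0.7

def balanced_tokens(node_count):
    """Evenly spaced initial_token values for a balanced ring.

    The first node gets token 0, so it never needs a 'nodetool move'
    as the cluster grows.
    """
    return [i * RING // node_count for i in range(node_count)]
```

For example, a 4-node cluster gets tokens 0, 2**125, 2**126, and 3 * 2**125; existing nodes only need to move when the new node count does not evenly subdivide the old spacing.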


Re: Upgrading from 0.6 to 0.7.0

2011-01-20 Thread Aaron Morton
I'm not sure if you're suggesting running a mixed-mode cluster there, but AFAIK 
the changes to the internode protocol prohibit this. The nodes will probably 
see each other via gossip, but the way the messages define their purpose 
(their verb handler) has been changed.

Out of interest which is more painful, stopping the cluster and upgrading it or 
upgrading your client code?

Aaron

On 21/01/2011, at 12:35 AM, Daniel Josefsson  wrote:

> In our case our replication factor is more than half the number of nodes in 
> the cluster.
> 
> Would it be possible to do the following:
> Upgrade half of them
> Change Thrift Port and inter-server port (is this the storage_port?)
> Start them up
> Upgrade clients one by one
> Upgrade the the rest of the servers
> Or might we get some kind of data collision when still writing to the old 
> cluster as the new storage is being used?
> 
> /Daniel
> 


Re: Embedded Cassandra server startup question

2011-01-20 Thread Aaron Morton
Do you have a full error stack?

That error is raised when the schema is added to an internal static map. There 
is a lot of static state so it's probably going to make your life easier if you 
can avoid reusing the JVM.

I'm guessing your error comes from AbstractCassandraDaemon.setup() calling 
DatabaseDescriptor.loadSchemas(). It may be possible to work around this 
issue, but I don't have time today. Let me know how you get on.

Aaron


On 21/01/2011, at 12:46 AM, Roshan Dawrani  wrote:

> Hi,
> 
> I am using Cassandra for a Grails application and in that I start the 
> embedded server when the Spring application context gets built.
> 
> When I run my Grails app test suite - it first runs the integration and then 
> functional test suite, and it builds the application context individually for 
> each phase.
> 
> When it brings up the embedded Cassandra server in the 2nd phase (for 
> functional tests), it fails saying "Attempt to assign id to existing column 
> family."
> 
> Anyone familiar with this error? Is it because both the test phases are 
> executed in the same JVM instance and there is some Cassandra meta-data from 
> phase 1 server start that is affecting the server startup in 2nd phase?
> 
> Any way I can cleanly start the server 2 times in my case? Any other 
> suggestion? Thanks.
> 
> -- 
> Roshan
> Blog: http://roshandawrani.wordpress.com/
> Twitter: @roshandawrani
> Skype: roshandawrani
> 


Re: memory size and disk size prediction tool

2011-01-20 Thread Aaron Morton
Not that I know of - do you have an existing test system you can use as a baseline?

For memory, have a read of the JVM Heap Size section here: http://wiki.apache.org/cassandra/MemtableThresholds
You will also want to leave some memory for disk caching and the OS. 8 or 12 GB feels like a good start.

For disk capacity I just did some regular old guesswork, and multiplied my number by 1.25 to cover the on-disk overhead. You also want to avoid using more than 50% of the local disk space, due to compaction and the way disk performance falls away. There is more info available here: http://wiki.apache.org/cassandra/CassandraHardware

How much throughput do you need? How much redundancy do you need? How much data do you plan to store?

Hope that helps,

Aaron

On 21 Jan, 2011, at 05:04 AM, Mimi Aluminium wrote:

Hi,
We are implementing a 'middleware' layer on top of an underlying storage system and need to estimate costs for various system configurations. Specifically, I want to estimate the resources (memory, disk) for our data model.

Is there a tool that, given certain storage configuration parameters (column family field counts and sizes, and other details) and workload-dependent parameters such as average read/write rates, can predict the resource consumption (i.e., memory, disk) in an offline mode?

Thanks,
Miriam
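Aaron's rules of thumb above (multiply by the replication factor, add roughly 25% on-disk overhead, and stay under 50% disk utilization) can be turned into a quick back-of-the-envelope estimate. This is only an illustrative sketch - the function and its parameters are made up for this example and are not part of any Cassandra tool:

```python
def estimated_disk_per_node(raw_data_gb, replication_factor, nodes,
                            overhead=1.25, max_utilization=0.5):
    """Rough per-node disk requirement: replicated data, plus ~25%
    on-disk overhead, sized so usage stays under 50% of local disk
    to leave headroom for major compactions."""
    stored_per_node = raw_data_gb * replication_factor * overhead / nodes
    return stored_per_node / max_utilization

# 100 GB of raw data at RF=3 on 6 nodes stores ~62.5 GB per node,
# so each node should have at least 125 GB of local disk.
print(estimated_disk_per_node(100, 3, 6))  # 125.0
```

The real numbers depend on column overhead and compaction strategy, so treat the output as a lower bound and validate against a test system.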


Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Eric Gilmore
 Sorry, my comments were indeed a little short on elucidation.  :)

The cited doc suggests that setting initial_token to 0 on the first node
"simplifies load balancing as you later expand the cluster . . . .  If this
is unset (the default), Cassandra picks a token number randomly."

A more complete explanation might look something like:

. . . it is recommended to set the initial token's value to zero.  This
simplifies load balancing as you later expand the cluster, since the node
starting at 0 will never need to be moved to a new token.  Also, if this is
unset (the default), Cassandra picks a token number randomly, which can lead
to hot spots in the ring.
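The same logic extends to the rest of the ring: with the RandomPartitioner's token space of 0 to 2**127, evenly spaced initial tokens keep the load balanced, and the node at token 0 never needs to move. A small illustrative sketch (not an official tool):

```python
def balanced_tokens(node_count, ring_size=2**127):
    """Evenly spaced initial_token values for the RandomPartitioner.
    The first node gets token 0, so it never has to be moved as the
    cluster grows."""
    return [i * ring_size // node_count for i in range(node_count)]

# For a 4-node cluster the tokens are 0, 2**125, 2**126 and 3 * 2**125.
print(balanced_tokens(4))
```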


On Thu, Jan 20, 2011 at 12:59 PM, Brandon Williams  wrote:

> On Thu, Jan 20, 2011 at 2:14 PM, Robert Coli  wrote:
>
>> On Thu, Jan 20, 2011 at 11:55 AM, Eric Gilmore  wrote:
>> > Also, in the Getting Started page, we note that it may be best to set
>> > initial_token to 0 on the very first node that you start.
>>
>> Could you expand a bit on the reasons for and implications of this,
>> for our collective elucidation? :)
>>
>
> Because then the node never has to move.  Same would be true of 2**127, but
> zero is mnemonically easier. :)
>
> -Brandon
>



-- 
Eric Gilmore
Consulting Technical Writer
Riptano, Inc.
Ph: 510 684 9786  (cell)


RE: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Jeremiah Jordan
Is 0 really OK?  I remember some bugs coming up recently with a token of
0.  I was thinking about moving my 0 token servers to 1 because of that.

 

-Jeremiah Jordan

 



From: Eric Gilmore [mailto:e...@riptano.com] 
Sent: Thursday, January 20, 2011 1:55 PM
To: user@cassandra.apache.org
Subject: Re: How does Bootstrapping work in 0.7 ??

 

Patrick, if you try adding capacity again from the beginning, I'd be
curious to hear if the DataStax/Riptano docs are helpful or not.

Also, in the Getting Started page, we note that it may be best to set
initial_token to 0 on the very first node that you start.

Regards,

Eric Gilmore

On Thu, Jan 20, 2011 at 11:05 AM, Peter Schuller
 wrote:

> Is it supposed to work that way, or have I missed something ?

I don't see that you did anything wrong based on your description and
based on my understanding how it works in 0.7 (not sure about 0.6),
but hopefully someone else can address that part. What I can think of
- did you inspect the log on the new node? Does it say anything about
bootstrapping or streaming data from other nodes? Does 'nodetool ring'
indicate it considers itself completely up and in the cluster already?

Trying to determine whether the node is in fact considering itself
done bootstrapping and joined to the ring, yet containing no data.


> I tried then to put values for initialToken for both nodes (stopping and
> restarting the servers), but it didn't change anything: I have the same
> token values...

This is expected. Once the node has bootstrapped into the cluster and
saved its token, it will no longer try to acquire a new one. Any
initial token in the configuration is ignored; it is only the
*initial* token, quite literally. Changing the token would require a
'nodetool move' command.

--
/ Peter Schuller




-- 

Eric Gilmore

Consulting Technical Writer

Riptano, Inc.

Ph: 510 684 9786  (cell)

 



Re: UnserializableColumnFamilyException: Couldn't find cfId

2011-01-20 Thread Aaron Morton
Sounds like there are multiple versions of your schema around the cluster. What client API are you using? Does it support the describe_schema_versions() function? That will tell you how many versions there are.

The easy solution here is to scrub the data and start a new 0.7 cluster using the release version. If possible you should not use data created in the non-release versions once you get to production.

Hope that helps.

Aaron

On 21 Jan, 2011, at 09:15 AM, Oleg Proudnikov wrote:

Hi All,

Could you please help me understand the impact on my data?

I am running a 6 node 0.7-rc4 Cassandra cluster with RF=2. Schema was defined
when the cluster was created and did not change. I am doing batch load with
CL=ONE. The cluster is under some stress in memory and I/O. Each node has 1G
heap. CPU is around 10% but the latency is high. 

I saw this exception on 2 out of 6 nodes in a relatively short window of time. 
Hector clients received no exception and the nodes continued running. The
exception has not happened since even though the load is continuing. 
I do get an occasional OOM and I am adjusting thresholds and other 
settings as I go. I also doubled RAM to 2G since the exception.

Here is the exception - the same stack trace in all cases.
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
 at org.apache.cassandra.db.ColumnFamilySerializer.deserialize
(ColumnFamilySerializer.java:117)
 at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps
(RowMutation.java:385)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize
 (RowMutation.java:395)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize
 (RowMutation.java:353)
 at org.apache.cassandra.db.RowMutationVerbHandler.doVerb
(RowMutationVerbHandler.java:52)
 at org.apache.cassandra.net.MessageDeliveryTask.run
 (MessageDeliveryTask.java:63)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)


It refers to two cfIds - cfId=1004 and cfId=1013. Mutation stages are always
different even for the exceptions appearing within the same millisecond.
As you can see below, cfId=1004 appears on both nodes several times but at
different times, while cfId=1013 appears only once on one node.

It happened as a group within one second on one node and in 5 groups spread
across 45 minutes on another node. I left the first log entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log


xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004


xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: 
Couldn't find cfId=1013



Q. What does this mean for the consistency? Am I still within my guarantee of
CL=ONE? 



NOTE: I experienced similar exceptions in 0.7-rc2 but at that time cfIds looked
corrupted. They were random/negative and these exceptions 
were followed by an OOM with an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg





Re: Embedded Cassandra server startup question

2011-01-20 Thread Anand Somani
Here is what worked for me, I use TestNG, and initialize and create schema in
the @BeforeClass for each test

   - In the @AfterClass, I had to drop schema, otherwise I was getting the
   same exception.
   - After this I started getting port conflict with the second test, so I
   added my own version of EmbeddedCass.. class, added a stop which calls a
    stop on the cassandradaemon (which from code comments seems to close the
   thrift port)


On Thu, Jan 20, 2011 at 1:32 PM, Aaron Morton wrote:

> Do you have a full error stack?
>
> That error is raised when the schema is added to an internal static map.
> There is a lot of static state so it's probably going to make your life
> easier if you can avoid reusing the JVM.
>
> I'm guessing your error comes from AbstractCassandraDaemon.setup() calling
> DatabaseDescriptor.loadSchemas() . It may be possible to work around this
> issue, but I don't have time today. Let me know how you get on.
>
> Aaron
>
>
> On 21/01/2011, at 12:46 AM, Roshan Dawrani 
> wrote:
>
> Hi,
>
> I am using Cassandra for a Grails application and in that I start the
> embedded server when the Spring application context gets built.
>
> When I run my Grails app test suite - it first runs the integration and
> then functional test suite, and it builds the application context individually
> for each phase.
>
> When it brings up the embedded Cassandra server in the 2nd phase (for
> functional tests), it fails saying "*Attempt to assign id to existing
> column family.*"
>
> Anyone familiar with this error? Is it because both the test phases are
> executed in the same JVM instance and there is some Cassandra meta-data from
> phase 1 server start that is affecting the server startup in 2nd phase?
>
> Any way I can cleanly start the server 2 times in my case? Any other
> suggestion? Thanks.
>
> --
> Roshan
> Blog: 
> http://roshandawrani.wordpress.com/
> Twitter: @roshandawrani 
> Skype: roshandawrani
>
>


Re: How does Bootstrapping work in 0.7 ??

2011-01-20 Thread Jonathan Ellis
It's okay as of 0.6.10 and 0.7.0.

But the bug only affected range queries, and you'd know if you'd hit
it because there would be really obvious exception messages in your
log.

In other words it's probably not necessary to move your nodes.

On Thu, Jan 20, 2011 at 4:30 PM, Jeremiah Jordan
 wrote:
> Is 0 really OK?  I remember some bugs coming up recently with a token of 0.
> I was thinking about moving my 0 token servers to 1 because of that.
>
>
>
> -Jeremiah Jordan
>
>
>
> 
>
> From: Eric Gilmore [mailto:e...@riptano.com]
> Sent: Thursday, January 20, 2011 1:55 PM
> To: user@cassandra.apache.org
> Subject: Re: How does Bootstrapping work in 0.7 ??
>
>
>
> Patrick, if you try adding capacity again from the beginning, I'd be curious
> to hear if the DataStax/Riptano docs are helpful or not.
>
> Also, in the Getting Started page, we note that it may be best to set
> initial_token to 0 on the very first node that you start.
>
> Regards,
>
> Eric Gilmore
>
> On Thu, Jan 20, 2011 at 11:05 AM, Peter Schuller
>  wrote:
>
>> Is it supposed to work that way, or have I missed something ?
>
> I don't see that you did anything wrong based on your description and
> based on my understanding how it works in 0.7 (not sure about 0.6),
> but hopefully someone else can address that part. What I can think of
> - did you inspect the log on the new node? Does it say anything about
> bootstrapping or streaming data from other nodes? Does 'nodetool ring'
> indicate it considers itself completely up and in the cluster already?
>
> Trying to determine whether the node is in fact considering itself
> done bootstrapping and joined to the ring, yet containing no data.
>
>> I tried then to put values for initialToken for both nodes (stopping and
> restarting the servers), but it didn't change anything: I have the same
>> token values...
>
> This is expected. Once the node has bootstrapped into the cluster and
> saved its token, it will no longer try to acquire a new one. Any
> initial token in the configuration is ignored; it is only the
> *initial* token, quite literally. Changing the token would require a
> 'nodetool move' command.
>
> --
> / Peter Schuller
>
>
> --
>
> Eric Gilmore
>
> Consulting Technical Writer
>
> Riptano, Inc.
>
> Ph: 510 684 9786  (cell)
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Does Major Compaction work on dropped CFs? Doesn't seem so.

2011-01-20 Thread buddhasystem

Greetings,

I just used the nodetool to force a major compaction on my cluster. It seems
like the cfs currently in service were indeed compacted, while the old test
materials (which I dropped from CLI) were still there as tombstones.

Is that the expected behavior? Hmm...

TIA.

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Does-Major-Compaction-work-on-dropped-CFs-Doesn-t-seem-so-tp5946031p5946031.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Does Major Compaction work on dropped CFs? Doesn't seem so.

2011-01-20 Thread Aaron Morton
I think the abandoned sstables resulting from dropping a CF are handled the same as SSTables left over after compaction. They are deleted as part of a full GC. See the section on Compaction here: http://wiki.apache.org/cassandra/MemtableSSTable

You can trigger GC via JConsole.

Hope that helps,

Aaron

On 21 Jan, 2011, at 01:42 PM, buddhasystem wrote:
Greetings,

I just used the nodetool to force a major compaction on my cluster. It seems
like the cfs currently in service were indeed compacted, while the old test
materials (which I dropped from CLI) were still there as tombstones.

Is that the expected behavior? Hmm...

TIA.

-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Does-Major-Compaction-work-on-dropped-CFs-Doesn-t-seem-so-tp5946031p5946031.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.


Re: Document Mapper for Ruby?

2011-01-20 Thread Joshua Partogi
Thanks Ryan.

This is what I am looking for. Let me try it out.



On Fri, Jan 21, 2011 at 4:58 AM, Ryan King  wrote:

> Not sure what you mean by document mapper, but CassandraObject might
> fit the bill: https://github.com/nzkoz/cassandra_object
>
> -ryan
>
> On Wed, Jan 19, 2011 at 11:03 PM, Joshua Partogi 
> wrote:
> > Hi all,
> >
> > Is anyone aware of a document mapper for Ruby similar to MongoMapper?
> >
> > Thanks heaps for your help.
> >
> > Kind regards,
> > Joshua.
> > --
> > http://twitter.com/jpartogi
> >
>



-- 
http://twitter.com/jpartogi


Re: Does Major Compaction work on dropped CFs? Doesn't seem so.

2011-01-20 Thread buddhasystem

Thanks!

What's strange anyhow is that the GC period for these cfs expired some days
ago. I thought that a compaction would take care of these tombstones. I used
nodetool to "compact".

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Does-Major-Compaction-work-on-dropped-CFs-Doesn-t-seem-so-tp5946031p5946231.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 3:02 AM, Aaron Morton wrote:

> Do you have a full error stack?
>
> That error is raised when the schema is added to an internal static map.
> There is a lot of static state so it's probably going to make your life
> easier if you can avoid reusing the JVM.
>
>
Hi Aaron,

Actually it is not my primary requirement to start the Embedded server twice
in the same JVM. The requirement is to have the empty column families before
each test so that changes made in tests do not affect each other.

Keeping a single instance of the embedded server up across test phases, what
would be the most efficient way to clean-up the CFs between tests?

I have around 10 CFs and not too much data is generated in each test, so
right now, I collect all keys from CFs and then fire a batch query to delete
them.

Can I improve on that clean-up process between tests?

Im guessing your errors comes from AbstractCassandraDaemon.setup() calling
> DatabaseDescriptor.loadSchemas() .
>

I start the embedded server using EmbeddedServerHelper@setup(). I am not
directly dealing with AbstractCassandraDaemon.setup(). I guess that all
happens inside EmbeddedServerHelper.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 5:14 AM, Anand Somani  wrote:

> Here is what worked for me, I use TestNG, and initialize and create schema
> in the @BeforeClass for each test
>
>- In the @AfterClass, I had to drop schema, otherwise I was getting the
>same exception.
>- After this I started getting port conflict with the second test, so I
>added my own version of EmbeddedCass.. class, added a stop which calls a
>stop on the cassandradaemon (which from code comments seems to close the
>thrift port)
>
> How was this clean-up experience, Anand? Shutting down the cassandra daemon
and dropping and creating schema between tests? Sounds like something that
could be time consuming.

I am currently firing all-deletes on all my CFs and am looking for more
efficient ways to have data cleaned-up between tests.

Thanks.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Aaron Morton
There is a truncate() function that will clear a CF. It may leave a snapshot around, cannot remember exactly.

Or you could drop and recreate the keyspace between tests using system_add_keyspace() and system_drop_keyspace(). The system tests in test/system/__init__.py sort of do this.

Aaron

On 21 Jan, 2011, at 03:16 PM, Roshan Dawrani wrote:

On Fri, Jan 21, 2011 at 3:02 AM, Aaron Morton wrote:

> Do you have a full error stack?
>
> That error is raised when the schema is added to an internal static map. There is a lot of static state so it's probably going to make your life easier if you can avoid reusing the JVM.

Hi Aaron,

Actually it is not my primary requirement to start the Embedded server twice in the same JVM. The requirement is to have empty column families before each test so that changes made in tests do not affect each other.

Keeping a single instance of the embedded server up across test phases, what would be the most efficient way to clean up the CFs between tests? I have around 10 CFs and not too much data is generated in each test, so right now I collect all keys from the CFs and then fire a batch query to delete them.

Can I improve on that clean-up process between tests?

> I'm guessing your error comes from AbstractCassandraDaemon.setup() calling DatabaseDescriptor.loadSchemas().

I start the embedded server using EmbeddedServerHelper.setup(). I am not directly dealing with AbstractCassandraDaemon.setup(). I guess that all happens inside EmbeddedServerHelper.



Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 8:07 AM, Aaron Morton wrote:

> There is a truncate() function that will clear a CF. It may leave a
> snapshot around, cannot remember exactly.
>
> Or you could drop and recreate the keyspace between tests using
> system_add_keyspace() and system_drop_keyspace(). The system tests in the
> test/system/__init__.py sort of do this.
>

Thanks Aaron. I will check out both options. If the existing system tests
there are adding / dropping keyspaces between tests, maybe it is not a very
expensive operation after all.

At the minimum, I can replace my clean-up with truncate() calls.

Thanks a lot.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 8:07 AM, Aaron Morton wrote:

> There is a truncate() function that will clear a CF. It may leave a
> snapshot around, cannot remember exactly.
>

Not sure if Hector (0.7.0-22) has added truncate() to its API yet. I can't
find it.

In Hector, I see a *dropColumnFamily()* that goes to Cassandra's *
system_drop_column_family()* call.

I am not sure how this system_drop_column_family() fares in comparison to
truncate() in terms of the time the clean-up would take.

I am new to Hector/Cass and all my exposure to Cass API has been through
Hector. So a basic question.

If Hector has not provided truncate() to its API, can I bypass it and make
the call to Cassandra API directly? Does Hector leave any opening for such
bypassed calls?

Thanks.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Maxim Potekhin

You can script the actions you need and pipe the file into Cassandra-CLI.
Works for me.

On 1/20/2011 10:18 PM, Roshan Dawrani wrote:
On Fri, Jan 21, 2011 at 8:07 AM, Aaron Morton wrote:


There is a truncate() function that will clear a CF. It may leave
a snapshot around, cannot remember exactly.


Not sure if Hector (0.7.0-22) has added truncate() to its API yet. I 
can't find it.


In Hector, I see a _dropColumnFamily()_ that goes to Cassandra's 
_system_drop_column_family()_ call.


I am not sure how this system_drop_column_family() fares in 
comparison to truncate() in terms of the time the clean-up would take.


I am new to Hector/Cass and all my exposure to Cass API has been 
through Hector. So a basic question.


If Hector has not provided truncate() to its API, can I bypass it and 
make the call to Cassandra API directly? Does Hector leave any opening 
for such bypassed calls?


Thanks.




Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 8:52 AM, Maxim Potekhin  wrote:

>  You can script the actions you need and pipe the file into Cassandra-CLI.
> Works for me.
>

Thanks Maxim, but my first preference will be to do it through the API and not
launch the Cassandra-CLI process with a scripted set of actions (I assume
that is what your suggestion meant).

truncate() may work best for me, if I can get it working through Hector API
that I already use.


Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
On Fri, Jan 21, 2011 at 8:56 AM, Roshan Dawrani wrote:

> On Fri, Jan 21, 2011 at 8:52 AM, Maxim Potekhin  wrote:
>
>>  You can script the actions you need and pipe the file into Cassandra-CLI.
>> Works for me.
>>
>
>
Probably CliMain / CliClient will help me there doing it as per your
suggestion.

Still would like to confirm if I cannot do it through Hector API at this
point of time, when there  is no direct Hector API call for truncate().
Anyway I can still reach Cassandra's truncate() call?

Thanks.


Re: Cassandra on iSCSI?

2011-01-20 Thread Jonathan Ellis
On Thu, Jan 20, 2011 at 2:13 PM, Mick Semb Wever  wrote:
> To go with raid-5 disks our hosting provider requires proof that iSCSI
> won't work. I tried various things (eg `nodetool cleanup` on 12Gb load
> giving 5k IOPS) but iSCSI seems to keep up to the performance of the
> local raid-5 disks...
>
> Should i be worried about using iSCSI?

It should work fine; the main reason to go with local storage is the
huge cost advantage.

Of course with a SAN you'd want RF=1 since it's replicating internally.

> Are there better tests i should be running?

I would test write scalability going from 1 machine, to half your
planned cluster size, to your full cluster size, or as close as is
feasible, using enough client machines running contrib/stress* (much
faster than contrib/py_stress) that you saturate it.

Writes should be CPU bound, so you expect those to scale roughly
linearly as you add Cassandra nodes.

Reads (once your data set can't be cached in RAM) will be i/o bound,
so I imagine with a SAN you'll be able to max that out at some number
of machines and adding more Cassandra nodes won't help.  What that
limit is depends on your SAN iops and how much of it is being consumed
by other applications.
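Jonathan's point about reads saturating the SAN can be framed as a rough capacity ceiling; the function and all the numbers below are purely illustrative assumptions, not measurements:

```python
def max_read_throughput(san_iops, other_apps_iops, iops_per_read):
    """Rough upper bound on cluster-wide reads/sec once the working set
    no longer fits in RAM: the SAN's IOPS budget, minus what other
    applications consume, divided by the disk ops each read costs.
    Adding Cassandra nodes beyond this ceiling will not help."""
    return (san_iops - other_apps_iops) // iops_per_read

# A 20k IOPS SAN with 5k IOPS used elsewhere and ~2 disk ops per read
# caps the cluster at roughly 7500 reads/sec.
print(max_read_throughput(20_000, 5_000, 2))  # 7500
```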

*I just committed a README for contrib/stress to the 0.7 svn branch

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Does Major Compaction work on dropped CFs? Doesn't seem so.

2011-01-20 Thread Jonathan Ellis
obsolete sstables are not the same thing as tombstones.

On Thu, Jan 20, 2011 at 8:11 PM, buddhasystem  wrote:
>
> Thanks!
>
> What's strange anyhow is that the GC period for these cfs expired some days
> ago. I thought that a compaction would take care of these tombstones. I used
> nodetool to "compact".
>
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Does-Major-Compaction-work-on-dropped-CFs-Doesn-t-seem-so-tp5946031p5946231.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
Back to square one on using CliMain/CliClient vs the Cassandra/Hector API for
cleanup.

It seems CliClient uses Antlr 3.1+ for parsing the statements passed to it,
but I am using Grails, which uses Antlr 2.7.7 (for Groovy code parsing),
so I can't mix the two for programmatic use.

Could someone please tell me how I can truncate my column families in my Hector-based
environment? Does it expose a thrift Cassandra.Client somewhere so I can
make calls that its API does not cover yet?

Thanks.

On Fri, Jan 21, 2011 at 9:12 AM, Roshan Dawrani wrote:

> On Fri, Jan 21, 2011 at 8:56 AM, Roshan Dawrani 
> wrote:
>
>> On Fri, Jan 21, 2011 at 8:52 AM, Maxim Potekhin  wrote:
>>
>>>  You can script the actions you need and pipe the file into
>>> Cassandra-CLI.
>>> Works for me.
>>>
>>
>>
> Probably CliMain / CliClient will help me there doing it as per your
> suggestion.
>
> Still would like to confirm if I cannot do it through Hector API at this
> point of time, when there  is no direct Hector API call for truncate().
> Anyway I can still reach Cassandra's truncate() call?
>
> Thanks.
>



-- 
Roshan
Blog: http://roshandawrani.wordpress.com/
Twitter: @roshandawrani 
Skype: roshandawrani


Re: Cassandra on iSCSI?

2011-01-20 Thread Mick Semb Wever
> It should work fine; the main reason to go with local storage is the
> huge cost advantage.

[OT] They're quoting roughly the same price for both (claiming that the
extra cost goes into having a separate disk cabinet for each node to run
local raid-5).

> *I just committed a README for contrib/stress to the 0.7 svn branch 

thanks! i'll check it out.

~mck

-- 
“An invasion of armies can be resisted, but not an idea whose time has
come.” - Victor Hugo 
| www.semb.wever.org | www.sesat.no 
| www.finn.no | http://xss-http-filter.sf.net




Re: Embedded Cassandra server startup question

2011-01-20 Thread Roshan Dawrani
Ok, got a Cassandra client from Hector and changed my clean-up to be
truncate() based.

Here is how I did it, if it could be any use to anyone:

=
// Borrow a raw Thrift client from one of Hector's active connection pools
HConnectionManager connectionManager = cassandraCluster.connectionManager
Collection<ConcurrentHClientPool> activePools = connectionManager.activePools

ConcurrentHClientPool pool = activePools.iterator().next()
HThriftClient client = pool.borrowClient()

// Drop down to the Thrift API for truncate(), which the Hector API
// does not expose yet
Cassandra.Client c = client.getCassandra()
c.set_keyspace(keyspaceName)

cfsToTruncate.each { cf ->
    c.truncate(cf)
}
=

Thanks to everyone who shared their inputs.

-- 
Roshan
Blog: http://roshandawrani.wordpress.com/
Twitter: @roshandawrani 
Skype: roshandawrani

On Fri, Jan 21, 2011 at 10:35 AM, Roshan Dawrani wrote:

> Back to square one on using CliMain/CliClient vs Cassandra/Hector API for
> cleanuup.
>
> It seems CliClient uses Antlr 3.1+ for parsing the statements passed to it,
> but I am using Grails that uses Antlr 2.7.7 (used by groovy code parsing),
> so I can't mix the two for programmatic use.
>
> Someone please tell how I can truncate my column families in my Hector
> based environment? Does it expose a thrift Cassandra.Client somewhere so I
> can make calls that its API does not cover yet?
>
> Thanks.
>
> On Fri, Jan 21, 2011 at 9:12 AM, Roshan Dawrani 
> wrote:
>
>> On Fri, Jan 21, 2011 at 8:56 AM, Roshan Dawrani 
>> wrote:
>>
>>> On Fri, Jan 21, 2011 at 8:52 AM, Maxim Potekhin wrote:
>>>
  You can script the actions you need and pipe the file into
 Cassandra-CLI.
 Works for me.

>>>
>>>
>> Probably CliMain / CliClient will help me there doing it as per your
>> suggestion.
>>
>> Still would like to confirm if I cannot do it through Hector API at this
>> point of time, when there  is no direct Hector API call for truncate().
>> Anyway I can still reach Cassandra's truncate() call?
>>
>> Thanks.
>>
>
>
>
> --
> Roshan
> Blog: http://roshandawrani.wordpress.com/
> Twitter: @roshandawrani 
> Skype: roshandawrani
>
>