Re: 200TB in Cassandra ?

2012-04-20 Thread Franc Carter
On Fri, Apr 20, 2012 at 6:27 AM, aaron morton wrote:

> Couple of ideas:
>
> * take a look at compression in 1.X
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
> * is there repetition in the binary data ? Can you save space by
> implementing content addressable storage ?
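
For reference, enabling compression on an existing column family in 1.0 is a
one-liner in cassandra-cli - a sketch assuming a hypothetical CF named Data,
with a tunable chunk length:

update column family Data
  with compression_options = {sstable_compression: SnappyCompressor,
                              chunk_length_kb: 64};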
>

The data is already very highly space optimised. We've come to the
conclusion that Cassandra is probably not the right fit for the use case this
time.

cheers


>
> Cheers
>
>
>   -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/04/2012, at 12:55 AM, Dave Brosius wrote:
>
>  I think your math is 'relatively' correct. It would seem to me you should
> focus on how you can reduce the amount of storage you are using per item,
> if at all possible, if that node count is prohibitive.
>
> On 04/19/2012 07:12 AM, Franc Carter wrote:
>
>
>  Hi,
>
>  One of the projects I am working on is going to need to store about
> 200TB of data - generally in manageable binary chunks. However, after doing
> some rough calculations based on rules of thumb I have seen for how much
> storage should be on each node, I'm worried.
>
>200TB with RF=3 is 600TB = 600,000GB
>   Which is 1000 nodes at 600GB per node
>
>  I'm hoping I've missed something as 1000 nodes is not viable for us.
>
>  cheers
>
>  --
> *Franc Carter* | Systems architect | Sirca Ltd
>  
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
>  Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>
>


-- 

*Franc Carter* | Systems architect | Sirca Ltd
 

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215


RE: default required in cassandra-topology.properties?

2012-04-20 Thread Richard Lowe
As far as I know it's not possible to leave replication factor undefined - if 
you do then Cassandra will default to RF=1 with SimpleStrategy.

The topology is local to each node, so unless all your nodes have the same 
topology file then it's possible for them each to have a different idea about 
the topology of the cluster.

I'm not sure what you're trying to achieve here, so I'll give an example.

Say you have two datacenters, DC1 and DC2. It's perfectly possible for nodes in 
DC1 to have a topology file that only mentions DC1 nodes and nodes in DC2 to 
have a topology file that only mentions DC2 nodes. You can then define one 
keyspace with strategy options DC1: 3 and another with DC2: 3 and this should 
work fine.
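
To make that concrete - a sketch with hypothetical addresses, using the
1.0-era cassandra-cli syntax:

# cassandra-topology.properties on the DC1-only nodes
10.1.0.1=DC1:RAC1
10.1.0.2=DC1:RAC1
default=DC1:RAC1

create keyspace ks_dc1
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC1 : 3};

create keyspace ks_dc2
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC2 : 3};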

However if you had a keyspace with strategy options DC1: 3, DC2: 3 then you 
would AFAIK never be able to write to that column family because none of the 
nodes know enough about the topology; they can either address DC1, or address 
DC2, but not both.

If there were a third type of node that had topology defined for both DC1 and
DC2, those nodes would be able to update the DC1+DC2 keyspace, even
though DC1-only and DC2-only nodes would not.

So if there is a clear segregation in your data then splitting the topology may 
be OK, but if not then you will likely find that you can't update the keyspace 
unless a node has sufficient knowledge of the topology.

Depending on your use case a simpler alternative may be to just run two 
clusters instead of trying to define the shape of a single one through topology 
definitions. I think what you're talking about here is on the edge of what 
Cassandra is designed to do; it works best when all nodes are uniform and have 
the same understanding about the cluster.

Richard


From: Bill Au [mailto:bill.w...@gmail.com]
Sent: 19 April 2012 19:58
To: user@cassandra.apache.org
Subject: Re: default required in cassandra-topology.properties?

I had thought that the topology file is used for replica placement only, such
that for the token range that the unknown node is responsible for, data is
still read and written there.  It just won't be replicated since the
replication factor is not defined.

Bill
On Thu, Apr 19, 2012 at 1:18 PM, Richard Lowe
<richard.l...@arkivum.com> wrote:
Yes it is possible. Put the following as the last line of your topology file:

default=unknown:unknown

So long as you don't have any DC or rack with this name, your local node will
not be able to address any nodes that aren't explicitly given in its topology
file.

However bear in mind that, whilst Cassandra won't try to use replication factor 
to store to these 'unknown' nodes, their token may mean that the 'natural' home 
for a row is on a node that is not addressable. This can create holes in your 
dataset and create situations where data can 'disappear' because the bloom 
filter says the data is on a particular node (due to its token) but the 
coordinator can't contact that node to get at the data.

Careful use of replication factor and NetworkTopologyStrategy can help with 
this, but you should make sure that a node really doesn't need to contact the 
unknown nodes before marking them as such.


Richard


From: Bill Au [mailto:bill.w...@gmail.com]
Sent: 19 April 2012 17:16
To: user@cassandra.apache.org
Subject: default required in cassandra-topology.properties?

All the examples of cassandra-topology.properties that I have seen have a 
default entry assigning unknown nodes to a specific data center and rack.  Is 
it possible to have Cassandra ignore unknown nodes for the purpose of 
replication?

Bill



Re: User authorized for cannot create CFs

2012-04-20 Thread Michal Michalski
Thanks for your reply; the problem is solved. First, I misunderstood the
modify-keyspace param, and then I just missed the fact that I can simply do:


test.=operator

without any wildcards or so. I even tried this solution before and -
after looking into the source code - I was sure it just had to work, but
it failed because of some other, unrelated error in our app, which I
missed before.
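
For reference, the access.properties format for SimpleAuthority - a sketch
based on the sample file bundled with Cassandra, with hypothetical user names:

# users allowed to create/drop keyspaces
<modify-keyspaces>=admin
# read/write for user 'operator' on keyspace 'test'
test.<rw>=operator
# read-only for user 'reporting' on a single CF
test.Standard1.<ro>=reporting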



AFAIK the SimpleAuthenticator, and to some degree authentication (?), has been 
essentially deprecated as it was considered incomplete and was not under 
development. This is why the SimpleAuthenticator was moved out to the examples 
directory in 1.X. I doubt it will be dropped, but (again AFAIK) it is not under 
active development.


Yup, I know. But we do not use it as the main way of securing our data -
we just want to (1) separate access between different parts of
the system by using different users for them, and (2) protect ourselves from
accidental writes to the wrong keyspaces. Thus we don't need it to
work perfectly - I'd say it's a bit like the Windows 95 login prompt,
which could be closed with the Esc button ;) Anyway, even if we treat it this
way, we need to be able to use it comfortably, so it would be a problem for us
to change SimpleAuthenticator to AllowAllAuthority in cassandra.yaml
every time we create/update CFs :)


Anyway, it works now and thanks for your reply :)

MichaƂ



Re: 200TB in Cassandra ?

2012-04-20 Thread Jake Luciani
What other solutions are you considering?  Any OLTP style access of 200TB
of data will require substantial IO.

Do you know how big your working dataset will be?

-Jake

On Fri, Apr 20, 2012 at 3:30 AM, Franc Carter wrote:

> On Fri, Apr 20, 2012 at 6:27 AM, aaron morton wrote:
>
>> Couple of ideas:
>>
>> * take a look at compression in 1.X
>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
>> * is there repetition in the binary data ? Can you save space by
>> implementing content addressable storage ?
>>
>
> The data is already very highly space optimised. We've come to the
> conclusion that Cassandra is probably not the right fit for the use case this
> time.
>
> cheers
>
>
>>
>> Cheers
>>
>>
>>   -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/04/2012, at 12:55 AM, Dave Brosius wrote:
>>
>>  I think your math is 'relatively' correct. It would seem to me you
>> should focus on how you can reduce the amount of storage you are using per
>> item, if at all possible, if that node count is prohibitive.
>>
>> On 04/19/2012 07:12 AM, Franc Carter wrote:
>>
>>
>>  Hi,
>>
>>  One of the projects I am working on is going to need to store about
>> 200TB of data - generally in manageable binary chunks. However, after doing
>> some rough calculations based on rules of thumb I have seen for how much
>> storage should be on each node, I'm worried.
>>
>>200TB with RF=3 is 600TB = 600,000GB
>>   Which is 1000 nodes at 600GB per node
>>
>>  I'm hoping I've missed something as 1000 nodes is not viable for us.
>>
>>  cheers
>>
>>  --
>> *Franc Carter* | Systems architect | Sirca Ltd
>>  
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 9236 9118
>>  Level 9, 80 Clarence St, Sydney NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>>
>>
>>
>
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  
>
> franc.car...@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 9236 9118
>
> Level 9, 80 Clarence St, Sydney NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
>


-- 
http://twitter.com/tjake


Long type column names in reverse order

2012-04-20 Thread Tarun Gupta
Hi,

My requirement is to retrieve column values, sorted by column name in
reverse order (column names are 'long' type). The way I am trying to
implement this is by using a custom comparator. I have written the custom
comparator by using 'org.apache.cassandra.db.marshal.BytesType' and
altering the compare() method. Inserting values works fine, but
while retrieving the values I am getting
"ColumnSerializer$CorruptColumnException".

I've attached the Comparator class. Please suggest what I should change to
make it work.

Regards
Tarun


ReverseColumnComparator.java
Description: Binary data


Re: Long type column names in reverse order

2012-04-20 Thread Edward Capriolo
I think you can drop the custom comparator since that feature already exists.

http://thelastpickle.com/2011/10/03/Reverse-Comparators/
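
From that post: the built-in types accept a reversed flag at creation time,
e.g. in cassandra-cli (a sketch with a hypothetical CF name):

create column family Timeline
  with comparator = 'LongType(reversed=true)';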


On Fri, Apr 20, 2012 at 12:57 PM, Tarun Gupta
 wrote:
> Hi,
>
> My requirement is to retrieve column values, sorted by column name in
> reverse order (column names are 'long' type). The way I am trying to
> implement this is by using a custom comparator. I have written the custom
> comparator by using 'org.apache.cassandra.db.marshal.BytesType' and altering
> the compare() method. Inserting values works fine, but while
> retrieving the values I am getting
> "ColumnSerializer$CorruptColumnException".
>
> I've attached the Comparator class. Please suggest what I should change to
> make it work.
>
> Regards
> Tarun


Re: Two Random Ports in Private port range

2012-04-20 Thread Kirk True
Are these the dynamic JMX ports?

Sent from my iPad

On Apr 19, 2012, at 8:58 AM, W F  wrote:

> Hi All,
> 
> I did a web search of the archives (hope I looked in the right place) and 
> could not find a request like this.
> 
> When Cassandra is running, it seems to create two random TCP listen ports.
> 
> For example: "50378 and 58692", "49952, 52792".
> 
> What are these for, and is there documentation regarding this?
> 
> Sorry if this is already in the archive!
> 
> Thanks ~A


Re: Two Random Ports in Private port range

2012-04-20 Thread W F
Yes, they are.

What are they used for and are they specifically documented somewhere?

Thanks!

On Fri, Apr 20, 2012 at 11:25 AM, Kirk True  wrote:

> Are these the dynamic JMX ports?
>
> Sent from my iPad
>
> On Apr 19, 2012, at 8:58 AM, W F  wrote:
>
> Hi All,
>
> I did a web search of the archives (hope I looked in the right place) and
> could not find a request like this.
>
> When Cassandra is running, it seems to create two random TCP listen ports.
>
> For example: "50378 and 58692", "49952, 52792".
>
> What are these for, and is there documentation regarding this?
>
> Sorry if this is already in the archive!
>
> Thanks ~A
>
>


AUTO: Ken Robbins is out of the office

2012-04-20 Thread Ken Robbins


I am out of the office until 04/21/2012.

I will be out of the office and away from a computer for most of Friday
(4/20). For urgent operational issues (including anything customer
affecting), please send me a text at 781-856-0078.



Note: This is an automated response to your message  "Re: 200TB in
Cassandra ?" sent on 04/20/2012 9:05:21.

This is the only notification you will receive while this person is away.

Help with Wide Rows with CounterColumns

2012-04-20 Thread Praveen Baratam
Hello All,

I have a particular requirement where I need to update CounterColumns in a
row by a specific UID, which is the key for the CounterColumn in that row,
and then query for those columns in that row such that we get the top 5
UIDs with the highest counter values.

create column family Counters
with comparator = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'CounterColumnType';

Can it be done?
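
For context: columns within a row always sort by the comparator - here the
UTF8 UID - never by counter value, so getting the top 5 by value generally
means reading the counters back and sorting client-side. The increments
themselves would look like this cassandra-cli sketch, with a hypothetical row
key and UIDs:

incr Counters['metrics']['uid-1001'];
incr Counters['metrics']['uid-1002'] by 5;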


Re: Long type column names in reverse order

2012-04-20 Thread Tarun Gupta
Thanks, this post helped a lot. I discovered that the built-in comparators
have a static instance called reverseComparator. My exact requirement was
to create an API that allows creating a column family with the required
parameters; one such parameter is a flag that indicates the column order.
I am using the Hector API for this purpose. The way I finally solved this is
as follows:

import java.nio.ByteBuffer;
import java.util.Comparator;

import org.apache.cassandra.db.marshal.AbstractType;
import org.apache.cassandra.db.marshal.BytesType;
import org.apache.cassandra.db.marshal.MarshalException;

public class ReverseColumnComparator extends AbstractType<ByteBuffer> {

    // Delegate ordering to the reversed view of the built-in BytesType comparator
    private static final Comparator<ByteBuffer> otherInstance =
            BytesType.instance.reverseComparator;

    public static final ReverseColumnComparator instance =
            new ReverseColumnComparator();

    @Override
    public int compare(ByteBuffer o1, ByteBuffer o2) {
        return otherInstance.compare(o1, o2);
    }

    @Override
    public ByteBuffer compose(ByteBuffer bytes) {
        return BytesType.instance.compose(bytes);
    }

    @Override
    public ByteBuffer decompose(ByteBuffer value) {
        return BytesType.instance.decompose(value);
    }

    @Override
    public String getString(ByteBuffer bytes) {
        return BytesType.instance.getString(bytes);
    }

    // Needed to satisfy the AbstractType contract; delegates like the rest
    @Override
    public ByteBuffer fromString(String source) throws MarshalException {
        return BytesType.instance.fromString(source);
    }

    @Override
    public void validate(ByteBuffer bytes) throws MarshalException {
        BytesType.instance.validate(bytes);
    }
}
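
If it helps anyone: cassandra-cli also accepts a fully qualified comparator
class name, as long as the class is on the server's classpath - a sketch with
a hypothetical package and CF name:

create column family Events
  with comparator = 'com.example.ReverseColumnComparator';

Though, per the Reverse Comparators post linked above,
'LongType(reversed=true)' gives the same ordering without any custom code.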

Regards,
Tarun


On Fri, Apr 20, 2012 at 11:46 PM, Edward Capriolo wrote:

> I think you can drop the custom comparator since that feature already exists.
>
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
>
> On Fri, Apr 20, 2012 at 12:57 PM, Tarun Gupta
>  wrote:
> > Hi,
> >
> > My requirement is to retrieve column values, sorted by column name in
> > reverse order (column names are 'long' type). The way I am trying to
> > implement this is by using a custom comparator. I have written the custom
> > comparator by using 'org.apache.cassandra.db.marshal.BytesType' and
> > altering the compare() method. Inserting values works fine, but while
> > retrieving the values I am getting
> > "ColumnSerializer$CorruptColumnException".
> >
> > I've attached the Comparator class. Please suggest what I should change
> > to make it work.
> >
> > Regards
> > Tarun
>


Kundera 2.0.6 Released

2012-04-20 Thread Vivek Mishra
Hi All,

We are happy to announce release of Kundera 2.0.6.

Kundera is a JPA 2.0-based object-datastore mapping library for NoSQL
datastores. The idea behind Kundera is to make working with NoSQL databases
drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB
and relational databases.
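
For anyone new to it, a minimal sketch of a Kundera-mapped entity using
standard JPA annotations - the class and column names are hypothetical, and
the datastore/keyspace binding lives in persistence.xml:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "users")          // persisted as the "users" column family
public class User {

    @Id
    private String userId;      // becomes the row key

    @Column(name = "first_name")
    private String firstName;   // a regular column

    // getters and setters omitted
}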

Major Changes in this release:
---
* HBase 0.90.x migration.
* Enhanced Persistence Context.
* Named and native query support (including CQL support for Cassandra)
* UPDATE and DELETE query support.
* DDL auto-schema creation.
* Performance improvements.


To download, use or contribute to Kundera, visit:
http://github.com/impetus-opensource/Kundera
Latest released tag version is 2.0.6. Kundera maven libraries are now
available at:
https://oss.sonatype.org/content/repositories/releases/com/impetus

Sample codes and examples for using Kundera can be found here:
http://github.com/impetus-opensource/Kundera-Examples

Thank you all for your contributions!

Regards,
Kundera Team.


Release: OpsCenter 2.0

2012-04-20 Thread Nick Bailey
Hey everyone,

This past week we released OpsCenter 2.0.

The main updates to the community version of this release are centered
around performance and stability. These updates should greatly improve
performance, especially in larger clusters. Some of the highlights
include:

* Refactoring of metric processing. Metrics are written directly from
OpsCenter agents, which causes less network traffic and load in
general.
* Refactoring of metric storage. This should improve disk space
consumption of metrics written by OpsCenter.
* Update of gc_grace_seconds on OpsCenter column families. This will
allow unused historical OpsCenter data to expire earlier.
* UI performance enhancements.
* Bug fixes.

For anyone that hasn't used OpsCenter, it is a graphical tool for
Cassandra administration and management. The community version is free
for any use.

You can download the OpsCenter 2.0 tarball directly here:
http://downloads.datastax.com/community/, and find instructions for
installing the tarball as well as deb or rpm versions of OpsCenter
here: http://www.datastax.com/docs/opscenter2.0/install_opscenter

Please send us any feedback or issues you have so we can continue to
improve OpsCenter.

-Nick


Re: 200TB in Cassandra ?

2012-04-20 Thread Franc Carter
On Sat, Apr 21, 2012 at 1:05 AM, Jake Luciani  wrote:

> What other solutions are you considering?  Any OLTP style access of 200TB
> of data will require substantial IO.


We currently use an in-house database because, when we first started
our system, there was nothing that handled our problem economically. We
would like to use something more off-the-shelf to reduce maintenance and
development costs.

We've been looking at Hadoop for the computational component. However, it
looks like HDFS does not map well to our storage patterns, as the latency is
quite high. In addition, the resilience model of the NameNode is a concern
in our environment.

We were thinking through whether using Cassandra as the Hadoop data store
is viable for us; however, we've come to the conclusion that it doesn't map
well in this case.


>
> Do you know how big your working dataset will be?
>

The system is batch; jobs could range from very small up to a moderate
percentage of the data set. It's even possible that we could need to read
the entire data set. How much we keep resident is a cost/performance
trade-off we need to make.

cheers


>
> -Jake
>
>
> On Fri, Apr 20, 2012 at 3:30 AM, Franc Carter 
> wrote:
>
>> On Fri, Apr 20, 2012 at 6:27 AM, aaron morton wrote:
>>
>>> Couple of ideas:
>>>
>>> * take a look at compression in 1.X
>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
>>> * is there repetition in the binary data ? Can you save space by
>>> implementing content addressable storage ?
>>>
>>
>> The data is already very highly space optimised. We've come to the
>> conclusion that Cassandra is probably not the right fit for the use case this
>> time.
>>
>> cheers
>>
>>
>>>
>>> Cheers
>>>
>>>
>>>   -
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 20/04/2012, at 12:55 AM, Dave Brosius wrote:
>>>
>>>  I think your math is 'relatively' correct. It would seem to me you
>>> should focus on how you can reduce the amount of storage you are using per
>>> item, if at all possible, if that node count is prohibitive.
>>>
>>> On 04/19/2012 07:12 AM, Franc Carter wrote:
>>>
>>>
>>>  Hi,
>>>
>>>  One of the projects I am working on is going to need to store about
>>> 200TB of data - generally in manageable binary chunks. However, after doing
>>> some rough calculations based on rules of thumb I have seen for how much
>>> storage should be on each node, I'm worried.
>>>
>>>200TB with RF=3 is 600TB = 600,000GB
>>>   Which is 1000 nodes at 600GB per node
>>>
>>>  I'm hoping I've missed something as 1000 nodes is not viable for us.
>>>
>>>  cheers
>>>
>>>  --
>>> *Franc Carter* | Systems architect | Sirca Ltd
>>>  
>>> franc.car...@sirca.org.au | www.sirca.org.au
>>> Tel: +61 2 9236 9118
>>>  Level 9, 80 Clarence St, Sydney NSW 2000
>>> PO Box H58, Australia Square, Sydney NSW 1215
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Franc Carter* | Systems architect | Sirca Ltd
>>  
>>
>> franc.car...@sirca.org.au | www.sirca.org.au
>>
>> Tel: +61 2 9236 9118
>>
>> Level 9, 80 Clarence St, Sydney NSW 2000
>>
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>>
>
>
> --
> http://twitter.com/tjake
>



-- 

*Franc Carter* | Systems architect | Sirca Ltd
 

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215


Cql 3 wide rows filter expressions in where clause

2012-04-20 Thread Nagaraj J
Hi

CQL 3 for wide rows is very promising. I was wondering if there is support
for filtering wide rows by additional filter expressions in the where clause
(columns other than those that are part of the composite key).

Ex.
Suppose I have a sparse CF:

create columnfamily scf( k ascii, o ascii, x ascii, y ascii, z ascii,
PRIMARY KEY(k, o));

Is it possible to have a query:

select * from scf where k=1 and x=2 and z=2 order by o ASC;

I tried this with 1.1-rc and it doesn't work as expected. I also looked at
cql_tests.py in https://issues.apache.org/jira/browse/CASSANDRA-2474 and there
is no mention of this.

Am I missing something here?

Thanks in advance
Nagaraj 
