Can Cassandra make real use of several DataFileDirectories?
I have a configuration like this in storage-conf.xml:

<DataFileDirectories>
    <DataFileDirectory>/storage01/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/storage02/cassandra/data</DataFileDirectory>
    <DataFileDirectory>/storage03/cassandra/data</DataFileDirectory>
</DataFileDirectories>

After loading a big chunk of data into Cassandra, I end up with some 70GB in the first directory, and only about 10GB in the second and third ones. All rows are quite small, so it's not just some big rows that contain the majority of the data. Does Cassandra have the ability to 'see' the maximum available space in these directories? I'm asking myself this question since my limit is 100GB per directory, and the first directory is approaching this limit... And wouldn't it be better if Cassandra tried to 'load-balance' the files across the directories? That would result in better (read) performance if the directories are on different disks (which is the case for me). Any help is appreciated. Roland
Re: how to store file in the cassandra?
Hi Jonathan, Cassandra does not seem to have a Blob data type; to handle binary large object data we have to use arrays of bytes. I have a question for you. Suppose I have a 15 MB MPEG video file. To save this video file in the Cassandra database I will store it as an array of bytes. One day I decide this video is no longer needed, so I delete it from the database. My question is: after I delete this video from the Cassandra database, do I need to perform some defragmentation operation on Cassandra's data files? Thank you. On Mon, Apr 26, 2010 at 8:28 AM, Jonathan Ellis wrote: > Cassandra stores byte arrays. You can certainly store file data in > it, although if it is larger than a few MB you should chunk it into > multiple columns. > > On Sun, Apr 25, 2010 at 8:21 PM, Shuge Lee wrote: > > Yes. > > > > Cassandra does save raw string data only, not a file, and shouldn't save > a > > file. > > > > 2010/4/26 刘兵兵 > >> > >> sorry i'm not very familiar with python, are you meaning that the files > >> are stored in the file system of the os? > >> > >> then , the cassandra just stores the path to access the files? > >> > >> > >> On Mon, Apr 26, 2010 at 8:57 AM, Shuge Lee wrote: > >>> > >>> In Python: > >>> > >>> keyspace.columnfamily[key][column] = value > >>> > >>> files.video[uuid.uuid4()]['name'] = 'foo.flv' > >>> files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv' > >>> > >>> create a mapping > >>> files.video = { > >>> uuid.uuid4() : { > >>> 'name' : 'foo.flv', > >>> 'path' : '/var/files/foo.flv', > >>> } > >>> } > >>> > >>> if most of sizes >= 0.5MB, use sys-fs/reiser4progs, else use ext4. > >>> > >>> > >>> 2010/4/26 Bingbing Liu > > any suggestion? > > 2010-04-26 > > Bingbing Liu > >>> > >>> > >>> -- > >>> Shuge Lee | Lee Li | 李蠡 > >> > >> > >> > >> -- > >> Bingbing Liu > >> > >> Web and Mobile Data Management lab > >> > >> Renmin University of China > > > > > > > > -- > > Shuge Lee | Lee Li | 李蠡 > > >
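To make the chunking advice in the reply above concrete, here is a minimal sketch in Python of splitting a large file into column-sized pieces before insertion. The chunk size, the 'chunk-NNNNNNNN' column-naming scheme, and the client call in the final comment are assumptions made for illustration, not part of any particular Cassandra client API.

CHUNK_SIZE = 1 * 1024 * 1024  # assumed 1 MB per column

def file_to_columns(path):
    """Read a file and return a {column_name: chunk_bytes} dict for a single row."""
    columns = {}
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # zero-padded names keep the chunks ordered under a bytes/ASCII comparator
            columns['chunk-%08d' % index] = chunk
            index += 1
    return columns

columns = file_to_columns('video.mpg')
# hypothetical client call: client.batch_insert('Files', 'video-key', {'FileData': columns})

Reassembling the file is then just a matter of reading the columns back in name order and concatenating their values. Deleted data is reclaimed later by compaction, so no manual defragmentation step is involved.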
Re: when i use the OrderPreservingPartition, the load is very imbalance
1) you can re-balance a node with bin/nodetool -h token [] specify a new token manually or let the system guess one. 2) take a look into your system.log to find out why your nodes are dying. 2010/4/26 刘兵兵 > i do some INSERT ,because i will do some scan operations, i use the > OrderPreservingPartition method. > > the state of the cluster is showed below. > > as i predicated the load is very imbalance, and some of the nodes down (in > some nodes,the Cassandra processes died and in others the processes are > > alive but they still down), > > so i have two questions: > > 1)how to do balance after the insert ends? > > 2)why the nodes died? how to make them up again (when the situation is that > the process is alive but the node'state is down) > > thx > > 10.37.17.241 Up 47.65 GB > 0p6ovvUXMJ4cdd1L |<--| > 10.37.17.234 Up 67.41 GB > 5OxiS2DKBZLeISPg | ^ > 10.37.17.235 Up 67.54 GB > 7UDcS0SToePuQACe v | > 10.37.17.246 Up 555 bytes > OCvC3nqKLeKA5n0I | ^ > 10.37.17.233 Up 830 bytes > SJp6cQRNox52av2Y v | > 10.37.17.249 Up 830 bytes > SxVmCVcruOpoS48B | ^ > 10.37.17.247 Up 555 bytes > TGctCMvfNuRo7RjS v | > 10.37.17.245 Up 555 bytes > j2smY0OOtQ0SeeHY | ^ > 10.37.17.250 Up 830 bytes > jNwBPchW58i5tGxp v | > 10.37.17.248 Up 830 bytes > jYWaJC93OyMdWDaN | ^ > 10.37.17.237 Up 830 bytes > mPwhLOsKlbPart6j v | > 10.37.17.236 Up 830 bytes > noh0t8HJgw4hmz7I | ^ > 10.37.17.244 Up 555 bytes > q8c8SPYEkWEzmFcR v | > 10.37.17.238 Up 555 bytes > rIuuq3AR4DVK989X | ^ > 10.37.17.242 Up 555 bytes > smebTmIvQBMG56Zf v | > 10.37.17.243 Up 555 bytes > tWTYyiqAKQVw7197 | ^ > 10.37.17.232 Up 830 bytes > uVdBQkR9Dszm5deK v | > 10.37.17.239 Up 555 bytes > xXQkDQn1vvg8e1xS | ^ > 10.37.17.240 Up 555 bytes > yQRrq9RG2dUsHUyR |-->| > > > -- > Bingbing Liu > > Web and Mobile Data Management lab > > Renmin University of China >
Re: how to store file in the cassandra?
On 26 April 2010 00:57, Shuge Lee wrote: > In Python: > > keyspace.columnfamily[key][column] = value > > files.video[uuid.uuid4()]['name'] = 'foo.flv' > files.video[uuid.uuid4()]['path'] = '/var/files/foo.flv' > Hi. Storing the filename in the database will not solve the file storage problem. Cassandra is a distributed database, and a file stored locally will not be available on other client nodes. If you're using Cassandra at all, that probably implies that you have lots of client nodes. A non-redundant NFS server (for example) would not offer high availability, so would be inadequate for the OP's situation. Storing files *IN* Cassandra is very useful because you can then retrieve them from anywhere with high availability. However, as others have discussed, they should be split across multiple columns, or if very big, multiple rows. I prefer to split by row because this scales better to very large files. During compaction, as is well noted, Cassandra needs the entire row in memory, which will cause a FAIL once you have files more than a few gigs. Mark
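Following on from the row-splitting preference above, a second sketch (again Python; the chunk size and all names are assumptions) shows the alternative layout in which every chunk becomes its own row, so no single row ever grows large enough to matter at compaction time.

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MB per chunk/row

def file_to_rows(file_id, path):
    """Yield (row_key, chunk_bytes) pairs; each pair is written as a separate row."""
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # zero-padded indexes keep the chunk rows in scan order under an
            # order-preserving partitioner
            yield ('%s:%010d' % (file_id, index), chunk)
            index += 1

for row_key, chunk in file_to_rows('video-123', 'big-video.mpg'):
    pass  # hypothetical client call: client.insert('FileChunks', row_key, {'data': chunk})

A small per-file metadata row (total size, chunk count) makes retrieval and deletion straightforward.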
Re: when i use the OrderPreservingPartition, the load is very imbalance
sorry, if specifying the token manually, use: bin/nodetool -h move 2010/4/26 Roland Hänel > 1) you can re-balance a node with > > bin/nodetool -h token [] > > specify a new token manually or let the system guess one. > > 2) take a look into your system.log to find out why your nodes are dying. > > > 2010/4/26 刘兵兵 > > i do some INSERT ,because i will do some scan operations, i use the >> OrderPreservingPartition method. >> >> the state of the cluster is showed below. >> >> as i predicated the load is very imbalance, and some of the nodes down (in >> some nodes,the Cassandra processes died and in others the processes are >> >> alive but they still down), >> >> so i have two questions: >> >> 1)how to do balance after the insert ends? >> >> 2)why the nodes died? how to make them up again (when the situation is >> that the process is alive but the node'state is down) >> >> thx >> >> 10.37.17.241 Up 47.65 GB >> 0p6ovvUXMJ4cdd1L |<--| >> 10.37.17.234 Up 67.41 GB >> 5OxiS2DKBZLeISPg | ^ >> 10.37.17.235 Up 67.54 GB >> 7UDcS0SToePuQACe v | >> 10.37.17.246 Up 555 bytes >> OCvC3nqKLeKA5n0I | ^ >> 10.37.17.233 Up 830 bytes >> SJp6cQRNox52av2Y v | >> 10.37.17.249 Up 830 bytes >> SxVmCVcruOpoS48B | ^ >> 10.37.17.247 Up 555 bytes >> TGctCMvfNuRo7RjS v | >> 10.37.17.245 Up 555 bytes >> j2smY0OOtQ0SeeHY | ^ >> 10.37.17.250 Up 830 bytes >> jNwBPchW58i5tGxp v | >> 10.37.17.248 Up 830 bytes >> jYWaJC93OyMdWDaN | ^ >> 10.37.17.237 Up 830 bytes >> mPwhLOsKlbPart6j v | >> 10.37.17.236 Up 830 bytes >> noh0t8HJgw4hmz7I | ^ >> 10.37.17.244 Up 555 bytes >> q8c8SPYEkWEzmFcR v | >> 10.37.17.238 Up 555 bytes >> rIuuq3AR4DVK989X | ^ >> 10.37.17.242 Up 555 bytes >> smebTmIvQBMG56Zf v | >> 10.37.17.243 Up 555 bytes >> tWTYyiqAKQVw7197 | ^ >> 10.37.17.232 Up 830 bytes >> uVdBQkR9Dszm5deK v | >> 10.37.17.239 Up 555 bytes >> xXQkDQn1vvg8e1xS | ^ >> 10.37.17.240 Up 555 bytes >> yQRrq9RG2dUsHUyR |-->| >> >> >> -- >> Bingbing Liu >> >> Web and Mobile Data Management lab >> >> Renmin University of China >> > >
Re: when i use the OrderPreservingPartition, the load is very imbalance
On 26 April 2010 01:18, 刘兵兵 wrote: > i do some INSERT ,because i will do some scan operations, i use the > OrderPreservingPartition method. > > the state of the cluster is showed below. > > as i predicated the load is very imbalance I think the solution to this would be to choose your nodes' tokens wisely before you start inserting data, and if possible, modify the keys to split them better between the nodes. For example, suppose your key has two parts, one of which you want to range scan and another which you don't. Say you have a customer_id and a timestamp. The customer ID does not need to be range scanned, so you can hash it into a hex value (say), then append the timestamp (in a lexically sortable way of course). So you'd end up with keys like <hash>-0012345-0001234567890 where <hash> is a hash of the customer ID, 0012345 is the customer ID, and the rest is a timestamp. You'd be able to do a time range scan by using the known prefixes, and distributing your nodes' tokens equally from <hash> = 00... to <hash> = ff... would result in fairly even data (provided you don't have a very small number of very large customers). Mark
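A small sketch of the key scheme described above, in Python. The hash function (MD5), the 8-character prefix length, and the timestamp width are arbitrary choices made for illustration; any stable hash with a fixed-width hex prefix works the same way.

import hashlib
import time

def make_key(customer_id, ts_millis=None):
    """Build an order-preserving key of the form <hash>-<customer_id>-<timestamp>."""
    if ts_millis is None:
        ts_millis = int(time.time() * 1000)
    # a fixed-width hex prefix spreads customers evenly over the token range
    prefix = hashlib.md5(str(customer_id).encode('utf-8')).hexdigest()[:8]
    # zero-padded, fixed-width fields keep the whole key lexically sortable
    return '%s-%07d-%013d' % (prefix, customer_id, ts_millis)

print(make_key(12345))  # e.g. 'a6f1b2c3-0012345-1272300000000' (illustrative output)

A time-range scan for one customer then uses keys built from the same prefix and customer ID, with the desired start and end timestamps as the range bounds.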
Re: when i use the OrderPreservingPartition, the load is very imbalance
When starting your cassandra cluster, please configure the InitialToken for each node, which make the key range balance. On Mon, Apr 26, 2010 at 6:17 PM, Mark Robson wrote: > On 26 April 2010 01:18, 刘兵兵 wrote: > >> i do some INSERT ,because i will do some scan operations, i use the >> OrderPreservingPartition method. >> >> the state of the cluster is showed below. >> >> as i predicated the load is very imbalance > > > > I think the solution to this would be to choose your nodes' tokens wisely > before you start inserting data, and if possible, modify the keys to split > them better between the nodes. > > For example, if your key has two parts, one of which you want to range > scan, another which you don't. Say you have customer_id and a timestamp. The > customer ID does not need to be range scanned, so you can hash it into a hex > value (say), then append the timestamp (in a lexically sortable way of > course). So you'd end up with keys like > > -0012345-0001234567890 > > Where is a hash of the customer ID, 0012345 is the customer ID, and > the rest is a timestamp. > > You'd be able to do a time range scan by using the known prefixes, and > distributing your nodes equally from to would result in fairly > even data (provided you don't have a very small number of very large > customers). > > Mark >
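To make the InitialToken advice concrete for the hashed-prefix keys discussed in this thread: evenly spaced hex prefixes are reasonable token choices. A tiny sketch (Python; the two-hex-digit granularity is just an assumption) computes them for an N-node cluster, and each value would go into the corresponding node's InitialToken setting in storage-conf.xml.

def initial_tokens(num_nodes):
    """Return num_nodes token strings spread evenly over the 00..ff hex prefix space."""
    return ['%02x' % (i * 256 // num_nodes) for i in range(num_nodes)]

print(initial_tokens(4))  # ['00', '40', '80', 'c0']

With OrderPreservingPartitioner the tokens are compared as strings, so each node then owns roughly an equal slice of the hashed key space.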
Re: Re: when i use the OrderPreservingPartition, the load is very imbalance
thank you so much for your help! 2010-04-26 Bingbing Liu From: Mark Robson Sent: 2010-04-26 18:17:53 To: user Cc: Subject: Re: when i use the OrderPreservingPartition, the load is very imbalance On 26 April 2010 01:18, 刘兵兵 wrote: i do some INSERT ,because i will do some scan operations, i use the OrderPreservingPartition method. the state of the cluster is showed below. as i predicated the load is very imbalance I think the solution to this would be to choose your nodes' tokens wisely before you start inserting data, and if possible, modify the keys to split them better between the nodes. For example, if your key has two parts, one of which you want to range scan, another which you don't. Say you have customer_id and a timestamp. The customer ID does not need to be range scanned, so you can hash it into a hex value (say), then append the timestamp (in a lexically sortable way of course). So you'd end up with keys like <hash>-0012345-0001234567890 where <hash> is a hash of the customer ID, 0012345 is the customer ID, and the rest is a timestamp. You'd be able to do a time range scan by using the known prefixes, and distributing your nodes' tokens equally from <hash> = 00... to <hash> = ff... would result in fairly even data (provided you don't have a very small number of very large customers). Mark
Re: Can Cassandra make real use of several DataFileDirectories?
Please refer to the code:

org.apache.cassandra.db.ColumnFamilyStore

    public String getFlushPath()
    {
        long guessedSize = 2 * DatabaseDescriptor.getMemtableThroughput() * 1024*1024; // 2* adds room for keys, column indexes
        String location = DatabaseDescriptor.getDataFileLocationForTable(table_, guessedSize);
        if (location == null)
            throw new RuntimeException("Insufficient disk space to flush");
        return new File(location, getTempSSTableFileName()).getAbsolutePath();
    }

and we can go through org.apache.cassandra.config.DatabaseDescriptor:

    public static String getDataFileLocationForTable(String table, long expectedCompactedFileSize)
    {
        long maxFreeDisk = 0;
        int maxDiskIndex = 0;
        String dataFileDirectory = null;
        String[] dataDirectoryForTable = getAllDataFileLocationsForTable(table);

        for ( int i = 0 ; i < dataDirectoryForTable.length ; i++ )
        {
            File f = new File(dataDirectoryForTable[i]);
            if( maxFreeDisk < f.getUsableSpace())
            {
                maxFreeDisk = f.getUsableSpace();
                maxDiskIndex = i;
            }
        }
        // Load factor of 0.9 we do not want to use the entire disk that is too risky.
        maxFreeDisk = (long)(0.9 * maxFreeDisk);
        if( expectedCompactedFileSize < maxFreeDisk )
        {
            dataFileDirectory = dataDirectoryForTable[maxDiskIndex];
            currentIndex = (maxDiskIndex + 1 )%dataDirectoryForTable.length ;
        }
        else
        {
            currentIndex = maxDiskIndex;
        }
        return dataFileDirectory;
    }

So, DataFileDirectories is meant for multiple disks or disk partitions: each flush is written to whichever directory currently has the most usable space. I think your storage01, storage02 and storage03 are on the same disk or disk partition.

2010/4/26 Roland Hänel > I have a configuration like this: > > > /storage01/cassandra/data > /storage02/cassandra/data > /storage03/cassandra/data > > > After loading a big chunk of data into cassandra, I end up wich some 70GB > in the first directory, and only about 10GB in the second and third one. All > rows are quite small, so it's not just some big rows that contain the > majority of data. > > Does Cassandra have the ability to 'see' the maximum available space in > these directory? I'm asking myself this question since my limit is 100GB, > and the first directory is approaching this limit... > > And, wouldn't it be better if Cassandra tried to 'load-balance' the files > inside the directories because this will result in better (read) performance > if the directories are on different disks (which is the case for me)? > > Any help is appreciated. > > Roland > >
Re: Can Cassandra make real use of several DataFileDirectories?
Thanks very much. Precisely answers my questions. :-) 2010/4/26 Schubert Zhang > Please refer the code: > > org.apache.cassandra.db.ColumnFamilyStore > > public String getFlushPath() > { > long guessedSize = 2 * DatabaseDescriptor.getMemtableThroughput() * > 1024*1024; // 2* adds room for keys, column indexes > String location = > DatabaseDescriptor.getDataFileLocationForTable(table_, guessedSize); > if (location == null) > throw new RuntimeException("Insufficient disk space to flush"); > return new File(location, > getTempSSTableFileName()).getAbsolutePath(); > } > > and we can go through org.apache.cassandra.config.DatabaseDescriptor: > > public static String getDataFileLocationForTable(String table, long > expectedCompactedFileSize) > { > long maxFreeDisk = 0; > int maxDiskIndex = 0; > String dataFileDirectory = null; > String[] dataDirectoryForTable = > getAllDataFileLocationsForTable(table); > > for ( int i = 0 ; i < dataDirectoryForTable.length ; i++ ) > { > File f = new File(dataDirectoryForTable[i]); > if( maxFreeDisk < f.getUsableSpace()) > { > maxFreeDisk = f.getUsableSpace(); > maxDiskIndex = i; > } > } > // Load factor of 0.9 we do not want to use the entire disk that is > too risky. > maxFreeDisk = (long)(0.9 * maxFreeDisk); > if( expectedCompactedFileSize < maxFreeDisk ) > { > dataFileDirectory = dataDirectoryForTable[maxDiskIndex]; > currentIndex = (maxDiskIndex + 1 )%dataDirectoryForTable.length ; > } > else > { > currentIndex = maxDiskIndex; > } > return dataFileDirectory; > } > > So, DataFileDirectories means multiple disks or disk-partitions. > I think your storage01, storage02 and storage03 are in same disk or disk > partition. > > > 2010/4/26 Roland Hänel > > I have a configuration like this: >> >> >> /storage01/cassandra/data >> /storage02/cassandra/data >> /storage03/cassandra/data >> >> >> After loading a big chunk of data into cassandra, I end up wich some 70GB >> in the first directory, and only about 10GB in the second and third one. All >> rows are quite small, so it's not just some big rows that contain the >> majority of data. >> >> Does Cassandra have the ability to 'see' the maximum available space in >> these directory? I'm asking myself this question since my limit is 100GB, >> and the first directory is approaching this limit... >> >> And, wouldn't it be better if Cassandra tried to 'load-balance' the files >> inside the directories because this will result in better (read) performance >> if the directories are on different disks (which is the case for me)? >> >> Any help is appreciated. >> >> Roland >> >> >
Re: when i use the OrderPreservingPartition, the load is very imbalance
Hello Mark, On 26/04/2010, at 07:17, Mark Robson wrote: > I think the solution to this would be to choose your nodes' tokens wisely > before you start inserting data, and if possible, modify the keys to split > them better between the nodes. > > For example, if your key has two parts, one of which you want to range scan, > another which you don't. Say you have customer_id and a timestamp. The > customer ID does not need to be range scanned, so you can hash it into a hex > value (say), then append the timestamp (in a lexically sortable way of > course). So you'd end up with keys like > > <hash>-0012345-0001234567890 > > Where <hash> is a hash of the customer ID, 0012345 is the customer ID, and the > rest is a timestamp. > > You'd be able to do a time range scan by using the known prefixes, and > distributing your nodes equally from <hash> = 00... to <hash> = ff... would result in fairly even > data (provided you don't have a very small number of very large customers). How do you ask cassandra to do a range scan with a prefix? As far as I can tell, you can't do something like: db.get_range('SomeCF', :start => '<hash>-0012345-*') ...do you? Regards -- Lucas Di Pentima - Santa Fe, Argentina Jabber: lu...@di-pentima.com.ar MSN: ldipent...@hotmail.com
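A wildcard like '*' is not supported; a prefix scan is normally expressed by turning the prefix into start/finish bounds for get_range_slices. A minimal sketch in Python (the client calls in the trailing comments are hypothetical and only indicate where the bounds would be used):

def prefix_range(prefix):
    """Return (start, finish) bounds covering every key that begins with prefix."""
    # incrementing the last character gives a bound just past the prefix;
    # a key equal to the bound itself would not match and can be filtered out
    finish = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, finish

start, finish = prefix_range('a6f1b2c3-0012345-')
# Ruby gem (hypothetical usage): db.get_range('SomeCF', :start => start, :finish => finish)
# raw Thrift (hypothetical usage): get_range_slices with a KeyRange(start_key=start, end_key=finish)

This only behaves as expected under an order-preserving partitioner, which is what this thread assumes.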
Re: value size, is there a suggested limit?
I think that is not what Cassandra is good at. On Mon, Apr 26, 2010 at 4:22 AM, Mark Greene wrote: > http://wiki.apache.org/cassandra/CassandraLimitations > > > On Sun, Apr 25, 2010 at 4:19 PM, S Ahmed wrote: > >> Is there a suggested sized maximum that you can set the value of a given >> key? >> >> e.g. could I convert a document to bytes and store it as a value to a key? >> if yes, which I presume so, what if the file is 10mb? or 100mb? >> > >
Cassandra use cases: as a datagrid ? as a distributed cache ?
Hi, Cassandra comes closer and closer to a data grid like Oracle Coherence: Cassandra includes distributed "hash maps", partitioning, high availability, map/reduce processing, (some) query capability, etc. So I am wondering about the two following (possible?) Cassandra use cases: (1) Has anyone already used Cassandra as an in-memory data grid? If not, does anyone know how far such a database is from, let's say, Oracle Coherence? Does Cassandra provide, for example, a (synchronized) cache on the client side? (2) Has anyone already used Cassandra as a distributed cache? Are there testimonials somewhere about this use case? Thanks for your help. Regards, Dominique
Re: MapReduce, Timeouts and Range Batch Size
OPP will be marginally faster. Maybe 10%? I don't think anyone has benchmarked it. On Fri, Apr 23, 2010 at 10:30 AM, Joost Ouwerkerk wrote: > In that case I should probably wait for 0.7. Is there any fundamental > performance difference in get_range_slices between Random and > Order-Preserving partitioners. If so, by what factor? > joost. > > On Fri, Apr 23, 2010 at 10:47 AM, Jonathan Ellis wrote: >> >> You could look into it, but it's not going to be an easy backport >> since SSTableReader and SSTableScanner got split into two classes in >> trunk. >> >> On Fri, Apr 23, 2010 at 9:39 AM, Joost Ouwerkerk >> wrote: >> > Awesome. In the meantime, I hacked something similar myself. The >> > performance difference does not appear to be material. I think the real >> > killer is the get_range_slices call. Relative to that, the cost of >> > getting >> > the connection appears to be more or less trivial. What can I do to >> > alleviate that cost? CASSANDRA-821 looks interesting -- can I apply >> > that to >> > 0.6.1 ? >> > joost. >> > On Fri, Apr 23, 2010 at 9:39 AM, Jonathan Ellis >> > wrote: >> >> >> >> Great! Created https://issues.apache.org/jira/browse/CASSANDRA-1017 >> >> to track this. >> >> >> >> On Fri, Apr 23, 2010 at 4:12 AM, Johan Oskarsson >> >> wrote: >> >> > I have written some code to avoid thrift reconnection, it just keeps >> >> > the >> >> > connection open between get_range_slices calls. >> >> > I can extract that and put it up but not until early next week. >> >> > >> >> > /Johan >> >> > >> >> > On 23 apr 2010, at 05.09, Jonathan Ellis wrote: >> >> > >> >> >> That would be an easy win, sure. >> >> >> >> >> >> On Thu, Apr 22, 2010 at 9:27 PM, Joost Ouwerkerk >> >> >> >> >> >> wrote: >> >> >>> I was getting client timeouts in >> >> >>> ColumnFamilyRecordReader.maybeInit() >> >> >>> when >> >> >>> MapReducing. So I've reduced the Range Batch Size to 256 (from >> >> >>> 4096) >> >> >>> and >> >> >>> this seems to have fixed my problem, although it has slowed things >> >> >>> down a >> >> >>> bit -- presumably because there are 16x more calls to >> >> >>> get_range_slices. >> >> >>> While I was in that code I noticed that a new client was being >> >> >>> created >> >> >>> for >> >> >>> each batch get. By decreasing the batch size, I've increased this >> >> >>> overhead. I'm thinking of re-writing ColumnFamilyRecordReader to >> >> >>> do >> >> >>> some >> >> >>> connection pooling. Anyone have any thoughts on that? >> >> >>> joost. >> >> >>> >> >> > >> >> > >> > >> > > >
Re: newbie question on how columns names are indexed/lucene limitations?
The column index within a row is a sorted, blocked index (like a b-tree), just like Bigtable's. On Mon, Apr 26, 2010 at 2:43 AM, Stu Hood wrote: > The indexes within rows are _not_ implemented with Lucene: there is a > custom index structure that allows for random access within a row. But, you > should probably read http://wiki.apache.org/cassandra/CassandraLimitations to > understand the current limitations of the file format, some of which are > scheduled to be fixed soon. > > -Original Message- > From: "TuX RaceR" > Sent: Sunday, April 25, 2010 11:54am > To: user@cassandra.apache.org > Subject: newbie question on how columns names are indexed/lucene > limitations? > > Hello Cassandra Users, > > When using the RandomPartitioner and a simple ColumnFamily/Columns (i.e. > no SuperColumns), my understanding is that one single Row can store > millions of columns. > > If I look at the http://wiki.apache.org/cassandra/API, I understand that > I can get a subset of the millions of columns defined above using: > SlicePredicate->ColumnNames or SlicePredicate->SliceRange > > My question is about the implementation of this column 'selection'. > I vaguely remember reading somewhere (but I cannot find the link again) > that this was implemented using a Lucene index over the column names for > each row. > Is that true? Is there a small lucene index per row? > > Also we know that Lucene has some limitations > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations): you > cannot index more than 2.1 billion documents as a document ID is mapped > to a 32 bit int. > > As I plan to store in column names the IDs of my cassandra documents (the > global number of documents can go well beyond 2.1 billion), will I be > hit by the lucene limitations? I.e. can I store cassandra document IDs > (i.e. keys) in column names, if in each individual row there are no more > than a few million of those IDs? I guess the answer is "yes I can", > because lucandra uses a similar schema, but it is not clear to me why. > Is that because the lucene index is made on each row, and what really > matters is the number of columns in one single row and not the number of > distinct column names (globally over all the rows)? > > > Thanks in advance > TuX > > >
RE: Does anybody work about transaction on cassandra ?
Orthogonal in this case means "at cross purposes" Transactions can't really be done with eventual consistency because all nodes don't have all the info at the time the transaction is done. I think they recommend zookeeper for this kind of stuff, but I don't know why you want to use Cassandra vs a RDBMS if you really want transactions. From: dir dir [mailto:sikerasa...@gmail.com] Sent: Saturday, April 24, 2010 12:08 PM To: user@cassandra.apache.org Subject: Re: Does anybody work about transaction on cassandra ? >Transactions are orthogonal to the design of Cassandra Sorry, Would you want to tell me what is an orthogonal mean in this context?? honestly I do not understand what is it. Thank you. On Thu, Apr 22, 2010 at 9:14 PM, Miguel Verde mailto:miguelitov...@gmail.com>> wrote: No, as far as I know no one is working on transaction support in Cassandra. Transactions are orthogonal to the design of Cassandra[1][2], although a system could be designed incorporating Cassandra and other elements a la Google's MegaStore[3] to support transactions. Google uses Paxos, one might be able to use Zookeeper[4] to design such a system, but it would be a daunting task. [1] http://www.julianbrowne.com/article/viewer/brewers-cap-theorem [2] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html [3] http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx [4] http://hadoop.apache.org/zookeeper/ On Thu, Apr 22, 2010 at 2:56 AM, Jeff Zhang mailto:zjf...@gmail.com>> wrote: Hi all, I need transaction support on cassandra, so wondering is anybody work on it ? -- Best Regards Jeff Zhang
Re: org.apache.cassandra.dht.OrderPreservingPartitioner Initial Token
Hi Jonathan Ellis and Stu Hood, I think, finally, we should provide a user customizable key abstract class. User can define what types of key and its class, which define how to compare keys. Schubert On Sat, Apr 24, 2010 at 1:16 PM, Stu Hood wrote: > Your keys cannot be an encoded as binary for OPP, since Cassandra will > attempt to decode them as UTF-8, meaning that they may not come back in the > same format. > > 0.7 supports byte keys using the ByteOrderedPartitioner, and tokens are > specified using hex. > > -Original Message- > From: "Mark Jones" > Sent: Friday, April 23, 2010 10:55am > To: "user@cassandra.apache.org" > Subject: RE: org.apache.cassandra.dht.OrderPreservingPartitioner Initial > Token > > So if my keys are binary, is there any way to escape the keysequence in? > > I have 20 bytes (any value 0x0-0xff is possible) as the key. > > Are they compared as an array of bytes? So that I can use truncation? > > 4 nodes, broken up by 0x00, 0x40, 0x80, 0xC0? > > > -Original Message- > From: Jonathan Ellis [mailto:jbel...@gmail.com] > Sent: Friday, April 23, 2010 10:22 AM > To: user@cassandra.apache.org > Subject: Re: org.apache.cassandra.dht.OrderPreservingPartitioner Initial > Token > > a normal String from the same universe as your keys. > > On Fri, Apr 23, 2010 at 7:23 AM, Mark Jones wrote: > > How is this specified? > > > > Is it a large hex #? > > > > A string of bytes in hex? > > > > > > > > http://wiki.apache.org/cassandra/StorageConfiguration doesn't say. > > >
Re: ORM in Cassandra?
I think you should forget these RDBMS tech. On Sat, Apr 24, 2010 at 11:00 AM, aXqd wrote: > On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert > wrote: > > There is nothing wrong with what you are asking. Some work has been done > to > > get an ORM layer ontop of cassandra, for example, with a RubyOnRails > > project. I'm trying to simplify cassandra integration with grails with > the > > plugin I'm writing. > > The problem is ORM solutions to date are wrapping a relational database. > > (The 'R' in ORM) Cassandra isn't a relational database so it does not map > > cleanly. > > Thanks. I noticed this problem before. I just want to know, in the > first place, what exactly is the right way to model relations in > Cassandra(a no-relational database). > So far, I still have those entities, and, without foreign keys, I use > relational entities, which contains the IDs of both sides of > relations. > In some other cases, I just duplicate data, and maintain the relations > manually by updating all the data in the same time. > > Is this the right way to go? Or what I am doing is still trying to > convert Cassandra to a RDBMS? > > > > > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: > >> > >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud > >> wrote: > >> > I understand the question more like : Is there already a lib which > >> > help to get rid of writing hardcoded and hard to maintain lines like : > >> > > >> > MyClass data; > >> > String[] myFields = {"name", "label", ...} > >> > List columns; > >> > for (String field : myFields) { > >> >if (field == "name") { > >> > columns.add(new Column(field, data.getName())) > >> >} else if (field == "label") { > >> > columns.add(new Column(field, data.getLabel())) > >> >} else ... > >> > } > >> > (same for loading (instanciating) automagically the object). > >> > >> Yes, I am talking about this question. > >> > >> > > >> > Kind regards, > >> > > >> > Benoit. > >> > > >> > 2010/4/23 dir dir : > >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is there > >> >>>anything we can take from ORM? > >> >> > >> >> Honestly I do not understand what is your question. It is clear that > >> >> you can not combine ORM such as Hibernate or iBATIS with Cassandra. > >> >> Cassandra it self is not a RDBMS, so you will not map the table into > >> >> the object. > >> >> > >> >> Dir. > >> > >> Sorry, English is not my mother tongue. > >> > >> I do understand I cannot combine ORM with Cassandra, because they are > >> totally different ways for building our data model. But I think there > >> are still something can be learnt from ORM to make Cassandra easier to > >> use, just as what ORM did to RDBMS before. > >> > >> IMHO, domain model is still intact when we design our software, hence > >> we need another way to map them to Cassandra's entity model. Relation > >> does not just go away in this case, hence we need another way to > >> express those relations and have a tool to set up Keyspace / > >> ColumnFamily automatically as what django's SYNCDB does. > >> > >> According to my limited experience with Cassandra, now, we do more > >> when we write, and less when we read/query. Hence I think the problem > >> lies exactly in how we duplicate our data to do queries. > >> > >> Please correct me if I got these all wrong. > >> > >> >> > >> >> On Fri, Apr 23, 2010 at 12:12 PM, aXqd wrote: > >> >>> > >> >>> Hi, all: > >> >>> > >> >>> I know many people regard O/R Mapping as rubbish. 
However it is > >> >>> undeniable that ORM is quite easy to use in most simple cases, > >> >>> Meanwhile Cassandra is well known as No-SQL solution, a.k.a. > >> >>> No-Relational solution. > >> >>> So maybe it's weird to combine ORM and Cassandra, right? Is there > >> >>> anything we can take from ORM? > >> >>> I just hate to write CRUD functions/Data layer for each object in > even > >> >>> a disposable prototype program. > >> >>> > >> >>> Regards. > >> >>> -Tian > >> >> > >> >> > >> > > > > > > > > > -- > > Virtually, Ned Wolpert > > > > "Settle thy studies, Faustus, and begin..." --Marlowe > > >
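The hardcoded field-by-field mapping quoted above is exactly what a thin mapping layer can automate with reflection. A minimal sketch in Python (the class, the field list, and the string conversion are made-up illustrations; a real mapper would also handle key generation, typed values, and loading back from Cassandra):

class Article(object):
    def __init__(self, name, label):
        self.name = name
        self.label = label

def to_columns(obj, fields):
    """Turn selected attributes of an object into a {column_name: value} dict."""
    return dict((f, str(getattr(obj, f))) for f in fields)

def from_columns(cls, columns):
    """Rebuild an object from a {column_name: value} dict."""
    return cls(**columns)

cols = to_columns(Article('foo', 'bar'), ['name', 'label'])  # {'name': 'foo', 'label': 'bar'}
obj = from_columns(Article, cols)

The dict produced by to_columns is what would be handed to the client's insert call; nothing here depends on a relational model, which is why this part of ORM carries over to Cassandra even though the relational part does not.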
Re: org.apache.cassandra.dht.OrderPreservingPartitioner Initial Token
this is what IPartitioner does On Mon, Apr 26, 2010 at 10:16 AM, Schubert Zhang wrote: > Hi Jonathan Ellis and Stu Hood, > > I think, finally, we should provide a user customizable key abstract class. > User can define what types of key and its class, which define how to compare > keys. > > Schubert > > On Sat, Apr 24, 2010 at 1:16 PM, Stu Hood wrote: >> >> Your keys cannot be an encoded as binary for OPP, since Cassandra will >> attempt to decode them as UTF-8, meaning that they may not come back in the >> same format. >> >> 0.7 supports byte keys using the ByteOrderedPartitioner, and tokens are >> specified using hex. >> >> -Original Message- >> From: "Mark Jones" >> Sent: Friday, April 23, 2010 10:55am >> To: "user@cassandra.apache.org" >> Subject: RE: org.apache.cassandra.dht.OrderPreservingPartitioner Initial >> Token >> >> So if my keys are binary, is there any way to escape the keysequence in? >> >> I have 20 bytes (any value 0x0-0xff is possible) as the key. >> >> Are they compared as an array of bytes? So that I can use truncation? >> >> 4 nodes, broken up by 0x00, 0x40, 0x80, 0xC0? >> >> >> -Original Message- >> From: Jonathan Ellis [mailto:jbel...@gmail.com] >> Sent: Friday, April 23, 2010 10:22 AM >> To: user@cassandra.apache.org >> Subject: Re: org.apache.cassandra.dht.OrderPreservingPartitioner Initial >> Token >> >> a normal String from the same universe as your keys. >> >> On Fri, Apr 23, 2010 at 7:23 AM, Mark Jones wrote: >> > How is this specified? >> > >> > Is it a large hex #? >> > >> > A string of bytes in hex? >> > >> > >> > >> > http://wiki.apache.org/cassandra/StorageConfiguration doesn't say. >> >> > >
Re: running cassandra as a service on windows
Hi all, Have you tried Tanuki's Java Service Wrapper? It's very easy to deploy on Windows... -aah 2010/4/23, Miguel Verde : > https://issues.apache.org/jira/browse/CASSANDRA-292 points to > http://commons.apache.org/daemon/procrun.html which is used by other Apache > software to implement Windows services in Java. CassandraDaemon conforms to > the Commons Daemon spec. > On Fri, Apr 23, 2010 at 2:20 PM, Jonathan Ellis wrote: > >> you could do it with standard techniques to run java apps as windows >> services. i understand it's a bit painful. >> >> On Fri, Apr 23, 2010 at 2:05 PM, S Ahmed wrote: >> > Is it possible to have Cassandra run in the background on a windows >> server? >> > i.e. as a service so if the server reboots, cassandra will automatically >> > run? >> > I really hate how windows handles services >> > -- Sent from my mobile device
Re: ORM in Cassandra?
I am going to agree with axQd. Having something that does for Cassandra what say, Hibernate does for RDBMS seems an effort well worth pursuing. I have some complex object graphs written in Java. If I could annotate them and get persistence with a well laid out schema. It would be good. On Mon, Apr 26, 2010 at 8:21 AM, Schubert Zhang wrote: > I think you should forget these RDBMS tech. > > > > On Sat, Apr 24, 2010 at 11:00 AM, aXqd wrote: > >> On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert >> wrote: >> > There is nothing wrong with what you are asking. Some work has been done >> to >> > get an ORM layer ontop of cassandra, for example, with a RubyOnRails >> > project. I'm trying to simplify cassandra integration with grails with >> the >> > plugin I'm writing. >> > The problem is ORM solutions to date are wrapping a relational database. >> > (The 'R' in ORM) Cassandra isn't a relational database so it does not >> map >> > cleanly. >> >> Thanks. I noticed this problem before. I just want to know, in the >> first place, what exactly is the right way to model relations in >> Cassandra(a no-relational database). >> So far, I still have those entities, and, without foreign keys, I use >> relational entities, which contains the IDs of both sides of >> relations. >> In some other cases, I just duplicate data, and maintain the relations >> manually by updating all the data in the same time. >> >> Is this the right way to go? Or what I am doing is still trying to >> convert Cassandra to a RDBMS? >> >> > >> > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: >> >> >> >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud >> >> wrote: >> >> > I understand the question more like : Is there already a lib which >> >> > help to get rid of writing hardcoded and hard to maintain lines like >> : >> >> > >> >> > MyClass data; >> >> > String[] myFields = {"name", "label", ...} >> >> > List columns; >> >> > for (String field : myFields) { >> >> >if (field == "name") { >> >> > columns.add(new Column(field, data.getName())) >> >> >} else if (field == "label") { >> >> > columns.add(new Column(field, data.getLabel())) >> >> >} else ... >> >> > } >> >> > (same for loading (instanciating) automagically the object). >> >> >> >> Yes, I am talking about this question. >> >> >> >> > >> >> > Kind regards, >> >> > >> >> > Benoit. >> >> > >> >> > 2010/4/23 dir dir : >> >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is there >> >> >>>anything we can take from ORM? >> >> >> >> >> >> Honestly I do not understand what is your question. It is clear that >> >> >> you can not combine ORM such as Hibernate or iBATIS with Cassandra. >> >> >> Cassandra it self is not a RDBMS, so you will not map the table into >> >> >> the object. >> >> >> >> >> >> Dir. >> >> >> >> Sorry, English is not my mother tongue. >> >> >> >> I do understand I cannot combine ORM with Cassandra, because they are >> >> totally different ways for building our data model. But I think there >> >> are still something can be learnt from ORM to make Cassandra easier to >> >> use, just as what ORM did to RDBMS before. >> >> >> >> IMHO, domain model is still intact when we design our software, hence >> >> we need another way to map them to Cassandra's entity model. Relation >> >> does not just go away in this case, hence we need another way to >> >> express those relations and have a tool to set up Keyspace / >> >> ColumnFamily automatically as what django's SYNCDB does. 
>> >> >> >> According to my limited experience with Cassandra, now, we do more >> >> when we write, and less when we read/query. Hence I think the problem >> >> lies exactly in how we duplicate our data to do queries. >> >> >> >> Please correct me if I got these all wrong. >> >> >> >> >> >> >> >> On Fri, Apr 23, 2010 at 12:12 PM, aXqd wrote: >> >> >>> >> >> >>> Hi, all: >> >> >>> >> >> >>> I know many people regard O/R Mapping as rubbish. However it is >> >> >>> undeniable that ORM is quite easy to use in most simple cases, >> >> >>> Meanwhile Cassandra is well known as No-SQL solution, a.k.a. >> >> >>> No-Relational solution. >> >> >>> So maybe it's weird to combine ORM and Cassandra, right? Is there >> >> >>> anything we can take from ORM? >> >> >>> I just hate to write CRUD functions/Data layer for each object in >> even >> >> >>> a disposable prototype program. >> >> >>> >> >> >>> Regards. >> >> >>> -Tian >> >> >> >> >> >> >> >> > >> > >> > >> > >> > -- >> > Virtually, Ned Wolpert >> > >> > "Settle thy studies, Faustus, and begin..." --Marlowe >> > >> > >
Re: Trying To Understand get_range_slices Results When Using RandomPartitioner
RandomPartioner is for row-keys. #1 no #2 yes #3 yes On Sat, Apr 24, 2010 at 4:33 AM, Larry Root wrote: > I trying to better understand how using the RandomPartitioner will affect > my ability to select ranges of keys. Consider my simple example where we > have many online games across different game genres (GameType). These games > need to store data for each one of their users. With that in mind consider > the following data model: > > enum GameType {'RPG', 'FPS', 'ARCADE'} > > { > "GameData": { // Super Column Family > > *GameType+"1234"*: {// Row (concat gametype with a > game id for example) > *"user-data:5678"*:{// Super column (user data) > *"user_prop_name"*: "value",// Subcolumn (arbitrary user > properties and values) > *"another_prop_name"*: "value", > ... > }, > *"user-data:9012"*:{ > *"**user_prop_name**"*: "value", > ... > } > }, > > * GameType+"3456"*: {...}, > *GameType+"7890"*: {...}, > ... > } > } > > Assume we have a multi node cluster running Cassandra 0.6.1. In that > scenario could some one help me understand what the result would be in the > following cases: > >1. We use a range slice to grab keys for all 'RPG' games (range slice >at the ROW level). Would we be able to get all games back in a single query >or would that not be guaranteed? > >2. For a given game we use a range slice to grab all user-data keys in >which the ID starts with '5' (range slice at the COLUMN level). Again, > would >we be able to get all keys in one call (assuming number of keys in the >result was not an issue)? > >3. Finally for a given game and a given user we do a range slice to >grab all user properties that start with 'a' (range slice at the SUBCOLUMN >level of a SUPERCOLUMN). Is that possible in one call? > > I'm trying to understand at what level the RandomPartioner affects my > example data model. Is it at a fixed level like just ROWS (the sub data is > fixed to the same node) or is all data at every level *randomized* across > all nodes. > > Are there any tricks to doing these sort of range slices using RP? For > example if I set my consistency level to 'ALL' when doing a range slice > would that effectively compile a complete result set for me? > > Thanks for the help! > > larry
Re: ORM in Cassandra?
I don't think you are trying to convert Cassandra to a RDBMS with what you want. The issue is that finding a way to map these objects to Cassandra in a meaningful way is hard. Its not as easy as saying 'do what hibernate does' simply because its not an RDBMS... but it is a reasonable and useful goal. I'm trying to accomplish this myself with the grails Cassandra plugin. On Fri, Apr 23, 2010 at 8:00 PM, aXqd wrote: > On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert > wrote: > > There is nothing wrong with what you are asking. Some work has been done > to > > get an ORM layer ontop of cassandra, for example, with a RubyOnRails > > project. I'm trying to simplify cassandra integration with grails with > the > > plugin I'm writing. > > The problem is ORM solutions to date are wrapping a relational database. > > (The 'R' in ORM) Cassandra isn't a relational database so it does not map > > cleanly. > > Thanks. I noticed this problem before. I just want to know, in the > first place, what exactly is the right way to model relations in > Cassandra(a no-relational database). > So far, I still have those entities, and, without foreign keys, I use > relational entities, which contains the IDs of both sides of > relations. > In some other cases, I just duplicate data, and maintain the relations > manually by updating all the data in the same time. > > Is this the right way to go? Or what I am doing is still trying to > convert Cassandra to a RDBMS? > > > > > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: > >> > >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud > >> wrote: > >> > I understand the question more like : Is there already a lib which > >> > help to get rid of writing hardcoded and hard to maintain lines like : > >> > > >> > MyClass data; > >> > String[] myFields = {"name", "label", ...} > >> > List columns; > >> > for (String field : myFields) { > >> >if (field == "name") { > >> > columns.add(new Column(field, data.getName())) > >> >} else if (field == "label") { > >> > columns.add(new Column(field, data.getLabel())) > >> >} else ... > >> > } > >> > (same for loading (instanciating) automagically the object). > >> > >> Yes, I am talking about this question. > >> > >> > > >> > Kind regards, > >> > > >> > Benoit. > >> > > >> > 2010/4/23 dir dir : > >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is there > >> >>>anything we can take from ORM? > >> >> > >> >> Honestly I do not understand what is your question. It is clear that > >> >> you can not combine ORM such as Hibernate or iBATIS with Cassandra. > >> >> Cassandra it self is not a RDBMS, so you will not map the table into > >> >> the object. > >> >> > >> >> Dir. > >> > >> Sorry, English is not my mother tongue. > >> > >> I do understand I cannot combine ORM with Cassandra, because they are > >> totally different ways for building our data model. But I think there > >> are still something can be learnt from ORM to make Cassandra easier to > >> use, just as what ORM did to RDBMS before. > >> > >> IMHO, domain model is still intact when we design our software, hence > >> we need another way to map them to Cassandra's entity model. Relation > >> does not just go away in this case, hence we need another way to > >> express those relations and have a tool to set up Keyspace / > >> ColumnFamily automatically as what django's SYNCDB does. > >> > >> According to my limited experience with Cassandra, now, we do more > >> when we write, and less when we read/query. 
Hence I think the problem > >> lies exactly in how we duplicate our data to do queries. > >> > >> Please correct me if I got these all wrong. > >> > >> >> > >> >> On Fri, Apr 23, 2010 at 12:12 PM, aXqd wrote: > >> >>> > >> >>> Hi, all: > >> >>> > >> >>> I know many people regard O/R Mapping as rubbish. However it is > >> >>> undeniable that ORM is quite easy to use in most simple cases, > >> >>> Meanwhile Cassandra is well known as No-SQL solution, a.k.a. > >> >>> No-Relational solution. > >> >>> So maybe it's weird to combine ORM and Cassandra, right? Is there > >> >>> anything we can take from ORM? > >> >>> I just hate to write CRUD functions/Data layer for each object in > even > >> >>> a disposable prototype program. > >> >>> > >> >>> Regards. > >> >>> -Tian > >> >> > >> >> > >> > > > > > > > > > -- > > Virtually, Ned Wolpert > > > > "Settle thy studies, Faustus, and begin..." --Marlowe > > > -- Virtually, Ned Wolpert "Settle thy studies, Faustus, and begin..." --Marlowe
Is SuperColumn necessary?
I don't think the SuperColumn is really necessary. I think this level of logic can be left to the application. Do you think so? And if SuperColumn is kept, then as https://issues.apache.org/jira/browse/CASSANDRA-598 points out, we should build indexes both at the SuperColumn level and at the SubColumn level, which means too many levels of index.
Re: Does anybody work about transaction on cassandra ?
Better fault tolerance? Scalability to large data volumes? A combination of ZooKeeper based transactions and Cassandra may have better characteristics than RDBMS on these criteria. There's no question that trade-offs are involved, but as far as these issues are concerned, you'd be starting from a better vantage point than a SPOF relational database. On Apr 26, 2010, at 10:24 AM, Mark Jones wrote: > Orthogonal in this case means “at cross purposes” Transactions can’t really > be done with eventual consistency because all nodes don’t have all the info > at the time the transaction is done. I think they recommend zookeeper for > this kind of stuff, but I don’t know why you want to use Cassandra vs a RDBMS > if you really want transactions. > > From: dir dir [mailto:sikerasa...@gmail.com] > Sent: Saturday, April 24, 2010 12:08 PM > To: user@cassandra.apache.org > Subject: Re: Does anybody work about transaction on cassandra ? > > >Transactions are orthogonal to the design of Cassandra > > Sorry, Would you want to tell me what is an orthogonal mean in this context?? > honestly I do not understand what is it. > > Thank you. > > > On Thu, Apr 22, 2010 at 9:14 PM, Miguel Verde wrote: > No, as far as I know no one is working on transaction support in Cassandra. > Transactions are orthogonal to the design of Cassandra[1][2], although a > system could be designed incorporating Cassandra and other elements a la > Google's MegaStore[3] to support transactions. Google uses Paxos, one might > be able to use Zookeeper[4] to design such a system, but it would be a > daunting task. > > [1] http://www.julianbrowne.com/article/viewer/brewers-cap-theorem > [2] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html > [3] http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx > [4] http://hadoop.apache.org/zookeeper/ > > On Thu, Apr 22, 2010 at 2:56 AM, Jeff Zhang wrote: > Hi all, > > I need transaction support on cassandra, so wondering is anybody work on it ? > > > -- > Best Regards > > Jeff Zhang > >
Re: Is SuperColumn necessary?
I think that once we have built-in indexing (CASSANDRA-749) you can make a good case for dropping supercolumns (at least, dropping them from the public API and reserving them for internal use). On Mon, Apr 26, 2010 at 11:05 AM, Schubert Zhang wrote: > I don't think the SuperColumn is so necessary. > I think this level of logic can be leaved to application. > > Do you think so? > > If SuperColumn is needed, as > https://issues.apache.org/jira/browse/CASSANDRA-598, we should build index > in SuperColumns level and SubColumns level. > Thus, the levels of index is too many. > >
Re: ORM in Cassandra?
Clearly Cassandra is not an RDBMS. The intent of my Hibernate reference was to be more lyrical. Sorry if that didn't come through. Nonetheless, the need remains to relieve ourselves from excessive boilerplate coding. On Mon, Apr 26, 2010 at 9:00 AM, Ned Wolpert wrote: > I don't think you are trying to convert Cassandra to a RDBMS with what you > want. The issue is that finding a way to map these objects to Cassandra in a > meaningful way is hard. Its not as easy as saying 'do what hibernate does' > simply because its not an RDBMS...but it is a reasonable and useful goal. > I'm trying to accomplish this myself with the grails Cassandra plugin. > > > On Fri, Apr 23, 2010 at 8:00 PM, aXqd wrote: > >> On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert >> wrote: >> > There is nothing wrong with what you are asking. Some work has been done >> to >> > get an ORM layer ontop of cassandra, for example, with a RubyOnRails >> > project. I'm trying to simplify cassandra integration with grails with >> the >> > plugin I'm writing. >> > The problem is ORM solutions to date are wrapping a relational database. >> > (The 'R' in ORM) Cassandra isn't a relational database so it does not >> map >> > cleanly. >> >> Thanks. I noticed this problem before. I just want to know, in the >> first place, what exactly is the right way to model relations in >> Cassandra(a no-relational database). >> So far, I still have those entities, and, without foreign keys, I use >> relational entities, which contains the IDs of both sides of >> relations. >> In some other cases, I just duplicate data, and maintain the relations >> manually by updating all the data in the same time. >> >> Is this the right way to go? Or what I am doing is still trying to >> convert Cassandra to a RDBMS? >> >> > >> > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: >> >> >> >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud >> >> wrote: >> >> > I understand the question more like : Is there already a lib which >> >> > help to get rid of writing hardcoded and hard to maintain lines like >> : >> >> > >> >> > MyClass data; >> >> > String[] myFields = {"name", "label", ...} >> >> > List columns; >> >> > for (String field : myFields) { >> >> >if (field == "name") { >> >> > columns.add(new Column(field, data.getName())) >> >> >} else if (field == "label") { >> >> > columns.add(new Column(field, data.getLabel())) >> >> >} else ... >> >> > } >> >> > (same for loading (instanciating) automagically the object). >> >> >> >> Yes, I am talking about this question. >> >> >> >> > >> >> > Kind regards, >> >> > >> >> > Benoit. >> >> > >> >> > 2010/4/23 dir dir : >> >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is there >> >> >>>anything we can take from ORM? >> >> >> >> >> >> Honestly I do not understand what is your question. It is clear that >> >> >> you can not combine ORM such as Hibernate or iBATIS with Cassandra. >> >> >> Cassandra it self is not a RDBMS, so you will not map the table into >> >> >> the object. >> >> >> >> >> >> Dir. >> >> >> >> Sorry, English is not my mother tongue. >> >> >> >> I do understand I cannot combine ORM with Cassandra, because they are >> >> totally different ways for building our data model. But I think there >> >> are still something can be learnt from ORM to make Cassandra easier to >> >> use, just as what ORM did to RDBMS before. >> >> >> >> IMHO, domain model is still intact when we design our software, hence >> >> we need another way to map them to Cassandra's entity model. 
Relation >> >> does not just go away in this case, hence we need another way to >> >> express those relations and have a tool to set up Keyspace / >> >> ColumnFamily automatically as what django's SYNCDB does. >> >> >> >> According to my limited experience with Cassandra, now, we do more >> >> when we write, and less when we read/query. Hence I think the problem >> >> lies exactly in how we duplicate our data to do queries. >> >> >> >> Please correct me if I got these all wrong. >> >> >> >> >> >> >> >> On Fri, Apr 23, 2010 at 12:12 PM, aXqd wrote: >> >> >>> >> >> >>> Hi, all: >> >> >>> >> >> >>> I know many people regard O/R Mapping as rubbish. However it is >> >> >>> undeniable that ORM is quite easy to use in most simple cases, >> >> >>> Meanwhile Cassandra is well known as No-SQL solution, a.k.a. >> >> >>> No-Relational solution. >> >> >>> So maybe it's weird to combine ORM and Cassandra, right? Is there >> >> >>> anything we can take from ORM? >> >> >>> I just hate to write CRUD functions/Data layer for each object in >> even >> >> >>> a disposable prototype program. >> >> >>> >> >> >>> Regards. >> >> >>> -Tian >> >> >> >> >> >> >> >> > >> > >> > >> > >> > -- >> > Virtually, Ned Wolpert >> > >> > "Settle thy studies, Faustus, and begin..." --Marlowe >> > >> > > > > -- > Virtually, Ned Wolpert > > "Settle thy studies, Faustus, and begin..." --Marlowe >
Re: Can Cassandra make real use of several DataFileDirectories?
I would recommend using RAID-0 rather that multiple data directories. -ryan 2010/4/26 Roland Hänel : > I have a configuration like this: > > > /storage01/cassandra/data > /storage02/cassandra/data > /storage03/cassandra/data > > > After loading a big chunk of data into cassandra, I end up wich some 70GB in > the first directory, and only about 10GB in the second and third one. All > rows are quite small, so it's not just some big rows that contain the > majority of data. > > Does Cassandra have the ability to 'see' the maximum available space in > these directory? I'm asking myself this question since my limit is 100GB, > and the first directory is approaching this limit... > > And, wouldn't it be better if Cassandra tried to 'load-balance' the files > inside the directories because this will result in better (read) performance > if the directories are on different disks (which is the case for me)? > > Any help is appreciated. > > Roland > >
Re: How do you construct an index and use it, especially in Ruby
On Sun, Apr 25, 2010 at 11:14 AM, Bob Hutchison wrote: > > Hi, > > I'm new to Cassandra and trying to work out how to do something that I've > implemented any number of times (e.g. TokyoCabinet, Perst, even the > filesystem using grep :-) I've managed to get some of this working in > Cassandra but not all. > > So here's the core of the situation. > > I have this opaque chunk of data that I want to store in Cassandra and then > find it again. > > I can generate a key when the data is created very easily, and I've stored it > in a straight forward manner: in a column with a key whose value is the data. > And I can retrieve it when I know the key. No difficulties here at all, works > fine. > > Now I want to index this data taking what I imagine to be a pretty typical > approach. > > Lets say there's two many-to-one indexes: 'colour', and 'size'. Each colour > value will have more than one chunk of data, same for size. > > What I thought I'd do is make a super column and index the chunk of data kind > of like: { 'colour' => { 'blue' => 1 }, 'size' => { 'large' => 1}} with the > key equal to the key of the chunk of data. And Cassandra stores it without > error like that. So using the Ruby gem, it'd be something along the lines of: > > cassandra.insert(:Indexes, key-of-the-chunk-of-data, { 'colour' => { 'blue' > => 1 }, 'size' => { 'large' => 1 } }) > > Q1: is this a reasonable approach? It *seems* to be what I've read is > supposed to be done. The 1 is meaningless. Anyway, it executes without error > in Ruby. No. In order to index your data, you need to invert it. Since you're working in ruby I'd recommend CassandraObject: http://github.com/nzKoz/cassandra_object. It has indexing built in. -ryan > Q2: what is the syntax of the (Ruby) query to find the keys of all 'blue' > chunks of data? I'm assuming get_range is the correct method, but what are > the parameters? The docs say: get_range(column_family, options={}) but that > seems to be missing a bit of detail, in particular the super column name. > > Q2a: So I know there's a :start and :finish key supported in the options > hash, inclusive, exclusive respectively. How do you define a range for equals > with a UTF8 key? Surely not 'blue'.succ?? or by some kind of suffix?? > > Q2b: How do you specify the super column name 'colour'? Looking at the (Ruby) > source of the get_range method and I'm unconvinced that this is implemented > (seems to be a constant '' used where the super column name makes sense to > be.) > > Anyway I ended up hacking at the Ruby gem's source to use the column name > where the '' was in the original, and didn't really get anywhere useful (I > can find nothing, or everything, nothing in between). > > Q3: If I am correct about what is supposed to be done, does the Ruby gem > support it? > > Q4: Does anyone know of some Ruby code that does and indexed lookup that they > could point me at. (lots of code that indexes but nothing that searches by > the index) > > I'll try to take a look at some of the other Cassandra client implementations > and see if I can get this model to work. Maybe just a Ruby problem?? With any > luck, it'll be me messing up. > > If it'd help I can post the source of what I have, but it'll need some > cleanup. Let me know. > > Thanks for taking the time to read this far :-) > > Bob > > > Bob Hutchison > Recursive Design Inc. > http://www.recursive.ca/ > weblog: http://xampl.com/so > > > > Bob Hutchison > Recursive Design Inc. > http://www.recursive.ca/ > weblog: http://xampl.com/so > > > > >
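To illustrate what "invert it" means in practice: instead of writing { 'colour' => { 'blue' => 1 } } under the data chunk's key, you write the data chunk's key as a column name under a row keyed by the indexed value. A small sketch (in Python for brevity even though the thread uses the Ruby gem; the column family and row-key naming are assumptions):

def index_entries(data_key, attrs):
    """Yield (index_row_key, column_name) pairs to write alongside the data row."""
    for field, value in attrs.items():
        yield ('%s:%s' % (field, value), data_key)

for row_key, column_name in index_entries('chunk-42', {'colour': 'blue', 'size': 'large'}):
    pass  # hypothetical Ruby-gem call: cassandra.insert(:Indexes, row_key, { column_name => '' })

Looking up all 'blue' chunks is then a single-row read of the row 'colour:blue', whose column names are the matching data keys; no key-range scan is needed, so it works under RandomPartitioner. Updates and deletes have to maintain these index rows as well (remove the old column, add the new one), since Cassandra 0.6 does not do that for you.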
cassandra 0.5.1 java.lang.OutOfMemoryError: Java heap space issue
Hello. I have a six node cassandra cluster running on modest hardware with 1G of heap assigned to cassandra. After inserting about 245 million rows of data, cassandra failed with a java.lang.OutOfMemoryError: Java heap space error. I raised the java heap to 2G, but still get the same error when trying to restart cassandra. I am using Cassandra 0.5.1 with Sun jre1.6.0_18. Any thoughts on how to resolve this issue are greatly appreciated. Here are log excerpts from two of the nodes:

DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 SliceQueryFilter.java (line 116) collecting SuperColumn(dcf9f19e [0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 SliceQueryFilter.java (line 116) collecting SuperColumn(dd04bf9c [0a011d0c,0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 SliceQueryFilter.java (line 116) collecting SuperColumn(dd08981a [0a011d0c,0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 SliceQueryFilter.java (line 116) collecting SuperColumn(dd7f7ac9 [0a011d0c,0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 SliceQueryFilter.java (line 116) collecting SuperColumn(dde1d4cf [0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 SliceQueryFilter.java (line 116) collecting SuperColumn(de32aec3 [0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 SliceQueryFilter.java (line 116) collecting SuperColumn(de378105 [0a011d0c,0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 SliceQueryFilter.java (line 116) collecting SuperColumn(deb5d591 [0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 SliceQueryFilter.java (line 116) collecting SuperColumn(ded75dee [0a011d0c,0a011d0d,])
DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 SliceQueryFilter.java (line 116) collecting SuperColumn(defe3445 [0a011d0c,0a011d0d,])
INFO [FLUSH-TIMER] 2010-04-23 16:20:00,071 ColumnFamilyStore.java (line 393) IpTag has reached its threshold; switching in a fresh Memtable
INFO [FLUSH-TIMER] 2010-04-23 16:20:00,072 ColumnFamilyStore.java (line 1035) Enqueuing flush of Memtable(IpTag)@7816
INFO [FLUSH-SORTER-POOL:1] 2010-04-23 16:20:00,072 Memtable.java (line 183) Sorting Memtable(IpTag)@7816
INFO [FLUSH-WRITER-POOL:1] 2010-04-23 16:20:00,107 Memtable.java (line 192) Writing Memtable(IpTag)@7816
DEBUG [Timer-0] 2010-04-23 16:20:00,130 LoadDisseminator.java (line 39) Disseminating load info ...
ERROR [ROW-MUTATION-STAGE:41] 2010-04-23 16:20:00,348 CassandraDaemon.java (line 71) Fatal exception in thread Thread[ROW-MUTATION-STAGE:41,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at java.lang.StringBuilder.toString(Unknown Source)
        at org.apache.cassandra.db.marshal.AbstractType.getColumnsString(AbstractType.java:87)
        at org.apache.cassandra.db.ColumnFamily.toString(ColumnFamily.java:344)
        at org.apache.commons.lang.ObjectUtils.toString(ObjectUtils.java:241)
        at org.apache.commons.lang.StringUtils.join(StringUtils.java:3073)
        at org.apache.commons.lang.StringUtils.join(StringUtils.java:3133)
        at org.apache.cassandra.db.RowMutation.toString(RowMutation.java:263)
        at java.lang.String.valueOf(Unknown Source)
        at java.lang.StringBuilder.append(Unknown Source)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:46)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:38)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
---
DEBUG [main] 2010-04-23 17:15:45,501 CommitLog.java (line 312) Reading mutation at 57527476
DEBUG [main] 2010-04-23 17:16:11,375 CommitLog.java (line 340) replaying mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5c0,])}
DEBUG [main] 2010-04-23 17:16:45,293 CommitLog.java (line 312) Reading mutation at 57527686
DEBUG [main] 2010-04-23 17:16:45,294 CommitLog.java (line 340) replaying mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5fb,])}
DEBUG [main] 2010-04-23 17:16:54,311 CommitLog.java (line 312) Reading mutation at 57527919
DEBUG [main] 2010-04-23 17:17:46,344 CommitLog.java (line 340) replaying mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5fb,])}
DEBUG [main] 2010-04-23 17:17:55,530 CommitLog.java (line 312) Reading mutation at 57528129
DEBUG [main] 2010-04-23 17:18:20,266 CommitLog.java (line 340) replaying mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c607,])}
DEBUG [main] 2010-04-23 17:18:38,273 CommitLog.java (line 312) Reading mutation at 57528362
DEBUG [main] 2010-04-23 17:21:53,966 CommitLog.java (line 340) replaying mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c607,])}
DEBUG [mai
Cassandra Job in Pasadena
Hi, OpenX is looking for someone to work fulltime on Cassandra, we're located in Pasadena, CA. Here's a link to the job description http://www.openx.org/jobs/position/software-engineer-infrastructure We've been running cassandra in production since 0.3.0, and currently have 3 cassandra clusters. Feel free to email me offlist any questions you might have, and if you are interested please send your resume. Thanks, -Anthony -- Anthony Molinaro
Re: ORM in Cassandra?
On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: > Clearly Cassandra is not an RDBMS. The intent of my Hibernate > reference was to be more lyrical. Sorry if that didn't come through. > Nonetheless, the need remains to relieve ourselves from excessive > boilerplate coding. I agree with eliminating boilerplate code. Chris Shorrock wrote a simple object mapper in Scala for his Cascal Cassandra client. You may want to check out the wiki on GitHub (http://wiki.github.com/shorrockin/cascal/). In my opinion, a mapping solution for Cassandra should be more like a Template. Something that helps map (back and forth) rows to objects, columns to properties, etc. Since the data model can vary so much depending on data access patters, any overly structured approach that prescribes a particular schema will be of limited use. If you're from the Java world, think of iBATIS vs. Hibernate. > > On Mon, Apr 26, 2010 at 9:00 AM, Ned Wolpert = wrote: > I don't think you are trying to convert Cassandra to a RDBMS with what = you want. The issue is that finding a way to map these objects to = Cassandra in a meaningful way is hard. Its not as easy as saying 'do = what hibernate does' simply because its not an RDBMS...but it is a = reasonable and useful goal. I'm trying to accomplish this myself with = the grails Cassandra plugin. >=20 >=20 > On Fri, Apr 23, 2010 at 8:00 PM, aXqd wrote: > On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert = wrote: > > There is nothing wrong with what you are asking. Some work has been = done to > > get an ORM layer ontop of cassandra, for example, with a RubyOnRails > > project. I'm trying to simplify cassandra integration with grails = with the > > plugin I'm writing. > > The problem is ORM solutions to date are wrapping a relational = database. > > (The 'R' in ORM) Cassandra isn't a relational database so it does = not map > > cleanly. >=20 > Thanks. I noticed this problem before. I just want to know, in the > first place, what exactly is the right way to model relations in > Cassandra(a no-relational database). > So far, I still have those entities, and, without foreign keys, I use > relational entities, which contains the IDs of both sides of > relations. > In some other cases, I just duplicate data, and maintain the relations > manually by updating all the data in the same time. >=20 > Is this the right way to go? Or what I am doing is still trying to > convert Cassandra to a RDBMS? >=20 > > > > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: > >> > >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud = > >> wrote: > >> > I understand the question more like : Is there already a lib = which > >> > help to get rid of writing hardcoded and hard to maintain lines = like : > >> > > >> > MyClass data; > >> > String[] myFields =3D {"name", "label", ...} > >> > List columns; > >> > for (String field : myFields) { > >> >if (field =3D=3D "name") { > >> > columns.add(new Column(field, data.getName())) > >> >} else if (field =3D=3D "label") { > >> > columns.add(new Column(field, data.getLabel())) > >> >} else ... > >> > } > >> > (same for loading (instanciating) automagically the object). > >> > >> Yes, I am talking about this question. > >> > >> > > >> > Kind regards, > >> > > >> > Benoit. > >> > > >> > 2010/4/23 dir dir : > >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is = there > >> >>>anything we can take from ORM? > >> >> > >> >> Honestly I do not understand what is your question. 
It is clear = that > >> >> you can not combine ORM such as Hibernate or iBATIS with = Cassandra. > >> >> Cassandra it self is not a RDBMS, so you will not map the table = into > >> >> the object. > >> >> > >> >> Dir. > >> > >> Sorry, English is not my mother tongue. > >> > >> I do understand I cannot combine ORM with Cassandra, because they = are > >> totally different ways for building our data model. But I think = there > >> are still something can be learnt from ORM to make Cassandra easier = to > >> use, just as what ORM did to RDBMS before. > >> > >> IMHO, domain model is still intact when we design our software, = hence > >> we need another way to map them to Cassandra's entity model. = Relation > >> does not just go away in this case, hence we need another way to > >> express those relations and have a tool to set up Keyspace / > >> ColumnFamily automatically as what django's SYNCDB does. > >> > >> According to my limited experience with Cassandra, now, we do more > >> when we write, and less when we read/query. Hence I think the = problem > >> lies exactly in how we duplicate our data to do queries. > >> > >> Please correct me if I got these all wrong. > >> > >> >> > >> >> On Fri, Apr 23, 2010 at 12:12 PM, aXqd = wrote: > >> >>> > >> >>> Hi, all: > >> >>> > >> >>> I know many people regard O/R Mapping as rubbish. However it is > >> >>> undeniable that ORM is quite easy to use in most simple cases, > >> >>> Meanwhile Cassandra is well known as No-SQL solution, a.k.a. >
Re: ORM in Cassandra?
There is, of course, also cassandra_object on the ruby side. I assume this thread has the implicit requirement of Java, though. -- Jeff On Mon, Apr 26, 2010 at 10:26 AM, Isaac Arias wrote: > On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: > >> Clearly Cassandra is not an RDBMS. The intent of my Hibernate >> reference was to be more lyrical. Sorry if that didn't come through. > >> Nonetheless, the need remains to relieve ourselves from excessive >> boilerplate coding. > > I agree with eliminating boilerplate code. Chris Shorrock wrote a > simple object mapper in Scala for his Cascal Cassandra client. You may > want to check out the wiki on GitHub > (http://wiki.github.com/shorrockin/cascal/). > > In my opinion, a mapping solution for Cassandra should be more like a > Template. Something that helps map (back and forth) rows to objects, > columns to properties, etc. Since the data model can vary so much > depending on data access patters, any overly structured approach that > prescribes a particular schema will be of limited use. > > If you're from the Java world, think of iBATIS vs. Hibernate. > > >> >> On Mon, Apr 26, 2010 at 9:00 AM, Ned Wolpert = > wrote: >> I don't think you are trying to convert Cassandra to a RDBMS with what = > you want. The issue is that finding a way to map these objects to = > Cassandra in a meaningful way is hard. Its not as easy as saying 'do = > what hibernate does' simply because its not an RDBMS...but it is a = > reasonable and useful goal. I'm trying to accomplish this myself with = > the grails Cassandra plugin. >>=20 >>=20 >> On Fri, Apr 23, 2010 at 8:00 PM, aXqd wrote: >> On Sat, Apr 24, 2010 at 1:36 AM, Ned Wolpert = > wrote: >> > There is nothing wrong with what you are asking. Some work has been = > done to >> > get an ORM layer ontop of cassandra, for example, with a RubyOnRails >> > project. I'm trying to simplify cassandra integration with grails = > with the >> > plugin I'm writing. >> > The problem is ORM solutions to date are wrapping a relational = > database. >> > (The 'R' in ORM) Cassandra isn't a relational database so it does = > not map >> > cleanly. >>=20 >> Thanks. I noticed this problem before. I just want to know, in the >> first place, what exactly is the right way to model relations in >> Cassandra(a no-relational database). >> So far, I still have those entities, and, without foreign keys, I use >> relational entities, which contains the IDs of both sides of >> relations. >> In some other cases, I just duplicate data, and maintain the relations >> manually by updating all the data in the same time. >>=20 >> Is this the right way to go? Or what I am doing is still trying to >> convert Cassandra to a RDBMS? >>=20 >> > >> > On Fri, Apr 23, 2010 at 1:29 AM, aXqd wrote: >> >> >> >> On Fri, Apr 23, 2010 at 3:03 PM, Benoit Perroud = > >> >> wrote: >> >> > I understand the question more like : Is there already a lib = > which >> >> > help to get rid of writing hardcoded and hard to maintain lines = > like : >> >> > >> >> > MyClass data; >> >> > String[] myFields =3D {"name", "label", ...} >> >> > List columns; >> >> > for (String field : myFields) { >> >> > if (field =3D=3D "name") { >> >> > columns.add(new Column(field, data.getName())) >> >> > } else if (field =3D=3D "label") { >> >> > columns.add(new Column(field, data.getLabel())) >> >> > } else ... >> >> > } >> >> > (same for loading (instanciating) automagically the object). >> >> >> >> Yes, I am talking about this question. 
>> >> >> >> > >> >> > Kind regards, >> >> > >> >> > Benoit. >> >> > >> >> > 2010/4/23 dir dir : >> >> >>>So maybe it's weird to combine ORM and Cassandra, right? Is = > there >> >> >>>anything we can take from ORM? >> >> >> >> >> >> Honestly I do not understand what is your question. It is clear = > that >> >> >> you can not combine ORM such as Hibernate or iBATIS with = > Cassandra. >> >> >> Cassandra it self is not a RDBMS, so you will not map the table = > into >> >> >> the object. >> >> >> >> >> >> Dir. >> >> >> >> Sorry, English is not my mother tongue. >> >> >> >> I do understand I cannot combine ORM with Cassandra, because they = > are >> >> totally different ways for building our data model. But I think = > there >> >> are still something can be learnt from ORM to make Cassandra easier = > to >> >> use, just as what ORM did to RDBMS before. >> >> >> >> IMHO, domain model is still intact when we design our software, = > hence >> >> we need another way to map them to Cassandra's entity model. = > Relation >> >> does not just go away in this case, hence we need another way to >> >> express those relations and have a tool to set up Keyspace / >> >> ColumnFamily automatically as what django's SYNCDB does. >> >> >> >> According to my limited experience with Cassandra, now, we do more >> >> when we write, and less when we read/query. Hence I think the = > problem >> >> lies exactly in how we duplicate our data to do queries. >> >> >> >> Please corr
Re: ORM in Cassandra?
On 04/26/2010 01:26 PM, Isaac Arias wrote: On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: Clearly Cassandra is not an RDBMS. The intent of my Hibernate reference was to be more lyrical. Sorry if that didn't come through. Nonetheless, the need remains to relieve ourselves from excessive boilerplate coding. I agree with eliminating boilerplate code. Chris Shorrock wrote a simple object mapper in Scala for his Cascal Cassandra client. You may want to check out the wiki on GitHub (http://wiki.github.com/shorrockin/cascal/). In my opinion, a mapping solution for Cassandra should be more like a Template. Something that helps map (back and forth) rows to objects, columns to properties, etc. Since the data model can vary so much depending on data access patters, any overly structured approach that prescribes a particular schema will be of limited use. For what it's worth, this is exactly my opinion after looking at the problem for a bit, and I'm actively developing such a solution in Ruby. I spent some time playing with the CassandraObject project, but felt that despite all the good work that went in there, it didn't feel to me like it fit the problem space in an idiomatic manner. No criticism intended there; it seems to lean a little more towards a very structured schema, with less flexibility for things like collection attributes the members of which all have a key that matches a pattern (which is a use case we have). So, for my approach, there's one project that gives metaprogramming semantics for building the mapping behavior you describe: build classes that are oriented towards mapping between simple JSON-like structures and full-blown business objects. And a separate project that layers Cassandra specifics on top of that underlying mapper tool. The rub being: it's for a client, and we're collectively sorting out the details for releasing the code in some useful, public manner. But hopefully I'll get something useful out there for potential Ruby enthusiasts before too long. Hopefully a week or two. Thanks. - Ethan -- Ethan Rowe End Point Corporation et...@endpoint.com
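For readers trying to picture the difference between such a "template" style mapper and a full ORM, here is a deliberately tiny Python sketch of the row-to-object / column-to-property mapping Isaac and Ethan describe. All names are invented for illustration, and it enforces no schema beyond the declared mapping.

    class ColumnMap:
        """Maps between a row (dict of column name -> value) and an object's attributes."""
        def __init__(self, **columns):              # e.g. name='display_name'
            self.columns = columns

        def to_object(self, cls, row):
            obj = cls()
            for column, attr in self.columns.items():
                setattr(obj, attr, row.get(column))
            return obj

        def to_row(self, obj):
            return dict((column, getattr(obj, attr, None))
                        for column, attr in self.columns.items())

    class User(object):
        pass

    user_map = ColumnMap(name="display_name", email="email")
    user = user_map.to_object(User, {"name": "Bob", "email": "bob@example.com"})
    assert user.display_name == "Bob"
    assert user_map.to_row(user)["name"] == "Bob"

A Cassandra-specific layer would then only need to translate between this row dict and the Thrift Column structs, which is the kind of layering described above.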
Re: ORM in Cassandra?
The real tragedy is that we have not created a new acronym for this yet... OKVM... it makes more sense... On Mon, Apr 26, 2010 at 10:35 AM, Ethan Rowe wrote: > On 04/26/2010 01:26 PM, Isaac Arias wrote: > >> On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: >> >> >> >>> Clearly Cassandra is not an RDBMS. The intent of my Hibernate >>> reference was to be more lyrical. Sorry if that didn't come through. >>> >>> >> >> >>> Nonetheless, the need remains to relieve ourselves from excessive >>> boilerplate coding. >>> >>> >> I agree with eliminating boilerplate code. Chris Shorrock wrote a >> simple object mapper in Scala for his Cascal Cassandra client. You may >> want to check out the wiki on GitHub >> (http://wiki.github.com/shorrockin/cascal/). >> >> In my opinion, a mapping solution for Cassandra should be more like a >> Template. Something that helps map (back and forth) rows to objects, >> columns to properties, etc. Since the data model can vary so much >> depending on data access patters, any overly structured approach that >> prescribes a particular schema will be of limited use. >> >> > > For what it's worth, this is exactly my opinion after looking at the > problem for a bit, and I'm actively developing such a solution in Ruby. I > spent some time playing with the CassandraObject project, but felt that > despite all the good work that went in there, it didn't feel to me like it > fit the problem space in an idiomatic manner. No criticism intended there; > it seems to lean a little more towards a very structured schema, with less > flexibility for things like collection attributes the members of which all > have a key that matches a pattern (which is a use case we have). > > So, for my approach, there's one project that gives metaprogramming > semantics for building the mapping behavior you describe: build classes that > are oriented towards mapping between simple JSON-like structures and > full-blown business objects. And a separate project that layers Cassandra > specifics on top of that underlying mapper tool. > > The rub being: it's for a client, and we're collectively sorting out the > details for releasing the code in some useful, public manner. But hopefully > I'll get something useful out there for potential Ruby enthusiasts before > too long. Hopefully a week or two. > > Thanks. > - Ethan > > -- > Ethan Rowe > End Point Corporation > et...@endpoint.com > >
Re: Can Cassandra make real use of several DataFileDirectories?
Ryan - You (or maybe someone else) mentioned using RAID-0 instead of multiple data directories at the Cassandra hackathon as well. Could you explain the motivation behind that? Thanks, Edmond On Mon, Apr 26, 2010 at 9:53 AM, Ryan King wrote: > I would recommend using RAID-0 rather that multiple data directories. > > -ryan > > 2010/4/26 Roland Hänel : >> I have a configuration like this: >> >> >> /storage01/cassandra/data >> /storage02/cassandra/data >> /storage03/cassandra/data >> >> >> After loading a big chunk of data into cassandra, I end up wich some 70GB in >> the first directory, and only about 10GB in the second and third one. All >> rows are quite small, so it's not just some big rows that contain the >> majority of data. >> >> Does Cassandra have the ability to 'see' the maximum available space in >> these directory? I'm asking myself this question since my limit is 100GB, >> and the first directory is approaching this limit... >> >> And, wouldn't it be better if Cassandra tried to 'load-balance' the files >> inside the directories because this will result in better (read) performance >> if the directories are on different disks (which is the case for me)? >> >> Any help is appreciated. >> >> Roland >> >> >
Re: Can Cassandra make real use of several DataFileDirectories?
http://wiki.apache.org/cassandra/CassandraHardware On Mon, Apr 26, 2010 at 1:06 PM, Edmond Lau wrote: > Ryan - > > You (or maybe someone else) mentioned using RAID-0 instead of multiple > data directories at the Cassandra hackathon as well. Could you > explain the motivation behind that? > > Thanks, > Edmond > > On Mon, Apr 26, 2010 at 9:53 AM, Ryan King wrote: >> I would recommend using RAID-0 rather that multiple data directories. >> >> -ryan >> >> 2010/4/26 Roland Hänel : >>> I have a configuration like this: >>> >>> >>> /storage01/cassandra/data >>> /storage02/cassandra/data >>> /storage03/cassandra/data >>> >>> >>> After loading a big chunk of data into cassandra, I end up wich some 70GB in >>> the first directory, and only about 10GB in the second and third one. All >>> rows are quite small, so it's not just some big rows that contain the >>> majority of data. >>> >>> Does Cassandra have the ability to 'see' the maximum available space in >>> these directory? I'm asking myself this question since my limit is 100GB, >>> and the first directory is approaching this limit... >>> >>> And, wouldn't it be better if Cassandra tried to 'load-balance' the files >>> inside the directories because this will result in better (read) performance >>> if the directories are on different disks (which is the case for me)? >>> >>> Any help is appreciated. >>> >>> Roland >>> >>> >> >
Re: range get over subcolumns on supercolumn family
Just found the way... keyRange start and end key will be the same and instead of specifying the count and start on KeyRange it has to be specified on SliceRange and then keySlices will come with a single key and a list of columns... 2010/4/25 Rafael Ribeiro > Hi all! > > I am trying to do a paginated query on the subcolumns of a superfamily > column but sincerely I am a little bit confused. > I have already been able to do a range query but only over the keys of a > regular column family. > For the keys case I've been able to do so using the code below: > > KeyRange keyRange = new KeyRange(count); > keyRange.setStart_key(startKey); > keyRange.setEnd_key(""); > > SliceRange range = new SliceRange(); > range.setStart(new byte[] {}); > range.setFinish(new byte[] {}); > > SlicePredicate predicate = new SlicePredicate(); > predicate.setSlice_range(range); > > ColumnParent cp = new ColumnParent("ColumnFamily"); > > List keySlices = client.get_range_slices("Keyspace", > cp, predicate, keyRange, ConsistencyLevel.ALL); > > Is there any way I can do a similar approach to do the range query on the > subcolumns? Would I need to do some trick over ColumnParent? I tried setting > the supercolumn attribute but with no success (sincerely I knew it wont work > but it was worth trying). Only to clarify a little bit... I am still > exercising what is possible to do with Cassandra and I was willing to store > a key over a supercolumnfamily with uuid keys under it so I could scan it > using an ordering scheme but without loading the whole data under the top > level key. > > best regards, > Rafael Ribeiro > >
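A sketch of what Rafael describes, in the same Python/Thrift style used later in this thread: pin the KeyRange to a single key and drive pagination from the SliceRange instead. The 'Super1' column family and the key names are placeholders, and whether super_column belongs on the ColumnParent depends on whether you are paging the sub-columns of one super column or the super columns themselves.

    from cassandra import ttypes

    def page_subcolumns(conn, key, super_column, start_column="", page_size=100):
        parent = ttypes.ColumnParent(column_family="Super1", super_column=super_column)
        predicate = ttypes.SlicePredicate(slice_range=ttypes.SliceRange(
            start=start_column, finish="", count=page_size))
        # start_key == end_key, so exactly one KeySlice comes back; pagination is
        # driven entirely by the SliceRange start column and count.
        key_range = ttypes.KeyRange(start_key=key, end_key=key, count=1)
        slices = conn.get_range_slices("Keyspace1", parent, predicate, key_range,
                                       ttypes.ConsistencyLevel.QUORUM)
        return slices[0].columns if slices else []

To fetch the next page, pass the last column name you received as start_column (it will be returned again, so skip the first element).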
Re: Can Cassandra make real use of several DataFileDirectories?
Hm... I understand that RAID0 would help to create a bigger pool for compactions. However, it might impact read performance: if I have several CF's (with their SSTables), random read requests for the CF files that are on separate disks will behave nicely - however if it's RAID0 then a random read on any file will create a random read on all of the hard disks. Correct? -Roland 2010/4/26 Jonathan Ellis > http://wiki.apache.org/cassandra/CassandraHardware > > On Mon, Apr 26, 2010 at 1:06 PM, Edmond Lau wrote: > > Ryan - > > > > You (or maybe someone else) mentioned using RAID-0 instead of multiple > > data directories at the Cassandra hackathon as well. Could you > > explain the motivation behind that? > > > > Thanks, > > Edmond > > > > On Mon, Apr 26, 2010 at 9:53 AM, Ryan King wrote: > >> I would recommend using RAID-0 rather that multiple data directories. > >> > >> -ryan > >> > >> 2010/4/26 Roland Hänel : > >>> I have a configuration like this: > >>> > >>> > >>> /storage01/cassandra/data > >>> /storage02/cassandra/data > >>> /storage03/cassandra/data > >>> > >>> > >>> After loading a big chunk of data into cassandra, I end up wich some > 70GB in > >>> the first directory, and only about 10GB in the second and third one. > All > >>> rows are quite small, so it's not just some big rows that contain the > >>> majority of data. > >>> > >>> Does Cassandra have the ability to 'see' the maximum available space in > >>> these directory? I'm asking myself this question since my limit is > 100GB, > >>> and the first directory is approaching this limit... > >>> > >>> And, wouldn't it be better if Cassandra tried to 'load-balance' the > files > >>> inside the directories because this will result in better (read) > performance > >>> if the directories are on different disks (which is the case for me)? > >>> > >>> Any help is appreciated. > >>> > >>> Roland > >>> > >>> > >> > > >
Re: Can Cassandra make real use of several DataFileDirectories?
2010/4/26 Roland Hänel : > Hm... I understand that RAID0 would help to create a bigger pool for > compactions. However, it might impact read performance: if I have several > CF's (with their SSTables), random read requests for the CF files that are > on separate disks will behave nicely - however if it's RAID0 then a random > read on any file will create a random read on all of the hard disks. > Correct? Without RAID0 you will end up with hot spots (a compaction could end up putting a large SSTable on one disk, while the others have smaller SSTables). If you have many CFs this might average out, but it might not and there are no guarantees here. I'd recommend RAID0 unless you have reason to do something else. -ryan
Re: ORM in Cassandra?
On Mon, Apr 26, 2010 at 10:35 AM, Ethan Rowe wrote: > On 04/26/2010 01:26 PM, Isaac Arias wrote: >> >> On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: >> ... >> In my opinion, a mapping solution for Cassandra should be more like a >> Template. Something that helps map (back and forth) rows to objects, >> columns to properties, etc. Since the data model can vary so much >> depending on data access patters, any overly structured approach that >> prescribes a particular schema will be of limited use. >> > For what it's worth, this is exactly my opinion after looking at the problem > for a bit, and I'm actively developing such a solution in Ruby. I spent ... > So, for my approach, there's one project that gives metaprogramming > semantics for building the mapping behavior you describe: build classes that > are oriented towards mapping between simple JSON-like structures and > full-blown business objects. And a separate project that layers Cassandra > specifics on top of that underlying mapper tool. +1 I think proper layering is the way to go: it makes problem (of simple construction of services that use Cassandra as the storage system) much easier to solve, divide and conquer. There are pretty decent OJM/OXM solutions that are mostly orthogonal wrt distributed storage part. I understand that there are some trade-offs (some things are easiest to optimize when Cassandra core handles them), but flexibility and best-tool-for-the-job have their benefits too. -+ Tatu +-
Cassandra cluster runs into OOM when bulk loading data
I have a cluster of 5 machines building a Cassandra datastore, and I load bulk data into this using the Java Thrift API. The first ~250GB runs fine, then one of the nodes starts to throw OutOfMemory exceptions. I'm not using any row or index caches, and since I only have 5 CF's and some 2.5 GB of RAM allocated to the JVM (-Xmx2500M), in theory, that shouldn't happen. All inserts are done with consistency level ALL. I hope with this I have avoided all the 'usual dummy errors' that lead to OOM's. I have begun to troubleshoot the issue with JMX, however, it's difficult to catch the JVM in the right moment because it runs well for several hours before this thing happens. One thing comes to my mind, maybe one of the experts could confirm or reject this idea for me: is it possible that when one machine slows down a little bit (for example because a big compaction is going on), the memtables don't get flushed to disk as fast as they are building up under the continuing bulk import? That would result in a downward spiral: the system gets slower and slower on disk I/O, but since more and more data arrives over Thrift, finally OOM. I'm using the "periodic" commit log sync, maybe also this could create a situation where the commit log writer is too slow to catch up with the data intake, resulting in ever growing memory usage? Maybe these thoughts are just bullshit. Let me know if so... ;-)
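One client-side note on bulk loads like this (not an answer to the server-side question): the usual defence is to back off when a node signals it cannot keep up, i.e. on TimedOutException or UnavailableException, rather than continuing to push at full speed. A rough sketch reusing the Thrift insert() call; the retry/backoff policy here is an assumption for illustration, not anything Cassandra prescribes.

    import time
    from cassandra import ttypes

    def insert_with_backoff(conn, key, col_path, value, consistency, retries=5):
        delay = 0.5
        for attempt in range(retries):
            try:
                conn.insert("Keyspace1", key, col_path, value,
                            int(time.time() * 1e6), consistency)
                return
            except (ttypes.TimedOutException, ttypes.UnavailableException):
                time.sleep(delay)    # give the slow node room to flush/compact
                delay *= 2
        raise RuntimeError("giving up on key %r after %d attempts" % (key, retries))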
Re: Cassandra cluster runs into OOM when bulk loading data
Which version of Cassandra? Which version of Java JVM are you using? What do your I/O stats look like when bulk importing? When you run `nodeprobe -host tpstats` is any thread pool backing up during the import? -Chris 2010/4/26 Roland Hänel > I have a cluster of 5 machines building a Cassandra datastore, and I load > bulk data into this using the Java Thrift API. The first ~250GB runs fine, > then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using > and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM > allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts > are done with consistency level ALL. > > I hope with this I have avoided all the 'usual dummy errors' that lead to > OOM's. I have begun to troubleshoot the issue with JMX, however, it's > difficult to catch the JVM in the right moment because it runs well for > several hours before this thing happens. > > One thing gets to my mind, maybe one of the experts could confirm or reject > this idea for me: is it possible that when one machine slows down a little > bit (for example because a big compaction is going on), the memtables don't > get flushed to disk as fast as they are building up under the continuing > bulk import? That would result in a downward spiral, the system gets slower > and slower on disk I/O, but since more and more data arrives over Thrift, > finally OOM. > > I'm using the "periodic" commit log sync, maybe also this could create a > situation where the commit log writer is too slow to catch up with the data > intake, resulting in ever growing memory usage? > > Maybe these thoughts are just bullshit. Let me now if so... ;-) > > >
Re: Can Cassandra make real use of several DataFileDirectories?
Ryan, I agree with you on the hot spots, however for the physical disk performance, even the worst case hot spot is not worse than RAID0: in a hot spot scenario, it might be that 90% of your reads go to one hard drive. But with RAID0, 100% of your reads will go to *all* hard drives. But you're right, individual disks might waste up to 50% of your total disk space... I came to consider this idea because Hadoop DFS explicitely recommends different disks. But the design is not exactly the same, they don't have to deal with very big files on the native FS layer. -Roland 2010/4/26 Ryan King > 2010/4/26 Roland Hänel : > > Hm... I understand that RAID0 would help to create a bigger pool for > > compactions. However, it might impact read performance: if I have several > > CF's (with their SSTables), random read requests for the CF files that > are > > on separate disks will behave nicely - however if it's RAID0 then a > random > > read on any file will create a random read on all of the hard disks. > > Correct? > > Without RAID0 you will end up with host spots (a compaction could end > up putting a large SSTable on one disk, while the others have smaller > SSTables). If you have many CFs this might average out, but it might > not and there are no guarantees here. I'd reccomend RAID0 unless you > have reason to do something else. > > -ryan >
Re: Question about TimeUUIDType
On Sun, Apr 25, 2010 at 5:43 PM, Jonathan Ellis wrote: > On Sun, Apr 25, 2010 at 5:40 PM, Tatu Saloranta wrote: >>> Now with TimeUUIDType, if two UUID have the same timestamps, they are >>> ordered >>> by bytes order. >> >> Naively for the whole UUID? That would not be good, given that >> timestamp within UUID is not stored in expected lexical order, but >> with sort of little-endian mess (first bytes are least-significant >> bytes of timestamp). > > I think the code here is clearer than explaining in English. :) > > comparing timeuuids o1 and o2: > > long t1 = LexicalUUIDType.getUUID(o1).timestamp(); > long t2 = LexicalUUIDType.getUUID(o2).timestamp(); > return t1 < t2 ? -1 : (t1 > t2 ? 1 : > FBUtilities.compareByteArrays(o1, o2)); :-) Yes, this makes sense, so it is a two-part sort, not just latter part. -+ Tatu +- ps. Not sure if this matters, but I am finally working on Java Uuid Generator v3, which might help with time-location based UUIDs. Will announce it on the list when it's ready (in couple of weeks)
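For anyone who wants to sanity-check orderings client-side, the same two-part comparison can be sketched in Python (uuid.UUID.time is the 60-bit version 1 timestamp, so this only applies to time-based UUIDs, and the byte-order tie-break here glosses over the signed/unsigned details of the Java byte comparison):

    import uuid

    def compare_timeuuid_bytes(b1, b2):
        # Order by the embedded 60-bit timestamp first, then fall back to raw bytes.
        t1, t2 = uuid.UUID(bytes=b1).time, uuid.UUID(bytes=b2).time
        if t1 != t2:
            return -1 if t1 < t2 else 1
        return (b1 > b2) - (b1 < b2)

    earlier = uuid.uuid1().bytes
    later = uuid.uuid1().bytes
    compare_timeuuid_bytes(earlier, later)   # -1 unless the two timestamps collide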
Re: ORM in Cassandra?
On 04/26/2010 03:11 PM, Tatu Saloranta wrote: On Mon, Apr 26, 2010 at 10:35 AM, Ethan Rowe wrote: On 04/26/2010 01:26 PM, Isaac Arias wrote: On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: ... In my opinion, a mapping solution for Cassandra should be more like a Template. Something that helps map (back and forth) rows to objects, columns to properties, etc. Since the data model can vary so much depending on data access patters, any overly structured approach that prescribes a particular schema will be of limited use. For what it's worth, this is exactly my opinion after looking at the problem for a bit, and I'm actively developing such a solution in Ruby. I spent ... So, for my approach, there's one project that gives metaprogramming semantics for building the mapping behavior you describe: build classes that are oriented towards mapping between simple JSON-like structures and full-blown business objects. And a separate project that layers Cassandra specifics on top of that underlying mapper tool. +1 I think proper layering is the way to go: it makes problem (of simple construction of services that use Cassandra as the storage system) much easier to solve, divide and conquer. There are pretty decent OJM/OXM solutions that are mostly orthogonal wrt distributed storage part. I understand that there are some trade-offs (some things are easiest to optimize when Cassandra core handles them), but flexibility and best-tool-for-the-job have their benefits too. Right. Additionally, this mapping layer between "simple" (i.e. JSON-ready) structures and "complex" (i.e. business objects) would seem to be of much more general value than a Cassandra-specific mapper. I would think most any environment with a heavy reliance on Thrift services would benefit from such tools. -- Ethan Rowe End Point Corporation et...@endpoint.com
Re: Cassandra cluster runs into OOM when bulk loading data
Cassandra Version 0.6.1 OpenJDK Server VM (build 14.0-b16, mixed mode) Import speed is about 10MB/s for the full cluster; if a compaction is going on the individual node is I/O limited tpstats: caught me, didn't know this. I will set up a test and try to catch a node during the critical time. Thanks, Roland 2010/4/26 Chris Goffinet > Which version of Cassandra? > Which version of Java JVM are you using? > What do your I/O stats look like when bulk importing? > When you run `nodeprobe -host tpstats` is any thread pool backing up > during the import? > > -Chris > > > 2010/4/26 Roland Hänel > > I have a cluster of 5 machines building a Cassandra datastore, and I load >> bulk data into this using the Java Thrift API. The first ~250GB runs fine, >> then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using >> and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM >> allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts >> are done with consistency level ALL. >> >> I hope with this I have avoided all the 'usual dummy errors' that lead to >> OOM's. I have begun to troubleshoot the issue with JMX, however, it's >> difficult to catch the JVM in the right moment because it runs well for >> several hours before this thing happens. >> >> One thing gets to my mind, maybe one of the experts could confirm or >> reject this idea for me: is it possible that when one machine slows down a >> little bit (for example because a big compaction is going on), the memtables >> don't get flushed to disk as fast as they are building up under the >> continuing bulk import? That would result in a downward spiral, the system >> gets slower and slower on disk I/O, but since more and more data arrives >> over Thrift, finally OOM. >> >> I'm using the "periodic" commit log sync, maybe also this could create a >> situation where the commit log writer is too slow to catch up with the data >> intake, resulting in ever growing memory usage? >> >> Maybe these thoughts are just bullshit. Let me now if so... ;-) >> >> >> >
Re: Cassandra cluster runs into OOM when bulk loading data
Upgrade to b20 of Sun's version of JVM. This OOM might be related to LinkedBlockQueue issues that were fixed. -Chris 2010/4/26 Roland Hänel > Cassandra Version 0.6.1 > OpenJDK Server VM (build 14.0-b16, mixed mode) > Import speed is about 10MB/s for the full cluster; if a compaction is going > on the individual node is I/O limited > tpstats: caught me, didn't know this. I will set up a test and try to catch > a node during the critical time. > > Thanks, > Roland > > > 2010/4/26 Chris Goffinet > > Which version of Cassandra? >> Which version of Java JVM are you using? >> What do your I/O stats look like when bulk importing? >> When you run `nodeprobe -host tpstats` is any thread pool backing up >> during the import? >> >> -Chris >> >> >> 2010/4/26 Roland Hänel >> >> I have a cluster of 5 machines building a Cassandra datastore, and I load >>> bulk data into this using the Java Thrift API. The first ~250GB runs fine, >>> then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using >>> and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM >>> allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts >>> are done with consistency level ALL. >>> >>> I hope with this I have avoided all the 'usual dummy errors' that lead to >>> OOM's. I have begun to troubleshoot the issue with JMX, however, it's >>> difficult to catch the JVM in the right moment because it runs well for >>> several hours before this thing happens. >>> >>> One thing gets to my mind, maybe one of the experts could confirm or >>> reject this idea for me: is it possible that when one machine slows down a >>> little bit (for example because a big compaction is going on), the memtables >>> don't get flushed to disk as fast as they are building up under the >>> continuing bulk import? That would result in a downward spiral, the system >>> gets slower and slower on disk I/O, but since more and more data arrives >>> over Thrift, finally OOM. >>> >>> I'm using the "periodic" commit log sync, maybe also this could create a >>> situation where the commit log writer is too slow to catch up with the data >>> intake, resulting in ever growing memory usage? >>> >>> Maybe these thoughts are just bullshit. Let me now if so... ;-) >>> >>> >>> >> >
Re: Can Cassandra make real use of several DataFileDirectories?
2010/4/26 Roland Hänel : > Ryan, I agree with you on the hot spots, however for the physical disk > performance, even the worst case hot spot is not worse than RAID0: in a hot > spot scenario, it might be that 90% of your reads go to one hard drive. But > with RAID0, 100% of your reads will go to *all* hard drives. RAID0 is designed specifically to improve performance (both latency and bandwidth). I'm unclear about why you think it would decrease performance. Perhaps you're thinking of another RAID type? Paul Prescod
Re: Cassandra cluster runs into OOM when bulk loading data
Thanks Chris 2010/4/26 Chris Goffinet > Upgrade to b20 of Sun's version of JVM. This OOM might be related to > LinkedBlockQueue issues that were fixed. > > -Chris > > > 2010/4/26 Roland Hänel > >> Cassandra Version 0.6.1 >> OpenJDK Server VM (build 14.0-b16, mixed mode) >> Import speed is about 10MB/s for the full cluster; if a compaction is >> going on the individual node is I/O limited >> tpstats: caught me, didn't know this. I will set up a test and try to >> catch a node during the critical time. >> >> Thanks, >> Roland >> >> >> 2010/4/26 Chris Goffinet >> >> Which version of Cassandra? >>> Which version of Java JVM are you using? >>> What do your I/O stats look like when bulk importing? >>> When you run `nodeprobe -host tpstats` is any thread pool backing up >>> during the import? >>> >>> -Chris >>> >>> >>> 2010/4/26 Roland Hänel >>> >>> I have a cluster of 5 machines building a Cassandra datastore, and I load bulk data into this using the Java Thrift API. The first ~250GB runs fine, then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts are done with consistency level ALL. I hope with this I have avoided all the 'usual dummy errors' that lead to OOM's. I have begun to troubleshoot the issue with JMX, however, it's difficult to catch the JVM in the right moment because it runs well for several hours before this thing happens. One thing gets to my mind, maybe one of the experts could confirm or reject this idea for me: is it possible that when one machine slows down a little bit (for example because a big compaction is going on), the memtables don't get flushed to disk as fast as they are building up under the continuing bulk import? That would result in a downward spiral, the system gets slower and slower on disk I/O, but since more and more data arrives over Thrift, finally OOM. I'm using the "periodic" commit log sync, maybe also this could create a situation where the commit log writer is too slow to catch up with the data intake, resulting in ever growing memory usage? Maybe these thoughts are just bullshit. Let me now if so... ;-) >>> >> >
Re: The Difference Between Cassandra and HBase
On Sat, Apr 24, 2010 at 10:20 AM, dir dir wrote: > In general what is the difference between Cassandra and HBase?? > > Thanks. > Others have already said it ... Cassandra has a peer architecture, with all peers being essentially equivalent (minus the concept of a "seed," as far as I can tell). This is a great architectural advantage of Cassandra and Cassandra-like systems. It wasn't really possible to make practical systems like this in earlier ages because of computing (memory, CPU, disk) limitations which made characteristic times (including expected characteristic response, recovery, replication, etc. times) and system dynamics almost impossible to deal with. This problem persists but has become far more manageable because expected response times haven't evolved or narrowed any faster than computational capabilities. HBase on the other hand is a layered system already. It relies on the underlying HDFS, beyond and above the OS. As a more layered systems, it has better service architecture, in a sense, but it relies and is limited to the capabilities of those "services" ... say the distributed file service. Cassandra rolls its own partitioning and replication mechanisms at the level of its peers. It does not rely on some underlying system service for these capabilities. Cassandra is definitely easier to provision and use, from an operational point of view, and this is a great advantage -- although installations that afford scanning (through ordered partitioning) would become more involved. (As suggested by others, reading the BigTable and Dynamo paper will help you to establish the difference between HBase and Cassandra in more clear, architectural terms.) - m.
Announcing Riptano professional Cassandra support and services
Short version: Matt Pfeil and I have founded http://riptano.com to provide production Cassandra support, training, and professional services. Yes, we're hiring. Long version: http://spyced.blogspot.com/2010/04/and-now-for-something-completely.html We're happy to answer questions on- or off-list. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Can Cassandra make real use of several DataFileDirectories?
RAID0 decreases the performance of muliple, concurrent random reads because for each read request (I assume that at least a couple of stripe sizes are read), all hard disks are involved in that read. Consider the following example: you want to read 1MB out of each of two files a) both files are on the same RAID0 of two disks. For the first 1MB read request, both disks contain some stripes of this request, both disks have to move their heads to the correct location and do the read. The second read request has to wait until the first one finishes, because it is served from the same disks and depends on the same disk heads. b) files are on seperate disks. Both reads can be done at the same time, because disk heads can move independently. Or look at it this way: if you issue a read request on a RAID0, and your disks have 8ms access time, then after the read request, the whole RAID0 is completely blocked for 8ms. If you handle the disks independently, only the disk containing the file is blocked. RAID0 has its advantages of course. Streaming reads/writes (e.g. during a compaction) will be extremely fast. -Roland 2010/4/26 Paul Prescod > 2010/4/26 Roland Hänel : > > Ryan, I agree with you on the hot spots, however for the physical disk > > performance, even the worst case hot spot is not worse than RAID0: in a > hot > > spot scenario, it might be that 90% of your reads go to one hard drive. > But > > with RAID0, 100% of your reads will go to *all* hard drives. > > RAID0 is designed specifically to improve performance (both latency > and bandwidth). I'm unclear about why you think it would decrease > importance. Perhaps you're thinking of another RAID type? > > Paul Prescod >
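Roland's argument can be put into rough numbers. A back-of-envelope model in Python, under the same simplifying assumptions he states (8 ms positioning time, two 1 MB requests arriving together) plus an assumed 100 MB/s per-disk transfer rate:

    seek = 0.008          # 8 ms positioning time per request, per disk
    rate = 100e6          # assumed sequential transfer rate per disk, bytes/s
    size = 1e6            # 1 MB per read request, two requests arriving together

    # Two independent disks, one file on each: the reads proceed in parallel.
    independent = seek + size / rate              # ~0.018 s until both requests finish

    # Two-disk RAID0: each request pulls size/2 from *both* disks, and the second
    # request has to queue behind the first on both spindles.
    raid0 = 2 * (seek + (size / 2) / rate)        # ~0.026 s until both requests finish

With only a single outstanding request the RAID0 finishes sooner (~0.013 s vs ~0.018 s); it is concurrent random reads that pay, which is the point being made. Real controllers, read-ahead and stripe sizes will move these numbers, so treat this purely as an illustration of the argument.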
How to generate 'unique' identifiers for use in Cassandra
Typically, in the SQL world we use things like AUTO_INCREMENT columns that let us create a unique key automatically if a row is inserted into a table. What do you guys usually do to create identifiers for use in Cassandra? Do we only rely on "currentTimeMills() + random()" to create something that is 'unique enough' (but theoretically not fail-safe)? Or are some people here using systems like ZooKeeper for this purpose? -Roland
Re: How to generate 'unique' identifiers for use in Cassandra
http://wiki.apache.org/cassandra/UUID if you don't need transactional ordering, ZooKeeper or something comparable if you do. 2010/4/26 Roland Hänel > Typically, in the SQL world we use things like AUTO_INCREMENT columns that > let us create a unique key automatically if a row is inserted into a table. > > What do you guys usually do to create identifiers for use in Cassandra? > > Do we only rely on "currentTimeMills() + random()" to create something that > is 'unique enough' (but theoretically not fail-safe)? Or are some people > here using systems like ZooKeeper for this purpose? > > -Roland > >
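For the non-coordinated case, a version 1 (time-based) or version 4 (random) UUID is usually all that is needed; for example, in Python:

    import uuid

    row_key = uuid.uuid1()    # time + node + clock sequence; sorts by time under TimeUUIDType
    alt_key = uuid.uuid4()    # 122 random bits; collision probability is negligible in practice

    raw = row_key.bytes       # Cassandra's TimeUUIDType expects the raw 16 bytes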
Re: Can Cassandra make real use of several DataFileDirectories?
I think it might be worst case that you read all the disks. If your block size is large enough to hold an entire row, you should only have to read one disk to get that data. I, for instance, stopped using multiple data directories and instead use a RAID0. The number of blocks read is not the same for all the disks as you suggest it would be if every disk was involved in every transaction.

Device:            tps    Blk_read/s    Blk_wrtn/s    Blk_read    Blk_wrtn
sda1             11.80          1.60        105.60           8         528
sdb              17.20        867.20          0.00        4336           0
sdc               2.60          0.00        155.20           0         776
sdd              16.40        796.80          0.00        3984           0
sde              21.80       1113.60          8.00        5568          40
md0              56.00       2777.60          8.00       13888          40

sdb, sdd and sde are raided on md0 on an ec2 xlarge instance; the number of blocks is different. Of course my rows are small (1-2 Kb), so I should rarely cross a block boundary, with 1MB rows you are more likely to, so multiple data directories might be better for you. I think it all sort of depends on your data size. -Anthony

On Mon, Apr 26, 2010 at 10:09:58PM +0200, Roland Hänel wrote: > RAID0 decreases the performance of multiple, concurrent random reads because > for each read request (I assume that at least a couple of stripe sizes are > read), all hard disks are involved in that read. > > Consider the following example: you want to read 1MB out of each of two > files > > a) both files are on the same RAID0 of two disks. For the first 1MB read > request, both disks contain some stripes of this request, both disks have to > move their heads to the correct location and do the read. The second read > request has to wait until the first one finishes, because it is served from > the same disks and depends on the same disk heads. > > b) files are on separate disks. Both reads can be done at the same time, > because disk heads can move independently. > > Or look at it this way: if you issue a read request on a RAID0, and your > disks have 8ms access time, then after the read request, the whole RAID0 is > completely blocked for 8ms. If you handle the disks independently, only the > disk containing the file is blocked. > > RAID0 has its advantages of course. Streaming reads/writes (e.g. during a > compaction) will be extremely fast. > > -Roland > > > 2010/4/26 Paul Prescod > > > 2010/4/26 Roland Hänel : > > > Ryan, I agree with you on the hot spots, however for the physical disk > > > performance, even the worst case hot spot is not worse than RAID0: in a > > hot > > > spot scenario, it might be that 90% of your reads go to one hard drive. > > But > > > with RAID0, 100% of your reads will go to *all* hard drives. > > > > RAID0 is designed specifically to improve performance (both latency > > and bandwidth). I'm unclear about why you think it would decrease > > performance. Perhaps you're thinking of another RAID type? > > > > Paul Prescod > > -- Anthony Molinaro
Re: Can Cassandra make real use of several DataFileDirectories?
On Mon, Apr 26, 2010 at 2:15 PM, Anthony Molinaro wrote: > I think it might be worse case that you read all the disks. If your > block size is large enough to hold an entire row, you should only have to > read one disk to get that data. And conversely, for a large enough row you might benefit from streaming from two disks at once rather than one. Paul
Re: strange get_range_slices behaviour v0.6.1
I've broken this case down further to some Python code that works against the Thrift-generated client and am still getting the same odd results. With keys object1, object2 and object3, an open ended get_range_slice starting with "object1" only returns object1 and object3. I'm guessing that I've got something wrong or my expectation of how get_range_slice works is wrong, but I cannot see where I've gone wrong. Any help would be appreciated. The Python code to add and read keys is below; it assumes a Cassandra.Client connection.

    import time
    from cassandra import Cassandra, ttypes
    from thrift import Thrift
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TSocket, TTransport

    def add_data(conn):
        col_path = ttypes.ColumnPath(column_family="Standard1", column="col_name")
        consistency = ttypes.ConsistencyLevel.QUORUM
        for key in ["object1", "object2", "object3"]:
            conn.insert("Keyspace1", key, col_path, "col_value",
                        int(time.time() * 1e6), consistency)
        return

    def read_range(conn, start_key, end_key):
        col_parent = ttypes.ColumnParent(column_family="Standard1")
        predicate = ttypes.SlicePredicate(column_names=["col_name"])
        range = ttypes.KeyRange(start_key=start_key, end_key=end_key, count=1000)
        consistency = ttypes.ConsistencyLevel.QUORUM
        return conn.get_range_slices("Keyspace1", col_parent, predicate, range, consistency)

Below is the result of calling read_range with different start values. I've also included the debug log for each call; the line starting with "reading RangeSliceCommand" seems to show that key hash for "object2" is greater than "object3".

#expect to return objects 1,2 and 3
In [37]: cass_test.read_range(conn, "object1", "")
Out[37]:
[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272315595268837, name='col_name', value='col_value'), super_column=None)], key='object1'),
 KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272315595272693, name='col_name', value='col_value'), super_column=None)], key='object3')]

DEBUG 09:29:59,791 range_slice
DEBUG 09:29:59,791 RangeSliceCommand{keyspace='Keyspace1', column_family='Standard1', super_column=null, predicate=SlicePredicate(column_names:[...@257b40fe]), range=[121587881847328893689247922008234581399,0], max_keys=1000}
DEBUG 09:29:59,791 Adding to restricted ranges [121587881847328893689247922008234581399,0] for (75349581786326521367945210761838448174,75349581786326521367945210761838448174]
DEBUG 09:29:59,791 reading RangeSliceCommand{keyspace='Keyspace1', column_family='Standard1', super_column=null, predicate=SlicePredicate(column_names:[...@257b40fe]), range=[121587881847328893689247922008234581399,0], max_keys=1000} from 1...@localhost/127.0.0.1
DEBUG 09:29:59,791 Sending RangeSliceReply{rows=Row(key='object1', cf=ColumnFamily(Standard1 [636f6c5f6e616d65:false:9...@1272315595268837,])),Row(key='object3', cf=ColumnFamily(Standard1 [636f6c5f6e616d65:false:9...@1272315595272693,]))} to 1...@localhost/127.0.0.1
DEBUG 09:29:59,791 Processing response on a callback from 1...@localhost/127.0.0.1
DEBUG 09:29:59,791 range slices read object1
DEBUG 09:29:59,791 range slices read object3

In [38]: cass_test.read_range(conn, "object2", "")
Out[38]:
[KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272315595271798, name='col_name', value='col_value'), super_column=None)], key='object2'),
 KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272315595268837, name='col_name', value='col_value'), super_column=None)], key='object1'),
 KeySlice(columns=[ColumnOrSuperColumn(column=Column(timestamp=1272315595272693, name='col_name', value='col_value'), super_column=None)], key='object3')]

DEBUG 09:34:48,133 range_slice
DEBUG 09:34:48,133 RangeSliceCommand{keyspace='Keyspace1', column_family='Standard1', super_column=null, predicate=SlicePredicate(column_names:[...@7966340c]), range=[28312518014678916505369931620527723964,0], max_keys=1000}
DEBUG 09:34:48,133 Adding to restricted ranges [28312518014678916505369931620527723964,0] for (75349581786326521367945210761838448174,75349581786326521367945210761838448174]
DEBUG 09:34:48,133 reading RangeSliceCommand{keyspace='Keyspace1', column_family='Standard1', super_column=null, predicate=SlicePredicate(column_names:[...@7966340c]), range=[28312518014678916505369931620527723964,0], max_keys=1000} from 1...@localhost/127.0.0.1
DEBUG 09:34:48,133 Sending RangeSliceReply{rows=Row(key='object2', cf=ColumnFamily(Standard1 [636f6c5f6e616d65:false:9...@1272315595271798,])),Row(key='object1', cf=ColumnFamily(Standard1 [636f6c5f6e616d65:false:9...@1272315595268837,])),Row(key='object3', cf=ColumnFamily(Standard1 [636f6c5f6e616d65:false:9...@1272315595272693,]))} to 1...@localhost/127.0.0.1
DEBUG 09:34:48,133 Processing response on a callback from 1...@localhost/127.0.0.1
DEBUG 09:34:48,133 range slices read object2
DEBUG 09:34:48,133 range slices read object1
DEBUG 09:34:48,133 rang
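For what it's worth, the behaviour above is what RandomPartitioner gives you: rows are ordered by the MD5-derived token of the key, not by the key itself, so an open-ended range starting at object1's token can legitimately skip object2. A rough re-creation of the token calculation (an approximation for illustration, not the exact code path):

    import hashlib

    def approx_token(key):
        # RandomPartitioner's token is (roughly) the absolute value of the MD5
        # digest of the key interpreted as a signed 128-bit integer.
        u = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return 2**128 - u if u >= 2**127 else u

    keys = ["object1", "object2", "object3"]
    ring_order = sorted(keys, key=approx_token)
    # get_range_slices walks rows in this token order, not in key order.

If you need ranges that follow key order, that is what OrderPreservingPartitioner is for, with the load-balancing caveats discussed elsewhere on this list.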
Quorum consistency in a changing ring
Hello, Is my interpretation correct that Cassandra is intended to guarantee quorum consistency (overlapping read/write sets) at all times, including a ring that is actively changing? I.e., there are no (intended) cases where quorum consistency is defeated due to writes or reads going to nodes that are actively participating in tokens moving? If yes, is there any material on how this is accomplished and/or pointers to roughly which parts of the implementation are responsible for ensuring this works? Thanks! -- / Peter Schuller
Re: Quorum consistency in a changing ring
Increasing the replication level is known to break it. --Original Message-- From: Peter Schuller Sender: sc...@scode.org To: user@cassandra.apache.org ReplyTo: user@cassandra.apache.org Subject: Quorom consistency in a changing ring Sent: Apr 26, 2010 21:55 Hello, Is my interpretation correct that Cassandra is intended to guarantee quorom consistency (overlapping read/write sets) at all times, including a ring that is actively changing? I.e., there are no (intended) cases where qurom consistency is defeated due to writes or reads going to nodes that are actively participating in token:s moving? If yes, is there any material on how this is accomplished and/or pointers to roughly which parts of the implementation is responsible for ensuring this works? Thanks! -- / Peter Schuller
Re: Quorum consistency in a changing ring
> Increasing the replication level is known to break it. Thanks! Yes, of that I am aware. When I said ring changes I meant nodes being added and removed, or just re-balanced, implying tokens moving around the ring. -- / Peter Schuller aka scode
Re: ORM in Cassandra?
I call Tragedy a 'Cassandra Object Abstraction' (COA), because I try to write a reusable implementation of patterns that are commonly used for cassandra data modeling. E.g. using TimeUUID columns for storing an Index is a pattern. Then various strategies to partition these Indexes are another pattern. I'm hoping that after some iteration a good mix of high-level abstractions that can be reused for all kinds of apps will emerge. It feels ambitious to me to try to implement cross-nosql-store abstractions before these patterns and best practices have been documented and battle-proven. On that note, if such documentation does exist, or you know cool patterns, I'd love to hear about them! Paul On Mon, Apr 26, 2010 at 10:46 AM, banks wrote: > The real tragedy is that we have not created a new acronym for this yet... > > OKVM... it makes more sense... > > > On Mon, Apr 26, 2010 at 10:35 AM, Ethan Rowe wrote: >> >> On 04/26/2010 01:26 PM, Isaac Arias wrote: >>> >>> On Apr 26, 2010, at 12:13 PM, Geoffry Roberts wrote: >>> >>> Clearly Cassandra is not an RDBMS. The intent of my Hibernate reference was to be more lyrical. Sorry if that didn't come through. >>> >>> Nonetheless, the need remains to relieve ourselves from excessive boilerplate coding. >>> >>> I agree with eliminating boilerplate code. Chris Shorrock wrote a >>> simple object mapper in Scala for his Cascal Cassandra client. You may >>> want to check out the wiki on GitHub >>> (http://wiki.github.com/shorrockin/cascal/). >>> >>> In my opinion, a mapping solution for Cassandra should be more like a >>> Template. Something that helps map (back and forth) rows to objects, >>> columns to properties, etc. Since the data model can vary so much >>> depending on data access patterns, any overly structured approach that >>> prescribes a particular schema will be of limited use. >>> >> >> For what it's worth, this is exactly my opinion after looking at the >> problem for a bit, and I'm actively developing such a solution in Ruby. I >> spent some time playing with the CassandraObject project, but felt that >> despite all the good work that went in there, it didn't feel to me like it >> fit the problem space in an idiomatic manner. No criticism intended there; >> it seems to lean a little more towards a very structured schema, with less >> flexibility for things like collection attributes the members of which all >> have a key that matches a pattern (which is a use case we have). >> >> So, for my approach, there's one project that gives metaprogramming >> semantics for building the mapping behavior you describe: build classes that >> are oriented towards mapping between simple JSON-like structures and >> full-blown business objects. And a separate project that layers Cassandra >> specifics on top of that underlying mapper tool. >> >> The rub being: it's for a client, and we're collectively sorting out the >> details for releasing the code in some useful, public manner. But hopefully >> I'll get something useful out there for potential Ruby enthusiasts before >> too long. Hopefully a week or two. >> >> Thanks. >> - Ethan >> >> -- >> Ethan Rowe >> End Point Corporation >> et...@endpoint.com >> > >
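To make the TimeUUID-index pattern mentioned above concrete, here is a minimal sketch (an illustration, not code from the thread); insert_column is a hypothetical helper standing in for whatever client call you use, and the row and column names are made up:

import time
import uuid

def index_new_item(insert_column, item_key):
    # One index row per day; version-1 (time-based) UUIDs as column names keep
    # entries in insertion-time order when the column family compares with TimeUUIDType.
    index_row = "items-by-time-" + time.strftime("%Y%m%d")
    insert_column(row_key=index_row,
                  column_name=uuid.uuid1().bytes,
                  value=item_key)

Partitioning the index (the second pattern mentioned) then mostly amounts to choosing how index_row is derived: per day, per hour, or per hash bucket of the item key.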
Re: Quorum consistency in a changing ring
Live nodes that have tokens indicating they should receive a copy of data count towards write quorum. This means if a node is down (not decommissioned) the copy sent to the node acting as the hinted handoff replica will not count towards achieving quorum. If a token is moved, it is moved. It is not in 2 places at once. If you are using CL.QUORUM and it succeeds, it really is reading or writing RF / 2 + 1 copies. b 2010/4/26 Peter Schüller : >> Increasing the replication level is known to break it. > > Thanks! Yes, of that I am aware. When I said ring changes I meant > nodes being added and removed, or just re-balanced, implying tokens > moving around the ring. > > -- > / Peter Schuller aka scode >
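To spell out the RF / 2 + 1 arithmetic above (a trivial illustration, not part of the original mail):

def quorum(replication_factor):
    # Number of replicas that must respond for CL.QUORUM to succeed.
    return replication_factor // 2 + 1

print(quorum(3))  # 2 of 3 replicas
print(quorum(5))  # 3 of 5 replicas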
how to get apache cassandra version with thrift client ?
Hi all: How do I get the Apache Cassandra version with a Thrift client? Thanks for any reply. -- Shuge Lee | Lee Li | 李蠡
Re: Super and Regular Columns
On Fri, Apr 23, 2010 at 3:32 PM, Robert wrote: > I am starting out with Cassandra and I had a couple of questions, I read a > lot of the documentation including: > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model > First I wanted to make sure I understand this > bug: http://issues.apache.org/jira/browse/CASSANDRA-598 > Borrowing from the example provided in that article, would an example > subcolumn be 'friend1' or 'street'? friend1 is the name of a supercolumn; street is the name of a subcolumn > Second, for a one to many map where ordering is not important what are the > tradeoffs between these two options? > > A. Use a ColumnFamily where the key maps to an item id, and in each row each > column is one of the items it is mapped to? > > B. Use SuperColumnFamily where each key is an item id, and each column (are > these the right terms?) is one of the items it is mapped to, and the value > is essentially empty? I don't see what using supercolumns gives you here, so don't use them. :) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
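For readers comparing the two layouts, a rough sketch of option A from the question as a plain Python dict (keys and names are invented for illustration):

# Standard column family: the row key is the source item id, each column name is a
# mapped item id, and the column value can be left empty since the name carries the data.
mapping_cf = {
    "item-42": {
        "item-7": "",
        "item-9": "",
        "item-13": "",
    },
}

# Reading the whole mapping for "item-42" is then a single column slice of that row.
print(sorted(mapping_cf["item-42"].keys()))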
Re: Cassandra reverting deletes?
How are you checking that the rows are gone? Are you experiencing node outages during this? DC_QUORUM is unfinished code right now, you should avoid using it. Can you reproduce with normal QUORUM? On Sat, Apr 24, 2010 at 12:23 PM, Joost Ouwerkerk wrote: > I'm having trouble deleting rows in Cassandra. After running a job that > deletes hundreds of rows, I run another job that verifies that the rows are > gone. Both jobs run correctly. However, when I run the verification job an > hour later, the rows have re-appeared. This is not a case of "ghosting" > because the verification job actually checks that there is data in the > columns. > > I am running a cluster with 12 nodes and a replication factor of 3. I am > using DC_QUORUM consistency when deleting. > > Any ideas? > Joost. > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra use cases: as a datagrid ? as a distributed cache ?
On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito wrote: > (1) has anyone already used Cassandra as an in-memory data grid ? > If no, does anyone know how far such a database is from, let's say, Oracle > Coherence ? > Does Cassandra provide, for example, a (synchronized) cache on the client > side ? If you mean an in-process cache on the client side, no. > (2) has anyone already used Cassandra as a distributed cache ? > Are there some testimonials somewhere about this use case ? That's basically what reddit is using it for. http://blog.reddit.com/2010/03/she-who-entangles-men.html -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: cassandra 0.5.1 java.lang.OutOfMemoryError: Java heap space issue
0.5 has a bug that allows it to OOM itself from replaying the log too fast. You should upgrade to 0.6.1. On Mon, Apr 26, 2010 at 12:11 PM, elsif wrote: > > Hello. I have a six node cassandra cluster running on modest hardware > with 1G of heap assigned to cassandra. After inserting about 245 > million rows of data, cassandra failed with a > java.lang.OutOfMemoryError: Java heap space error. I rasied the java > heap to 2G, but still get the same error when trying to restart cassandra. > > I am using Cassandra 0.5.1 with Sun jre1.6.0_18. > > Any thoughts on how to resolve this issue are greatly appreciated. > > Here are log excerpts from two of the nodes: > > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 > SliceQueryFilter.java (line 116) collecting SuperColumn(dcf9f19e > [0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 > SliceQueryFilter.java (line 116) collecting SuperColumn(dd04bf9c > [0a011d0c,0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 > SliceQueryFilter.java (line 116) collecting SuperColumn(dd08981a > [0a011d0c,0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 > SliceQueryFilter.java (line 116) collecting SuperColumn(dd7f7ac9 > [0a011d0c,0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,490 > SliceQueryFilter.java (line 116) collecting SuperColumn(dde1d4cf > [0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 > SliceQueryFilter.java (line 116) collecting SuperColumn(de32aec3 > [0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 > SliceQueryFilter.java (line 116) collecting SuperColumn(de378105 > [0a011d0c,0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 > SliceQueryFilter.java (line 116) collecting SuperColumn(deb5d591 > [0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 > SliceQueryFilter.java (line 116) collecting SuperColumn(ded75dee > [0a011d0c,0a011d0d,]) > DEBUG [HINTED-HANDOFF-POOL:1] 2010-04-23 16:19:20,491 > SliceQueryFilter.java (line 116) collecting SuperColumn(defe3445 > [0a011d0c,0a011d0d,]) > INFO [FLUSH-TIMER] 2010-04-23 16:20:00,071 ColumnFamilyStore.java (line > 393) IpTag has reached its threshold; switching in a fresh Memtable > INFO [FLUSH-TIMER] 2010-04-23 16:20:00,072 ColumnFamilyStore.java (line > 1035) Enqueuing flush of Memtable(IpTag)@7816 > INFO [FLUSH-SORTER-POOL:1] 2010-04-23 16:20:00,072 Memtable.java (line > 183) Sorting Memtable(IpTag)@7816 > INFO [FLUSH-WRITER-POOL:1] 2010-04-23 16:20:00,107 Memtable.java (line > 192) Writing Memtable(IpTag)@7816 > DEBUG [Timer-0] 2010-04-23 16:20:00,130 LoadDisseminator.java (line 39) > Disseminating load info ... 
> ERROR [ROW-MUTATION-STAGE:41] 2010-04-23 16:20:00,348 > CassandraDaemon.java (line 71) Fatal exception in thread > Thread[ROW-MUTATION-STAGE:41,5,main] > java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOfRange(Unknown Source) > at java.lang.String.(Unknown Source) > at java.lang.StringBuilder.toString(Unknown Source) > at > org.apache.cassandra.db.marshal.AbstractType.getColumnsString(AbstractType.java:87) > at > org.apache.cassandra.db.ColumnFamily.toString(ColumnFamily.java:344) > at > org.apache.commons.lang.ObjectUtils.toString(ObjectUtils.java:241) > at org.apache.commons.lang.StringUtils.join(StringUtils.java:3073) > at org.apache.commons.lang.StringUtils.join(StringUtils.java:3133) > at > org.apache.cassandra.db.RowMutation.toString(RowMutation.java:263) > at java.lang.String.valueOf(Unknown Source) > at java.lang.StringBuilder.append(Unknown Source) > at > org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:46) > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:38) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.lang.Thread.run(Unknown Source) > > --- > > DEBUG [main] 2010-04-23 17:15:45,501 CommitLog.java (line 312) Reading > mutation at 57527476 > DEBUG [main] 2010-04-23 17:16:11,375 CommitLog.java (line 340) replaying > mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5c0,])} > DEBUG [main] 2010-04-23 17:16:45,293 CommitLog.java (line 312) Reading > mutation at 57527686 > DEBUG [main] 2010-04-23 17:16:45,294 CommitLog.java (line 340) replaying > mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5fb,])} > DEBUG [main] 2010-04-23 17:16:54,311 CommitLog.java (line 312) Reading > mutation at 57527919 > DEBUG [main] 2010-04-23 17:17:46,344 CommitLog.java (line 340) replaying > mutation for system.Tracking: {ColumnFamily(HintsColumnFamily [7af4c5fb,])} > DEBUG [main] 2010-04-23 17:17:55,530 CommitLog.java (line 312) Reading > mutation at 57528129 > DEBUG [main] 2010-04-23 17:18:20,266 CommitLog.java (line 340) replayi
Re: how to get apache cassandra version with thrift client ?
You can't get the Cassandra release version, but you can get the Thrift api version, which is more useful. It's compiled as a constant VERSION string in your client library. See the comments in interface/cassandra.thrift. On Mon, Apr 26, 2010 at 8:14 PM, Shuge Lee wrote: > Hi all: > How to get apache cassandra version with thrift client ? > Thanks for reply. > > -- > Shuge Lee | Lee Li | 李蠡 > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
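As a small illustration of reading that constant from Python, assuming the Thrift-generated package follows the usual layout and exposes a constants module (check your generated code; this is an assumption, not guaranteed for every client):

from cassandra import constants

# The Thrift API version compiled into the generated client library.
print(constants.VERSION)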
Re: Cassandra cluster runs into OOM when bulk loading data
I have the same problem here, and I analyzed the hprof file with mat; as you said, LinkedBlockingQueue used 2.6GB. I think the thread pools in Cassandra should limit their queue size.

cassandra 0.6.1

java version
$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

iostat
$ iostat -x -l 1
Device:  rrqm/s   wrqm/s     r/s    w/s     rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       81.00  8175.00  224.00  17.00  23984.00  2728.00    221.68      1.01   1.86   0.76  18.20

tpstats, of course, this node is still alive
$ ./nodetool -host localhost tpstats
Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0           1281
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0      473617241
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         0         0      718355184
GMFD                              0         0         132509
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0      293735704
MESSAGE-STREAMING-POOL            0         0              6
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0           1870
FLUSH-WRITER-POOL                 0         0           1870
AE-SERVICE-STAGE                  0         0              5
HINTED-HANDOFF-POOL               0         0             21

On Tue, Apr 27, 2010 at 3:32 AM, Chris Goffinet wrote: > Upgrade to b20 of Sun's version of JVM. This OOM might be related to > LinkedBlockQueue issues that were fixed. > > -Chris > > > 2010/4/26 Roland Hänel > >> Cassandra Version 0.6.1 >> OpenJDK Server VM (build 14.0-b16, mixed mode) >> Import speed is about 10MB/s for the full cluster; if a compaction is >> going on the individual node is I/O limited >> tpstats: caught me, didn't know this. I will set up a test and try to >> catch a node during the critical time. >> >> Thanks, >> Roland >> >> >> 2010/4/26 Chris Goffinet >> >> Which version of Cassandra? >>> Which version of Java JVM are you using? >>> What do your I/O stats look like when bulk importing? >>> When you run `nodeprobe -host tpstats` is any thread pool backing up >>> during the import? >>> >>> -Chris >>> >>> >>> 2010/4/26 Roland Hänel >>> >>> I have a cluster of 5 machines building a Cassandra datastore, and I load bulk data into this using the Java Thrift API. The first ~250GB runs fine, then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts are done with consistency level ALL. I hope with this I have avoided all the 'usual dummy errors' that lead to OOM's. I have begun to troubleshoot the issue with JMX, however, it's difficult to catch the JVM in the right moment because it runs well for several hours before this thing happens. One thing gets to my mind, maybe one of the experts could confirm or reject this idea for me: is it possible that when one machine slows down a little bit (for example because a big compaction is going on), the memtables don't get flushed to disk as fast as they are building up under the continuing bulk import? That would result in a downward spiral, the system gets slower and slower on disk I/O, but since more and more data arrives over Thrift, finally OOM. I'm using the "periodic" commit log sync, maybe also this could create a situation where the commit log writer is too slow to catch up with the data intake, resulting in ever growing memory usage? Maybe these thoughts are just bullshit. Let me now if so... ;-) >>> >> >
Re: value size, is there a suggested limit?
Hi Ahmed, Cassandra has a limit on how large a single value can be: the maximum size is 2^31-1 bytes. If you have more data than that, I suggest splitting it into several chunks. On Mon, Apr 26, 2010 at 3:19 AM, S Ahmed wrote: > Is there a suggested sized maximum that you can set the value of a given > key? > > e.g. could I convert a document to bytes and store it as a value to a key? > if yes, which I presume so, what if the file is 10mb? or 100mb? >
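A minimal sketch of the chunking idea (my own illustration, not from the original mail; insert_column and get_columns are hypothetical helpers standing in for whatever client calls you use, and the 1 MB chunk size is an arbitrary choice):

CHUNK_SIZE = 1024 * 1024  # 1 MB per column; pick something well below any hard limit

def store_blob(insert_column, row_key, data):
    # Split a large byte string across many columns in a single row.
    for i in range(0, len(data), CHUNK_SIZE):
        chunk_name = "chunk-%08d" % (i // CHUNK_SIZE)
        insert_column(row_key, column_name=chunk_name, value=data[i:i + CHUNK_SIZE])

def load_blob(get_columns, row_key):
    # Reassemble by concatenating the chunk columns in name order.
    columns = get_columns(row_key)  # expected: list of (column_name, value) pairs
    return b"".join(value for _, value in sorted(columns))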
Re: how to get apache cassandra version with thrift client ?
I know I can get the Thrift API version. However, I am writing a CLI for Cassandra in Python with readline support, and it will support one-key deploy/upgrade of cassandra+thrift on remote hosts, so I need to get the Apache Cassandra version to make sure it has deployed successfully. 2010/4/27 Jonathan Ellis > You can't get the Cassandra release version, but you can get the > Thrift api version, which is more useful. It's compiled as a constant > VERSION string in your client library. See the comments in > interface/cassandra.thrift. > > On Mon, Apr 26, 2010 at 8:14 PM, Shuge Lee wrote: > > Hi all: > > How to get apache cassandra version with thrift client ? > > Thanks for reply. > > > > -- > > Shuge Lee | Lee Li | 李蠡 > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of Riptano, the source for professional Cassandra support > http://riptano.com > -- Shuge Lee | Lee Li | 李蠡
Re: Cassandra cluster runs into OOM when bulk loading data
I'll work on doing more tests around this. In 0.5 we used a different data structure that required polling. But this does seem problematic. -Chris On Apr 26, 2010, at 7:04 PM, Eric Yu wrote: > I have the same problem here, and I analysised the hprof file with mat, as > you said, LinkedBlockQueue used 2.6GB. > I think the ThreadPool of cassandra should limit the queue size. > > cassandra 0.6.1 > > java version > $ java -version > java version "1.6.0_20" > Java(TM) SE Runtime Environment (build 1.6.0_20-b02) > Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) > > iostat > $ iostat -x -l 1 > Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz > avgqu-sz await svctm %util > sda 81.00 8175.00 224.00 17.00 23984.00 2728.00 221.68 > 1.011.86 0.76 18.20 > > tpstats, of coz, this node is still alive > $ ./nodetool -host localhost tpstats > Pool NameActive Pending Completed > FILEUTILS-DELETE-POOL 0 0 1281 > STREAM-STAGE 0 0 0 > RESPONSE-STAGE0 0 473617241 > ROW-READ-STAGE0 0 0 > LB-OPERATIONS 0 0 0 > MESSAGE-DESERIALIZER-POOL 0 0 718355184 > GMFD 0 0 132509 > LB-TARGET 0 0 0 > CONSISTENCY-MANAGER 0 0 0 > ROW-MUTATION-STAGE0 0 293735704 > MESSAGE-STREAMING-POOL0 0 6 > LOAD-BALANCER-STAGE 0 0 0 > FLUSH-SORTER-POOL 0 0 0 > MEMTABLE-POST-FLUSHER 0 0 1870 > FLUSH-WRITER-POOL 0 0 1870 > AE-SERVICE-STAGE 0 0 5 > HINTED-HANDOFF-POOL 0 0 21 > > > On Tue, Apr 27, 2010 at 3:32 AM, Chris Goffinet wrote: > Upgrade to b20 of Sun's version of JVM. This OOM might be related to > LinkedBlockQueue issues that were fixed. > > -Chris > > > 2010/4/26 Roland Hänel > Cassandra Version 0.6.1 > OpenJDK Server VM (build 14.0-b16, mixed mode) > Import speed is about 10MB/s for the full cluster; if a compaction is going > on the individual node is I/O limited > tpstats: caught me, didn't know this. I will set up a test and try to catch a > node during the critical time. > > Thanks, > Roland > > > 2010/4/26 Chris Goffinet > > Which version of Cassandra? > Which version of Java JVM are you using? > What do your I/O stats look like when bulk importing? > When you run `nodeprobe -host tpstats` is any thread pool backing up > during the import? > > -Chris > > > 2010/4/26 Roland Hänel > > I have a cluster of 5 machines building a Cassandra datastore, and I load > bulk data into this using the Java Thrift API. The first ~250GB runs fine, > then, one of the nodes starts to throw OutOfMemory exceptions. I'm not using > and row or index caches, and since I only have 5 CF's and some 2,5 GB of RAM > allocated to the JVM (-Xmx2500M), in theory, that should happen. All inserts > are done with consistency level ALL. > > I hope with this I have avoided all the 'usual dummy errors' that lead to > OOM's. I have begun to troubleshoot the issue with JMX, however, it's > difficult to catch the JVM in the right moment because it runs well for > several hours before this thing happens. > > One thing gets to my mind, maybe one of the experts could confirm or reject > this idea for me: is it possible that when one machine slows down a little > bit (for example because a big compaction is going on), the memtables don't > get flushed to disk as fast as they are building up under the continuing bulk > import? That would result in a downward spiral, the system gets slower and > slower on disk I/O, but since more and more data arrives over Thrift, finally > OOM. 
> > I'm using the "periodic" commit log sync, maybe also this could create a > situation where the commit log writer is too slow to catch up with the data > intake, resulting in ever growing memory usage? > > Maybe these thoughts are just bullshit. Let me now if so... ;-) > > > > > >
error during snapshot
I was attempting to get a snapshot on our cassandra nodes. I get the following error every time I run nodetool ... snapshot. Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:221) at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1060) at org.apache.cassandra.db.Table.snapshot(Table.java:256) at org.apache.cassandra.service.StorageService.takeAllSnapshot(StorageService.java:1005) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1426) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1359) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory at java.lang.UNIXProcess.(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) ... 34 more The nodes are both Amazon EC2 Large instances with 7.5G RAM (6 allocated for Java heap) with two cores and only 70G of data in casssandra. They have plenty of available RAM and HD space. Has anyone else run into this error? Lee Parker
Re: Cassandra use cases: as a datagrid ? as a distributed cache ?
great talk tonight in NYC I attended in regards to using Cassandra as a Lucene Index store (really great idea nicely implemented) http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/ so Lucinda uses Cassandra as a distributed cache of indexes =8^) On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis wrote: > On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito > wrote: >> (1) has anyone already used Cassandra as an in-memory data grid ? >> If no, does anyone know how far such a database is from, let's say, Oracle >> Coherence ? >> Does Cassandra provide, for example, a (synchronized) cache on the client >> side ? > > If you mean an in-process cache on the client side, no. > >> (2) has anyone already used Cassandra as a distributed cache ? >> Are there some testimonials somewhere about this use case ? > > That's basically what reddit is using it for. > http://blog.reddit.com/2010/03/she-who-entangles-men.html > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of Riptano, the source for professional Cassandra support > http://riptano.com > -- /* Joe Stein http://www.linkedin.com/in/charmalloc */
Re: Cassandra use cases: as a datagrid ? as a distributed cache ?
(sp) Lucandra http://github.com/tjake/Lucandra On Mon, Apr 26, 2010 at 11:08 PM, Joseph Stein wrote: > great talk tonight in NYC I attended in regards to using Cassandra as > a Lucene Index store (really great idea nicely implemented) > http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/ > > so Lucinda uses Cassandra as a distributed cache of indexes =8^) > > > On Mon, Apr 26, 2010 at 9:47 PM, Jonathan Ellis wrote: >> On Mon, Apr 26, 2010 at 9:04 AM, Dominique De Vito >> wrote: >>> (1) has anyone already used Cassandra as an in-memory data grid ? >>> If no, does anyone know how far such a database is from, let's say, Oracle >>> Coherence ? >>> Does Cassandra provide, for example, a (synchronized) cache on the client >>> side ? >> >> If you mean an in-process cache on the client side, no. >> >>> (2) has anyone already used Cassandra as a distributed cache ? >>> Are there some testimonials somewhere about this use case ? >> >> That's basically what reddit is using it for. >> http://blog.reddit.com/2010/03/she-who-entangles-men.html >> >> -- >> Jonathan Ellis >> Project Chair, Apache Cassandra >> co-founder of Riptano, the source for professional Cassandra support >> http://riptano.com >> > > > > -- > /* > Joe Stein > http://www.linkedin.com/in/charmalloc > */ > -- /* Joe Stein http://www.linkedin.com/in/charmalloc */