Re: 0.6.1 insert 1B rows, crashed when using py_stress

2010-04-20 Thread Benjamin Black
Not so reasonable, given what you are trying to accomplish.  A 1GB
heap (on a 2GB machine) is fine for development and functional
testing, but I wouldn't try to deal with the number of rows you are
describing with less than 8GB/node with 4-6GB heap.
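For a concrete starting point, the heap-related lines in cassandra.in.sh on
such a node might look like the sketch below; the 4G/6G numbers simply restate
the suggestion above and are illustrative, not tested values (the remaining GC
and JMX options are the stock ones quoted later in this thread):

JVM_OPTS=" \
        -ea \
        -Xms4G \
        -Xmx6G \
        -XX:TargetSurvivorRatio=90 \
        -XX:+AggressiveOpts \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:+HeapDumpOnOutOfMemoryError"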


b

On Mon, Apr 19, 2010 at 7:32 PM, Ken Sandney  wrote:
> I am just running Cassandra on normal boxes, and granting 1GB of the total
> 2GB to Cassandra seems reasonable to me. Can this problem be resolved by
> tuning the thresholds described on this page, or just by waiting for the 0.7
> release as Brandon mentioned?
>
> On Tue, Apr 20, 2010 at 10:15 AM, Jonathan Ellis  wrote:
>>
>> Schubert, I don't know if you saw this in the other thread referencing
>> your slides:
>>
>> It looks like the slowdown doesn't hit until after several GCs,
>> although it's hard to tell since the scale is different on the GC
>> graph and the insert throughput ones.
>>
>> Perhaps this is compaction kicking in, not GCs?  Definitely the extra
>> I/O + CPU load from compaction will cause a drop in throughput.
>>
>> On Mon, Apr 19, 2010 at 9:06 PM, Schubert Zhang  wrote:
>> > -Xmx1G is too small.
>> > In my cluster, 8GB ram on each node, and I grant 6GB to cassandra.
>> >
>> > Please see my test @
>> > http://www.slideshare.net/schubertzhang/presentations
>> >
>> > –Memory and GC... always the bottleneck and a big issue for java-based
>> > infrastructure software!
>> >
>> > References:
>> > –http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts
>> > –https://issues.apache.org/jira/browse/CASSANDRA-896
>> > (LinkedBlockingQueue
>> > issue, fixed in jdk-6u19)
>> >
>> > In fact, whenever I use java-based infrastructure software such as
>> > Cassandra, Hadoop, HBase, etc., I eventually end up pained by such
>> > memory/GC issues.
>> >
>> > Then we provide better hardware with more RAM (such as 32GB~64GB) and
>> > more CPU cores (such as 8~16), and we still cannot control the
>> > Out-Of-Memory-Error.
>> >
>> > I am thinking that maybe it is not right to leave the job of memory
>> > control to the JVM.
>> >
>> > I have long experience in telecom and embedded software from the past ten
>> > years, which demands robust programs and small RAM footprints. I want to
>> > discuss the following ideas with the community:
>> >
>> > 1. Manage the memory ourselves: allocate objects/resources (memory) at
>> > the initialization phase, and assign instances at runtime.
>> > 2. Reject requests when short of resources, instead of throwing an OOME
>> > and exiting (crashing).
>> >
>> > 3. I know, this is not easy in a java program.
>> >
>> > Schubert
>> >
>> > On Tue, Apr 20, 2010 at 9:40 AM, Ken Sandney 
>> > wrote:
>> >>
>> >> here are my JVM options; by default I didn't modify them, from
>> >> cassandra.in.sh
>> >>>
>> >>> # Arguments to pass to the JVM
>> >>> JVM_OPTS=" \
>> >>>         -ea \
>> >>>         -Xms128M \
>> >>>         -Xmx1G \
>> >>>         -XX:TargetSurvivorRatio=90 \
>> >>>         -XX:+AggressiveOpts \
>> >>>         -XX:+UseParNewGC \
>> >>>         -XX:+UseConcMarkSweepGC \
>> >>>         -XX:+CMSParallelRemarkEnabled \
>> >>>         -XX:+HeapDumpOnOutOfMemoryError \
>> >>>         -XX:SurvivorRatio=128 \
>> >>>         -XX:MaxTenuringThreshold=0 \
>> >>>         -Dcom.sun.management.jmxremote.port=8080 \
>> >>>         -Dcom.sun.management.jmxremote.ssl=false \
>> >>>         -Dcom.sun.management.jmxremote.authenticate=false"
>> >>
>> >> and my box is a normal PC with 2GB RAM and an Intel E3200 @ 2.40GHz. By
>> >> the way, I am using the latest Sun JDK.
>> >> On Tue, Apr 20, 2010 at 9:33 AM, Schubert Zhang 
>> >> wrote:
>> >>>
>> >>> Seems you should configure a larger jvm-heap.
>> >>>
>> >>> On Tue, Apr 20, 2010 at 9:32 AM, Schubert Zhang 
>> >>> wrote:
>> 
>>  Please also post your jvm-heap and GC options, i.e. the settings in
>>  cassandra.in.sh.
>>  And what about your node hardware?
>> 
>>  On Tue, Apr 20, 2010 at 9:22 AM, Ken Sandney 
>>  wrote:
>> >
>> > Hi
>> > I am doing an insert test with 9 nodes, the command:
>> >>
>> >> stress.py -n 10 -t 1000 -c 10 -o insert -i 5 -d
>> >> 10.0.0.1,10.0.0.2.
>> >
>> > and 5 of the 9 nodes crashed; only about 6'500'000 rows were
>> > inserted.
>> > I checked the system.log and it seems the reason is 'out of
>> > memory'.
>> > I don't know if this has something to do with my settings.
>> > Any idea about this?
>> > Thank you, and the following are the errors from system.log
>> >
>> >>
>> >> ERROR [CACHETABLE-TIMER-1] 2010-04-19 20:43:14,013
>> >> CassandraDaemon.java (line 78) Fatal exception in thread
>> >> Thread[CACHETABLE-TIMER-1,5,main]
>> >>
>> >> java.lang.OutOfMemoryError: Java heap space
>> >>
>> >>         at
>> >>
>> >> org.apache.cassandra.utils.ExpiringMap$CacheMonitor.run(ExpiringMap.jav

How to increase cassandra's performance in read?

2010-04-20 Thread yangfeng
I get 10 column families by keys, and one column family has 30 columns.
I use multigetSlice once to get the 10 column families, but the performance is
so poor.
Does anyone have other thoughts on how to increase the performance?
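For reference, a minimal sketch of what this kind of batched read looks like
against the 0.6 Thrift API in Java; the keyspace, column family, and key names
here are invented for illustration:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class MultigetSliceExample {
    public static void main(String[] args) throws Exception {
        // Open a raw Thrift connection (no pooling, for brevity).
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();

        // Ask for up to 30 columns per row, for several keys, in one round trip.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 30));
        Map<String, List<ColumnOrSuperColumn>> rows = client.multiget_slice(
                "Keyspace1",                      // keyspace (0.6-style API)
                Arrays.asList("key1", "key2"),    // row keys, batched together
                new ColumnParent("Standard1"),    // the column family to read
                predicate, ConsistencyLevel.ONE);

        System.out.println(rows.size() + " rows returned");
        socket.close();
    }
}

One multiget_slice call covers one column family, so reading 10 column
families still takes 10 round trips; batching the keys at least avoids a call
per key.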


RE: Cassandra Java Client

2010-04-20 Thread Dop Sun
Hi,

I have downloaded hector-0.6.0-10.jar. As you mentioned, it has a good
implementation of connection pooling and JMX counters.

What I'm doing is using Hector to create the Cassandra client (to be specific:
borrow_client(url, port)). My understanding is that, in this way, Jassandra
will enjoy the client pool and the JMX counters.

http://code.google.com/p/jassandra/issues/detail?id=17

Please feel free to let me know if you have any suggestions.

The new build, 1.0.0 build 3 (http://code.google.com/p/jassandra/), has been
created. From the Jassandra client side, there are no API changes.

Cheers~~~

Dop

 

From: Ran Tavory [mailto:ran...@gmail.com] 
Sent: Tuesday, April 20, 2010 1:36 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra Java Client

 

Hi Dop, you may want to look at hector as a low-level cassandra client on which
you build jassandra, adding hibernate-style magic etc. like other people have
done with ORM layers on top of it.

Hector's main features include extensive jmx counters, failover and connection 
pooling. 

It's available for all recent versions, including 0.5.0, 0.5.1, 0.6.0 and 0.6.1

On Mon, Apr 19, 2010 at 5:58 PM, Dop Sun  wrote:

Well, there are a couple of points about why Jassandra was created:

1. First of all, I wanted to create something like this because I come from a
JDBC background and am familiar with the Hibernate API. The ICriteria (which is
created for querying) is inspired by the Criteria API from Hibernate.

Actually, maybe because of this background, it cost me a lot of effort to
understand Cassandra in the beginning, and the Thrift API also takes time to
use.

2. Jassandra creates a layer which removes the direct link to the underlying
Thrift API (including the exceptions, the ConsistencyLevel enumeration, etc.)

I highlight this point because I believe clients of Jassandra will benefit
from implementation changes in the future: for example, if Cassandra provides
a better Thrift API for selecting the columns for a list of keys or SCFs, or
deprecates some structures or exceptions, the client may not need to change.
Of course, if Jassandra fails to prove itself, this is actually not an
advantage. :)

3. Jassandra is designed to be a JDBC-like API, no less, no more. It strives
to use the best API to do the querying (with token, key, SCF/CF) and the CRUD,
but no more than that. For example, it does not cover any API like object
mapping. But it should cover all the API functionality Thrift provides.

These 3 points are different from Hector (I should be honest that I have not
tried to use it before; the feeling of difference comes from the sample code
Hector provides).

So, the API Jassandra abstracted was something like this:

   IConnection connection = DriverManager.getConnection(
       "thrift://localhost:9160", info);
   try {
     // 2. Get a KeySpace by name
     IKeySpace keySpace = connection.getKeySpace("Keyspace1");

     // 3. Get a ColumnFamily by name
     IColumnFamily cf = keySpace.getColumnFamily("Standard2");

     // 4. Insert like this (userName is the row key, defined elsewhere)
     long now = System.currentTimeMillis();
     ByteArray nameFirst = ByteArray.ofASCII("first");
     ByteArray nameLast = ByteArray.ofASCII("last");
     ByteArray nameAge = ByteArray.ofASCII("age");
     ByteArray valueLast = ByteArray.ofUTF8("Smith");
     IColumn colFirst = new Column(nameFirst, ByteArray.ofUTF8("John"), now);
     cf.insert(userName, colFirst);

     IColumn colLast = new Column(nameLast, valueLast, now);
     cf.insert(userName, colLast);

     IColumn colAge = new Column(nameAge, ByteArray.ofLong(42), now);
     cf.insert(userName, colAge);

     // 5. Select like this
     ICriteria criteria = cf.createCriteria();
     criteria.keyList(Lists.newArrayList(userName))
         .columnRange(nameAge, nameLast, 10);
     Map<ByteArray, List<IColumn>> map = criteria.select();
     List<IColumn> list = map.get(userName);
     Assert.assertEquals(3, list.size());
     Assert.assertEquals(valueLast, list.get(2).getValue());

     // 6. Delete like this
     cf.delete(userName, colFirst);
     map = criteria.select();
     Assert.assertEquals(2, map.get(userName).size());

     // 7. Get count like this
     criteria = cf.createCriteria();
     criteria.keyList(Lists.newArrayList(userName));
     int count = criteria.count();
     Assert.assertEquals(2, count);
   } finally {
     // 8. Don't forget to close the connection.
     connection.close();
   }
 }

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Monday, April 19, 2010 10:35 PM
To: user@cassandra.apache.org

Subject: Re: Cassandra Java Client

How is Jassandra different from http://github.com/rantav/hector ?

On Mon, Apr 19, 2010 at 9:21 AM, Dop Sun  wrote:
> May I take this chance to share this link here:
>
> http://code.google.com/p/jassandra/
>
>
>
> It currently based with Cassandra 0.6 Thrift APIs.
>
>
>
> The classes ThriftCriteria and ThriftColumnFamily make direct use of the
> Thrift API. Also, the site itself has test code, which is actually wo

Re: 0.6.1 insert 1B rows, crashed when using py_stress

2010-04-20 Thread Eric Evans
On Tue, 2010-04-20 at 10:39 +0800, Ken Sandney wrote:
> Sorry I just don't know how to resolve this :)

http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts

> On Tue, Apr 20, 2010 at 10:37 AM, Jonathan Ellis 
> wrote:
> 
> > Ken, I linked you to the FAQ answering your problem in the first
> reply
> > you got.  Please don't hijack my replies to other people; that's
> rude. 

-- 
Eric Evans
eev...@rackspace.com



Re: tcp CLOSE_WAIT bug

2010-04-20 Thread Ingram Chen
I traced the IncomingStreamReader source and found that the incoming socket
comes from MessagingService$SocketThread, but there is no close() call on
either the accepted socket or the socketChannel.

Should I file a bug report?
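For what it's worth, the usual shape of the fix is to close the accepted
channel in a finally block; this is a generic Java sketch, not the actual
MessagingService code:

import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class AcceptLoopSketch {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(7000));
        while (true) {
            SocketChannel channel = server.accept();
            try {
                // ... read the incoming stream here ...
            } finally {
                // Without this, the peer's close leaves our side in CLOSE_WAIT.
                channel.close();
            }
        }
    }
}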

On Tue, Apr 20, 2010 at 11:02, Ingram Chen  wrote:

> this happened after several hours of operation, and both nodes were started
> at the same time (a clean start without any data), so it might not be
> related to Bootstrap.
>
> In system.log I do not see any logs like "xxx node dead" or exceptions, and
> both nodes in the test are alive; they serve reads/writes well, too. The
> four connections below between the nodes stay healthy the whole time.
>
>
> tcp0  0 :::192.168.2.87:7000:::192.168.2.88:58447
> ESTABLISHED
> tcp0  0 :::192.168.2.87:54986   :::192.168.2.88:7000
> ESTABLISHED
> tcp0  0 :::192.168.2.87:59138   :::192.168.2.88:7000
> ESTABLISHED
> tcp0  0 :::192.168.2.87:7000:::192.168.2.88:39074
> ESTABLISHED
>
> so the connections that end in CLOSE_WAIT should be newly created (for
> streaming?). This seems related to the streaming issues we suffered
> recently:
> http://n2.nabble.com/busy-thread-on-IncomingStreamReader-td4908640.html
>
> I would like to add some debug code around the opening and closing of
> sockets to find out what happened.
>
> Could you give me a hint about which classes I should look at?
>
>
>
> On Tue, Apr 20, 2010 at 04:47, Jonathan Ellis  wrote:
>
>> Is this after doing a bootstrap or other streaming operation?  Or did
>> a node go down?
>>
>> The internal sockets are supposed to remain open, otherwise.
>>
>> On Mon, Apr 19, 2010 at 10:56 AM, Ingram Chen 
>> wrote:
>> > Thanks for your information.
>> >
>> > We do use connection pools with the thrift client, and ThriftAddress is
>> > on port 9160.
>> >
>> > Those problematic connections we found are all on port 7000, which is the
>> > internal communication port between nodes. I guess this is related to
>> > StreamingService.
>> >
>> > On Mon, Apr 19, 2010 at 23:46, Brandon Williams 
>> wrote:
>> >>
>> >> On Mon, Apr 19, 2010 at 10:27 AM, Ingram Chen 
>> >> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> We have observed several connections between nodes in CLOSE_WAIT
>> >>> after several hours of operation:
>> >>
>> >> This is symptomatic of not pooling your client connections correctly.
>>  Be
>> >> sure you're using one connection per thread, not one connection per
>> >> operation.
>> >> -Brandon
>> >
>> >
>> > --
>> > Ingram Chen
>> > online share order: http://dinbendon.net
>> > blog: http://www.javaworld.com.tw/roller/page/ingramchen
>> >
>>
>
>
>
> --
> Ingram Chen
> online share order: http://dinbendon.net
> blog: http://www.javaworld.com.tw/roller/page/ingramchen
>



-- 
Ingram Chen
online share order: http://dinbendon.net
blog: http://www.javaworld.com.tw/roller/page/ingramchen


RE: How to increase cassandra's performance in read?

2010-04-20 Thread Mark Jones
I too am seeing very slow performance while testing a worst-case scenario of 1
key leading to 1 supercolumn and 1 column beyond that.

Key -> SuperColumn -> 1 Column (of ~ 500 bytes)

Drive utilization is 80-90% and I'm only dealing with 50-70 million rows (with
NO swapping). So far, I've found nothing that helps, including increasing the
keycache from 200k to 500k keys; I'm guessing the hashing prevents better cache
performance.

Read performance is definitely not 3 IOs, based on the utilization factors on
my drives. I'm not sure the issue was ever settled in the previous e-mails as
to how to calculate how many IOs are being done for each read. I've been
testing with clusters of 1, 2, 3 or 4 machines, and so far all I'm seeing with
multiple machines is lower performance in a cluster than alone. I keep assuming
that at some number of nodes the performance will begin to pick up. Three of my
nodes are running with 8GB (6GB Java heap), and one has 4GB (3GB Java heap).
The machine with the smallest memory footprint is the fastest performer on
inserts, but definitely not the fastest on reads.

I'm suspecting the read path relies heavily on the fact that you want to get
many columns that are closely related, because lookup by key appears to be
incredibly slow.

From: yangfeng [mailto:yea...@gmail.com]
Sent: Tuesday, April 20, 2010 7:59 AM
To: user@cassandra.apache.org; d...@cassandra.apache.org
Subject: How to increase cassandra's performance in read?

I get 10 column families by keys, and one column family has 30 columns.
I use multigetSlice once to get the 10 column families, but the performance is
so poor.
Does anyone have other thoughts on how to increase the performance?



Tool for managing cluster nodes?

2010-04-20 Thread Joost Ouwerkerk
What are people using to manage Cassandra cluster nodes?  i.e. to start,
stop, copy config files, etc.  I'm using cssh and wondering if there is a
better way...
Joost.


Re: Tool for managing cluster nodes?

2010-04-20 Thread Roger Schildmeijer
dancer's shell / distributed shell

http://www.netfort.gr.jp/~dancer/software/dsh.html.en

On 20 apr 2010, at 17.18em, Joost Ouwerkerk wrote:

> What are people using to manage Cassandra cluster nodes?  i.e. to start, 
> stop, copy config files, etc.  I'm using cssh and wondering if there is a 
> better way...
> Joost.



Re: How to increase cassandra's performance in read?

2010-04-20 Thread Jonathan Ellis
How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations

On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
> I too am seeing very slow performance while testing worst case scenarios of
> 1 key leading to 1 supercolumn and 1 column beyond that.
>
>
>
> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>
>
>
> Drive utilization is 80-90% and I’m only dealing with 50-70 million rows.
> (With NO swapping)  So far, I’ve found nothing that helps, including
> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
> prevents better cache performance.
>
>
>
> Read performance is definitely not 3 IOs based on the utilization factors on
> my drives.  I'm not sure the issue was ever settled in the previous e-mails
> as to how to calculate how many IOs were being done for each read.  I've
> been testing with clusters of 1,2,3 or 4 machines and so far all I’m seeing
> with multiple machines, is lower performance in a cluster than alone.  I
> keep assuming that at some number of nodes, the performance will begin to
> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
> the fastest performer on inserts, but definitely not the fastest on reads.
>
>
>
> I'm suspecting the read path is relying heavily on the fact that you want to
> get many columns that are closely related, because lookup by key appears to
> be incredibly slow.
>
>
>
> From: yangfeng [mailto:yea...@gmail.com]
> Sent: Tuesday, April 20, 2010 7:59 AM
> To: user@cassandra.apache.org; d...@cassandra.apache.org
> Subject: How to increase cassandra's performance in read?
>
>
>
> I  get 10 columns Family by keys and  one columns Family has 30 columns.
>
> I use multigetSlice once to get 10 column Family.but the performance is so
> poor.
>
> anyone has other  thought to increase the performance.
>
>


RE: How to increase cassandra's performance in read?

2010-04-20 Thread Mark Jones
When I first read this, it bothered me because it seemed like it couldn't be
so.  So I read the link, and it says the same thing, so I have to ask for some
clarification here.

I had always assumed a super column was similar to a local keyspace, and that
the SubColumns under it were similar to keys; that way you could localize the
data for a user or a website.

So Keyspace:Email
  Key:UserID
     SuperColumn Entries:
        Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
        Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
        Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}

I think now this is probably the wrong concept.

It is really more like:
Primary Key: Name:Value pairs

And with Supercolumns, the Value part can be another Hash:
Primary Key: Name: {Name:Value pairs} pairs

But when I look up by Primary Key, ALL of the data associated with the key will
be brought into memory!  So, if I wanted to display the inbox of a user with
several years of email, it would be one HUGE read to suck his entire inbox into
memory just to get to the point where I could display one message.

Is this more correct?

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Tuesday, April 20, 2010 10:47 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations

On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
> I too am seeing very slow performance while testing worst case scenarios of
> 1 key leading to 1 supercolumn and 1 column beyond that.
>
>
>
> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>
>
>
> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
> (With NO swapping)  So far, I've found nothing that helps, including
> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
> prevents better cache performance.
>
>
>
> Read performance is definitely not 3 IOs based on the utilization factors on
> my drives.  I'm not sure the issue was ever settled in the previous e-mails
> as to how to calculate how many IOs were being done for each read.  I've
> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
> with multiple machines, is lower performance in a cluster than alone.  I
> keep assuming that at some number of nodes, the performance will begin to
> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
> the fastest performer on inserts, but definitely not the fastest on reads.
>
>
>
> I'm suspecting the read path is relying heavily on the fact that you want to
> get many columns that are closely related, because lookup by key appears to
> be incredibly slow.
>
>
>
> From: yangfeng [mailto:yea...@gmail.com]
> Sent: Tuesday, April 20, 2010 7:59 AM
> To: user@cassandra.apache.org; d...@cassandra.apache.org
> Subject: How to increase cassandra's performance in read?
>
>
>
> I  get 10 columns Family by keys and  one columns Family has 30 columns.
>
> I use multigetSlice once to get 10 column Family.but the performance is so
> poor.
>
> anyone has other  thought to increase the performance.
>
>


RE: How to increase cassandra's performance in read?

2010-04-20 Thread Mark Jones
Sorry, I didn't answer your question in my response. I have, at this point:

Key(ID)
   When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}

Under some keys (a very small #) there will be 2 values, like:

Key(ID)
   When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}
   When/Where SuperColumn Tag:  and Columns {Data: One Value (not yet written, tags, flags)}

Long term, this list will be in the 1000s, possibly millions.

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Tuesday, April 20, 2010 10:47 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

How many columns are in the supercolumn total?

"in super columnfamilies there is a third level of subcolumns; these
are not indexed, and any request for a subcolumn deserializes _all_
the subcolumns in that supercolumn"

http://wiki.apache.org/cassandra/CassandraLimitations

On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
> I too am seeing very slow performance while testing worst case scenarios of
> 1 key leading to 1 supercolumn and 1 column beyond that.
>
>
>
> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>
>
>
> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
> (With NO swapping)  So far, I've found nothing that helps, including
> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
> prevents better cache performance.
>
>
>
> Read performance is definitely not 3 IOs based on the utilization factors on
> my drives.  I'm not sure the issue was ever settled in the previous e-mails
> as to how to calculate how many IOs were being done for each read.  I've
> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
> with multiple machines, is lower performance in a cluster than alone.  I
> keep assuming that at some number of nodes, the performance will begin to
> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
> the fastest performer on inserts, but definitely not the fastest on reads.
>
>
>
> I'm suspecting the read path is relying heavily on the fact that you want to
> get many columns that are closely related, because lookup by key appears to
> be incredibly slow.
>
>
>
> From: yangfeng [mailto:yea...@gmail.com]
> Sent: Tuesday, April 20, 2010 7:59 AM
> To: user@cassandra.apache.org; d...@cassandra.apache.org
> Subject: How to increase cassandra's performance in read?
>
>
>
> I  get 10 columns Family by keys and  one columns Family has 30 columns.
>
> I use multigetSlice once to get 10 column Family.but the performance is so
> poor.
>
> anyone has other  thought to increase the performance.
>
>


Re: How to increase cassandra's performance in read?

2010-04-20 Thread Jonathan Ellis
Not all the data associated w/ the key is brought into memory, just
all the data associated w/ the supercolumns being queried.

Supercolumns are so you can update a smallish number of subcolumns
independently (e.g. when denormalizing an entire narrow row, usually
with a finite set of columns).  If you want lots of subcolumns you
need to turn that supercolumn into a new row.
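To make that concrete, a hypothetical layout along those lines (column family
and key names invented for illustration) moves each email into its own row and
keeps a small index row for the inbox:

CF EmailBodies (Standard):
   Key "userid:emailid" -> Columns {body, header, tags, recipients, flags}

CF Inbox (Standard):
   Key "userid" -> Columns {emailid1: timestamp, emailid2: timestamp, ...}

Listing an inbox then slices only the small Inbox row, and only the messages
actually displayed are read from EmailBodies.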

On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones  wrote:
> When I first read this, it bothered me because it seemed like it couldn't be
> so.  So I read the link, and it says the same thing, so I have to ask for
> some clarification here.
>
> I had always assumed a super column was similar to a local keyspace, and that 
> the SubColumns under it were similar to keys, that way you could localize the 
> data for a user or a website.
>
> So Keyspace:Email
>  Key:UserID
>     SuperColumn Entries:
>                Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
>                Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
>                Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}
>
> I think now this is probably the wrong concept.
>
> It is really more like:
>        Primary Key: Name:Value pairs
>
> And with Supercolumns, the Value part can be another Hash:
>        Primary Key: Name: {Name:Value pairs} pairs
>
> But when I look up by Primary Key, ALL of the data associated with the key
> will be brought into memory!  So, if I wanted to display the inbox of a
> user with several years of email, it would be one HUGE read to suck his
> entire inbox into memory just to get to the point where I could display one message.
>
> Is this more correct?
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Tuesday, April 20, 2010 10:47 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> How many columns are in the supercolumn total?
>
> "in super columnfamilies there is a third level of subcolumns; these
> are not indexed, and any request for a subcolumn deserializes _all_
> the subcolumns in that supercolumn"
>
> http://wiki.apache.org/cassandra/CassandraLimitations
>
> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
>> I too am seeing very slow performance while testing worst case scenarios of
>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>
>>
>>
>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>
>>
>>
>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>> (With NO swapping)  So far, I've found nothing that helps, including
>> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
>> prevents better cache performance.
>>
>>
>>
>> Read performance is definitely not 3 IOs based on the utilization factors on
>> my drives.  I'm not sure the issue was ever settled in the previous e-mails
>> as to how to calculate how many IOs were being done for each read.  I've
>> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
>> with multiple machines, is lower performance in a cluster than alone.  I
>> keep assuming that at some number of nodes, the performance will begin to
>> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
>> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
>> the fastest performer on inserts, but definitely not the fastest on reads.
>>
>>
>>
>> I'm suspecting the read path is relying heavily on the fact that you want to
>> get many columns that are closely related, because lookup by key appears to
>> be incredibly slow.
>>
>>
>>
>> From: yangfeng [mailto:yea...@gmail.com]
>> Sent: Tuesday, April 20, 2010 7:59 AM
>> To: user@cassandra.apache.org; d...@cassandra.apache.org
>> Subject: How to increase cassandra's performance in read?
>>
>>
>>
>> I  get 10 columns Family by keys and  one columns Family has 30 columns.
>>
>> I use multigetSlice once to get 10 column Family.but the performance is so
>> poor.
>>
>> anyone has other  thought to increase the performance.
>>
>>
>


RE: How to increase cassandra's performance in read?

2010-04-20 Thread Mark Jones
To make sure I'm clear on what you are saying:

  Are the "Individual Emails" in the example below SuperColumns, and the {body,
header, tags...} the subcolumns?

Is that a sane data layout for an email system, where the SuperColumn
identifier is the "conversation label"?

Sorry to be so daft, but the way columns and rows are bandied about in NoSQL is
a bit confusing when you are coming from a SQL background.  I can't see why you
would want multiple emails in the same row, since they each have the same
"columns" of information and therefore make good logical entities as outlined
below.

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Tuesday, April 20, 2010 11:16 AM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

Not all the data associated w/ the key is brought into memory, just
all the data associated w/ the supercolumns being queried.

Supercolumns are so you can update a smallish number of subcolumns
independently (e.g. when denormalizing an entire narrow row, usually
with a finite set of columns).  If you want lots of subcolumns you
need to turn that supercolumn into a new row.

On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones  wrote:
> When I first read this, it bothered me because it seemed like it couldn't be
> so.  So I read the link, and it says the same thing, so I have to ask for
> some clarification here.
>
> I had always assumed a super column was similar to a local keyspace, and that 
> the SubColumns under it were similar to keys, that way you could localize the 
> data for a user or a website.
>
> So Keyspace:Email
>  Key:UserID
> SuperColumn Entries:
>Individual Email 1:  Columns {body, header, tags, recipients, flags, whatever}
>Individual Email 2:  Columns {body, header, tags, recipients, flags, whatever}
>Individual Email 3:  Columns {body, header, tags, recipients, flags, whatever}
>
> I think now this is probably the wrong concept.
>
> It is really more like:
>Primary Key: Name:Value pairs
>
> And with Supercolumns, the Value part can be another Hash:
>Primary Key: Name: {Name:Value pairs} pairs
>
> But when I look up by Primary Key, ALL of the data associated with the key
> will be brought into memory!  So, if I wanted to display the inbox of a
> user with several years of email, it would be one HUGE read to suck his
> entire inbox into memory just to get to the point where I could display one message.
>
> Is this more correct?
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Tuesday, April 20, 2010 10:47 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> How many columns are in the supercolumn total?
>
> "in super columnfamilies there is a third level of subcolumns; these
> are not indexed, and any request for a subcolumn deserializes _all_
> the subcolumns in that supercolumn"
>
> http://wiki.apache.org/cassandra/CassandraLimitations
>
> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
>> I too am seeing very slow performance while testing worst case scenarios of
>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>
>>
>>
>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>
>>
>>
>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>> (With NO swapping)  So far, I've found nothing that helps, including
>> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
>> prevents better cache performance.
>>
>>
>>
>> Read performance is definitely not 3 IOs based on the utilization factors on
>> my drives.  I'm not sure the issue was ever settled in the previous e-mails
>> as to how to calculate how many IOs were being done for each read.  I've
>> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
>> with multiple machines, is lower performance in a cluster than alone.  I
>> keep assuming that at some number of nodes, the performance will begin to
>> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
>> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
>> the fastest performer on inserts, but definitely not the fastest on reads.
>>
>>
>>
>> I'm suspecting the read path is relying heavily on the fact that you want to
>> get many columns that are closely related, because lookup by key appears to
>> be incredibly slow.
>>
>>
>>
>> From: yangfeng [mailto:yea...@gmail.com]
>> Sent: Tuesday, April 20, 2010 7:59 AM
>> To: user@cassandra.apache.org; d...@cassandra.apache.org
>> Subject: How to increase cassandra's performance in read?
>>
>>
>>
>> I  get 10 columns Family by keys and  one columns Family has 30 columns.
>>
>> I use multigetSlice once to get 10 column Family.but the performance is so
>> poor.
>>
>> anyone has other  thought to increase the performance.
>>
>>
>


Re: Modelling assets and user permissions

2010-04-20 Thread tsuraan
> Suppose I have a CF that holds some sort of assets that some users of
> my program have access to, and that some do not.  In SQL-ish terms it
> would look something like this:
>
> TABLE Assets (
>  asset_id serial primary key,
>  ...
> );
>
> TABLE Users (
>  user_id serial primary key,
>  user_name text
> );
>
> TABLE Permissions (
>  asset_id integer references(Assets),
>  user_id integer references(Users)
> )
>
> Now, I can generate UUIDs for my asset keys without any trouble, so
> the serial that I have in my pseudo-SQL Assets table isn't a problem.
> My problem is that I can't see a good way to model the relationship
> between user ids and assets.  I see one way to do this, which has
> problems, and I think I sort of see a second way.
>
> The obvious way to do it is have the Assets CF have a SuperColumn that
> somehow enumerates the users allowed to see it, so when retrieving a
> specific Asset I can retrieve the users list and ensure that the user
> doing the request is allowed to see it.  This has quite a few
> problems.  The foremost is that Cassandra doesn't appear to have much
> for conflict resolution (at least I can't find any docs on it), so if
> two processes try to add permissions to the same Asset, it looks like
> one process will win and I have no idea what happens to the loser.
> Another problem is that Cassandra's SuperColumns don't appear to be
> ideal for storing lists of things; they store maps, which isn't a
> terrible problem, but it feels like a bit of a mismatch in my design.
> A SuperColumn mapping from user_ids to an empty byte array seems like
> it should work pretty efficiently for checking whether a user has
> permissions on an Asset, but it also seems pretty evil.
>
> The other idea that I have is a separate CF for AssetPermissions that
> somehow stores pairs of asset_ids and user_names.  I don't know what
> I'd use for a key in that situation, so I haven't really gotten too
> far in seeing what else is broken with that idea.  I think it would
> get around the race condition, but I don't know how to do it, and I'm
> not sure how efficient it could be.
>
> What do people normally use in this situation?  I assume it's a pretty
> common problem, but I haven't seen it in the various data modelling
> examples on the Wiki.

I'm wondering, is my question too vague, too specific, off topic for
this list, or answered in the docs somewhere that I missed?
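For what it's worth, one layout that sidesteps the conflict worry (illustrative
only, names invented): writes to *different* columns of the same row do not
conflict in Cassandra, since each column is reconciled independently by
timestamp, so a plain column family keyed by asset works:

CF AssetPermissions (Standard):
   Key asset_id -> Columns {user_id_1: "", user_id_2: "", ...}

Granting a permission inserts one column, revoking removes one column, and
checking is a get of a single column name; two processes granting access to
different users on the same asset touch different columns and both succeed.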


Filters

2010-04-20 Thread Christian Torres
Hello!

Is there any way to make filters (WHEREs) in cassandra? Or do I have to manage
it myself?

For example:

I have a ColumnFamily with a column in each row whose value is a state, Public
or Private, so I want to filter all the rows that are private, and also the
public ones in another form... Besides, in those rows I will have names of
persons, and I'll need to filter by initials or complete last names, etc.

*So any idea?*

Regards

-- 
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: 0.6.1 insert 1B rows, crashed when using py_stress

2010-04-20 Thread Tatu Saloranta
On Mon, Apr 19, 2010 at 7:12 PM, Brandon Williams  wrote:
> On Mon, Apr 19, 2010 at 9:06 PM, Schubert Zhang  wrote:
>>
>> 2. Reject the request when be short of resource, instead of throws OOME
>> and exit (crash).
>
> Right, that is the crux of the problem  It will be addressed here:
> https://issues.apache.org/jira/browse/CASSANDRA-685

I think it would be great to get such "graceful degradation"
implemented: first thing any service should do is to protect itself
against meltdown.
Clients are better served by getting 50x responses (or rather their
equivalent for thrift) to indicate transient overload than by letting the
system get into a GC death spiral, where requests time out but still consume
significant amounts of resources, especially since returning an error
response is usually rather cheap compared to doing full processing.
It should then also be easy to hook up failure information via JMX to
expose it and allow alarming.

But this is of course more difficult with a distributed setup, especially
since different QoS for different requests would help (for example:
communication between nodes & other things related to "accepted" requests
should have higher priority than new incoming requests).

-+ Tatu +-
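As a sketch of the "reject instead of dying" idea in plain Java (not
Cassandra's actual code; CASSANDRA-685 above tracks the real work), a bounded
executor turns overload into a catchable rejection that can be mapped to a
cheap "overloaded, retry later" error:

import java.util.concurrent.*;

public class AdmissionControlSketch {
    public static void main(String[] args) {
        // Fixed worker pool with a bounded backlog: when the queue is full,
        // execute() throws instead of letting work pile up until the heap dies.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1024),
                new ThreadPoolExecutor.AbortPolicy());

        try {
            executor.execute(new Runnable() {
                public void run() { /* handle one request */ }
            });
        } catch (RejectedExecutionException e) {
            // Overloaded: return an inexpensive "try again later" error
            // to the client instead of grinding into a GC death spiral.
        }
        executor.shutdown();
    }
}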


RE: Filters

2010-04-20 Thread Mark Jones
You will have to pull the columns and filter yourself.

From: Christian Torres [mailto:chtor...@gmail.com]
Sent: Tuesday, April 20, 2010 11:50 AM
To: user@cassandra.apache.org
Cc: d...@cassandra.apache.org
Subject: Filters

Hello!

Is there any way to make filters (WHEREs) in cassandra? Or do I have to manage
it myself?

For example:

I have a ColumnFamily with a column in each row whose value is a state, Public
or Private, so I want to filter all the rows that are private, and also the
public ones in another form... Besides, in those rows I will have names of
persons, and I'll need to filter by initials or complete last names, etc.

So any idea?

Regards

--
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


RE: 0.6.1 insert 1B rows, crashed when using py_stress

2010-04-20 Thread Mark Jones
I would think this is on the roadmap, just not available yet.  To a large
degree, it can be managed by adjusting the heap size.

-Original Message-
From: Tatu Saloranta [mailto:tsalora...@gmail.com]
Sent: Tuesday, April 20, 2010 12:18 PM
To: user@cassandra.apache.org
Subject: Re: 0.6.1 insert 1B rows, crashed when using py_stress

On Mon, Apr 19, 2010 at 7:12 PM, Brandon Williams  wrote:
> On Mon, Apr 19, 2010 at 9:06 PM, Schubert Zhang  wrote:
>>
>> 2. Reject the request when be short of resource, instead of throws OOME
>> and exit (crash).
>
> Right, that is the crux of the problem  It will be addressed here:
> https://issues.apache.org/jira/browse/CASSANDRA-685

I think it would be great to get such "graceful degradation"
implemented: first thing any service should do is to protect itself
against meltdown.
Clients are better served by getting 50x responses (or rather their
equivalent for thrift) to indicate transient overload than by letting the
system get into a GC death spiral, where requests time out but still consume
significant amounts of resources, especially since returning an error
response is usually rather cheap compared to doing full processing.
It should then also be easy to hook up failure information via JMX to
expose it and allow alarming.

But this is of course more difficult with a distributed setup, especially
since different QoS for different requests would help (for example:
communication between nodes & other things related to "accepted" requests
should have higher priority than new incoming requests).

-+ Tatu +-


Re: Re: Modelling assets and user permissions

2010-04-20 Thread charleswoerner
The short answer as to what people normally do is that they use a relational
database for something like this.

I'm curious as to how you would have so many asset/user permissions that
you couldn't use a standard relational database to model them. Is this some
sort of multi-tenant system where you're providing some generalized asset
check-out mechanism to many, many customers? Even so, I'm not sure the
eventually consistent model wouldn't open you up to check-out collisions,
as you mention yourself.

Am I missing something about your example?

On Apr 20, 2010 9:47am, tsuraan  wrote:
> Suppose I have a CF that holds some sort of assets that some users of
> my program have access to, and that some do not. In SQL-ish terms it
> would look something like this:
>
> TABLE Assets (
>   asset_id serial primary key,
>   ...
> );
>
> TABLE Users (
>   user_id serial primary key,
>   user_name text
> );
>
> TABLE Permissions (
>   asset_id integer references(Assets),
>   user_id integer references(Users)
> )
>
> Now, I can generate UUIDs for my asset keys without any trouble, so
> the serial that I have in my pseudo-SQL Assets table isn't a problem.
> My problem is that I can't see a good way to model the relationship
> between user ids and assets. I see one way to do this, which has
> problems, and I think I sort of see a second way.
>
> The obvious way to do it is have the Assets CF have a SuperColumn that
> somehow enumerates the users allowed to see it, so when retrieving a
> specific Asset I can retrieve the users list and ensure that the user
> doing the request is allowed to see it. This has quite a few
> problems. The foremost is that Cassandra doesn't appear to have much
> for conflict resolution (at least I can't find any docs on it), so if
> two processes try to add permissions to the same Asset, it looks like
> one process will win and I have no idea what happens to the loser.
> Another problem is that Cassandra's SuperColumns don't appear to be
> ideal for storing lists of things; they store maps, which isn't a
> terrible problem, but it feels like a bit of a mismatch in my design.
> A SuperColumn mapping from user_ids to an empty byte array seems like
> it should work pretty efficiently for checking whether a user has
> permissions on an Asset, but it also seems pretty evil.
>
> The other idea that I have is a separate CF for AssetPermissions that
> somehow stores pairs of asset_ids and user_names. I don't know what
> I'd use for a key in that situation, so I haven't really gotten too
> far in seeing what else is broken with that idea. I think it would
> get around the race condition, but I don't know how to do it, and I'm
> not sure how efficient it could be.
>
> What do people normally use in this situation? I assume it's a pretty
> common problem, but I haven't seen it in the various data modelling
> examples on the Wiki.
>
> I'm wondering, is my question too vague, too specific, off topic for
> this list, or answered in the docs somewhere that I missed?




Re: Tool for managing cluster nodes?

2010-04-20 Thread B. Todd Burruss

http://sourceforge.net/projects/clusterssh/

Roger Schildmeijer wrote:
> dancer's shell / distributed shell
>
> http://www.netfort.gr.jp/~dancer/software/dsh.html.en
>
> On 20 apr 2010, at 17.18em, Joost Ouwerkerk wrote:
>
>> What are people using to manage Cassandra cluster nodes?  i.e. to start,
>> stop, copy config files, etc.  I'm using cssh and wondering if there is a
>> better way...
>> Joost.




Re: Filters

2010-04-20 Thread Christian Torres
Mmmm...

According to this doc, http://wiki.apache.org/cassandra/API#get_slice, which a
developer mailed to me, it's possible!!

I sent it to you as a reference.

On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones  wrote:

>  You will have to pull the columns and filter yourself.
>
>
>
> *From:* Christian Torres [mailto:chtor...@gmail.com]
> *Sent:* Tuesday, April 20, 2010 11:50 AM
> *To:* user@cassandra.apache.org
> *Cc:* d...@cassandra.apache.org
> *Subject:* Filters
>
>
>
> Hello!
>
> Is there any way to make filters (WHEREs) in cassandra? Or do I have to
> manage it myself?
>
> For example:
>
> I have a ColumnFamily with a column in each row whose value is a state...
> Public or Private, so I want to filter all rows that are private and also
> the public ones in other form... Beside in that rows I will have names of
> persons and I'll need to filter by Initials or Complete Lastnames, etc.
>
> *So any idea?*
>
> Regards
>
> --
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
>



-- 
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: Cassandra Java Client

2010-04-20 Thread Nathan McCall
Dop,
Thank you for trying out hector. I think you have the right approach
for using it with your project. Feel free to ping us directly
regarding Hector on either of these mailing lists, as appropriate:
http://wiki.github.com/rantav/hector/mailing-lists

Cheers,
-Nate

On Tue, Apr 20, 2010 at 7:11 AM, Dop Sun  wrote:
> Hi,
>
>
>
> I have downloaded hector-0.6.0-10.jar. As you mentioned, it has good
> implementation for the connection pooling, JMX counters.
>
>
>
> What I’m doing is: using Hector to create the Cassandra client (be specific:
> borrow_client(url, port)). And my understanding is: in this way, the
> Jassandra will enjoy the client pool and JMX counter.
>
>
>
> http://code.google.com/p/jassandra/issues/detail?id=17
>
>
>
> Please feel free to let me know if you have any suggestions.
>
>
>
> The new build 1.0.0 build 3(http://code.google.com/p/jassandra/) created.
> From Jassandra client side, no API changes.
>
>
>
> Cheers~~~
>
> Dop
>
>
>
> From: Ran Tavory [mailto:ran...@gmail.com]
> Sent: Tuesday, April 20, 2010 1:36 AM
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Java Client
>
>
>
> Hi Dop, you may want to look at hector as a low level cassandra client on
> which you build jassandra, adding hibernate style magic etc like other ppl
> have done with ORM layers on top of it.
>
> Hector's main features include extensive jmx counters, failover and
> connection pooling.
>
> It's available for all recent versions, including 0.5.0, 0.5.1, 0.6.0 and
> 0.6.1
>
> On Mon, Apr 19, 2010 at 5:58 PM, Dop Sun  wrote:
>
> Well, there are couple of points while Jassandra is created:
>
> 1. First of all, I want to create something like that is because I come from
> JDBC background, and familiar with Hibernate API. The ICriteria (which is
> created for querying) is inspired by the Criteria API from hibernate.
>
> Actually, maybe because of this background, it cost me a lot efforts try to
> understand Cassandra in the beginning and Thrift API also takes time to use.
>
> 2. The Jassandra creates a layer, which removes the direct link to
> underlying Thrift API (including the exceptions, ConsistencyLevel
> enumeration etc)
>
> High light this point because I believe the client of the Jassandra will
> benefit for the implementation changes in future, for example, if the
> Cassandra provides better Thrift API to selecting the columns for a list of
> keys, SCFs, or deprecating some structures, exceptions, the client may not
> be changed. Of cause, if Jassandra failed to approve itself, this is
> actually not the advantage. :)
>
> 3. The Jassandra is designed to be an JDBC like API, no less, no more. It
> strives to use the best API to do the quering (with token, key, SCF/ CF),
> doing the CRUD, but no more than that. For example, it does not cover any
> API like object mapping. But it should cover all the API functionalities
> Thrift provided.
>
> These 3 points, are different from Hector (I should be honest that I have
> not tried to use it before, the feeling of difference are coming from the
> sample code Hector provided).
>
> So, the API Jassandra abstracted was something like this:
>
>    IConnection connection = DriverManager.getConnection(
>        "thrift://localhost:9160", info);
>    try {
>      // 2. Get a KeySpace by name
>      IKeySpace keySpace = connection.getKeySpace("Keyspace1");
>
>      // 3. Get a ColumnFamily by name
>      IColumnFamily cf = keySpace.getColumnFamily("Standard2");
>
>      // 4. Insert like this
>      long now = System.currentTimeMillis();
>      ByteArray nameFirst = ByteArray.ofASCII("first");
>      ByteArray nameLast = ByteArray.ofASCII("last");
>      ByteArray nameAge = ByteArray.ofASCII("age");
>      ByteArray valueLast = ByteArray.ofUTF8("Smith");
>      IColumn colFirst = new Column(nameFirst, ByteArray.ofUTF8("John"),
> now);
>      cf.insert(userName, colFirst);
>
>      IColumn colLast = new Column(nameLast, valueLast, now);
>      cf.insert(userName, colLast);
>
>      IColumn colAge = new Column(nameAge, ByteArray.ofLong(42), now);
>      cf.insert(userName, colAge);
>
>      // 5. Select like this
>      ICriteria criteria = cf.createCriteria();
>      criteria.keyList(Lists.newArrayList(userName))
>          .columnRange(nameAge, nameLast, 10);
>      Map<ByteArray, List<IColumn>> map = criteria.select();
>      List<IColumn> list = map.get(userName);
>      Assert.assertEquals(3, list.size());
>      Assert.assertEquals(valueLast, list.get(2).getValue());
>
>      // 6. Delete like this
>      cf.delete(userName, colFirst);
>      map = criteria.select();
>      Assert.assertEquals(2, map.get(userName).size());
>
>      // 7. Get count like this
>      criteria = cf.createCriteria();
>      criteria.keyList(Lists.newArrayList(userName));
>      int count = criteria.count();
>      Assert.assertEquals(2, count);
>    } finally {
>      // 8. Don't forget to close the connection.
>      connection.close();
>
>    }
>  }
>
> -Original Message-
> From: Jonathan Ellis [mailto:jb

Re: Cassandra Java Client

2010-04-20 Thread Ran Tavory
great, I'm happy you found hector useful and reused it in your client.

On Tue, Apr 20, 2010 at 5:11 PM, Dop Sun  wrote:

>  Hi,
>
>
>
> I have downloaded hector-0.6.0-10.jar. As you mentioned, it has good
> implementation for the connection pooling, JMX counters.
>
>
>
> What I’m doing is: using Hector to create the Cassandra client (be
> specific: borrow_client(url, port)). And my understanding is: in this way,
> the Jassandra will enjoy the client pool and JMX counter.
>
>
>
> http://code.google.com/p/jassandra/issues/detail?id=17
>
>
>
> Please feel free to let me know if you have any suggestions.
>
>
>
> The new build 1.0.0 build 3(http://code.google.com/p/jassandra/) created.
> From Jassandra client side, no API changes.
>
>
>
> Cheers~~~
>
> Dop
>
>
>
> *From:* Ran Tavory [mailto:ran...@gmail.com]
> *Sent:* Tuesday, April 20, 2010 1:36 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Cassandra Java Client
>
>
>
> Hi Dop, you may want to look at hector as a low level cassandra client on
> which you build jassandra, adding hibernate style magic etc like other ppl
> have done with ORM layers on top of it.
>
> Hector's main features include extensive jmx counters, failover and
> connection pooling.
>
> It's available for all recent versions, including 0.5.0, 0.5.1, 0.6.0 and
> 0.6.1
>
> On Mon, Apr 19, 2010 at 5:58 PM, Dop Sun  wrote:
>
> Well, there are couple of points while Jassandra is created:
>
> 1. First of all, I want to create something like that is because I come
> from
> JDBC background, and familiar with Hibernate API. The ICriteria (which is
> created for querying) is inspired by the Criteria API from hibernate.
>
> Actually, maybe because of this background, it cost me a lot efforts try to
> understand Cassandra in the beginning and Thrift API also takes time to
> use.
>
> 2. The Jassandra creates a layer, which removes the direct link to
> underlying Thrift API (including the exceptions, ConsistencyLevel
> enumeration etc)
>
> High light this point because I believe the client of the Jassandra will
> benefit for the implementation changes in future, for example, if the
> Cassandra provides better Thrift API to selecting the columns for a list of
> keys, SCFs, or deprecating some structures, exceptions, the client may not
> be changed. Of cause, if Jassandra failed to approve itself, this is
> actually not the advantage. :)
>
> 3. The Jassandra is designed to be an JDBC like API, no less, no more. It
> strives to use the best API to do the quering (with token, key, SCF/ CF),
> doing the CRUD, but no more than that. For example, it does not cover any
> API like object mapping. But it should cover all the API functionalities
> Thrift provided.
>
> These 3 points, are different from Hector (I should be honest that I have
> not tried to use it before, the feeling of difference are coming from the
> sample code Hector provided).
>
> So, the API Jassandra abstracted was something like this:
>
>IConnection connection = DriverManager.getConnection(
>"thrift://localhost:9160", info);
>try {
>  // 2. Get a KeySpace by name
>  IKeySpace keySpace = connection.getKeySpace("Keyspace1");
>
>  // 3. Get a ColumnFamily by name
>  IColumnFamily cf = keySpace.getColumnFamily("Standard2");
>
>  // 4. Insert like this
>  long now = System.currentTimeMillis();
>  ByteArray nameFirst = ByteArray.ofASCII("first");
>  ByteArray nameLast = ByteArray.ofASCII("last");
>  ByteArray nameAge = ByteArray.ofASCII("age");
>  ByteArray valueLast = ByteArray.ofUTF8("Smith");
>  IColumn colFirst = new Column(nameFirst, ByteArray.ofUTF8("John"),
> now);
>  cf.insert(userName, colFirst);
>
>  IColumn colLast = new Column(nameLast, valueLast, now);
>  cf.insert(userName, colLast);
>
>  IColumn colAge = new Column(nameAge, ByteArray.ofLong(42), now);
>  cf.insert(userName, colAge);
>
>  // 5. Select like this
>  ICriteria criteria = cf.createCriteria();
>  criteria.keyList(Lists.newArrayList(userName))
>  .columnRange(nameAge, nameLast, 10);
>  Map<ByteArray, List<IColumn>> map = criteria.select();
>  List<IColumn> list = map.get(userName);
>  Assert.assertEquals(3, list.size());
>  Assert.assertEquals(valueLast, list.get(2).getValue());
>
>  // 6. Delete like this
>  cf.delete(userName, colFirst);
>  map = criteria.select();
>  Assert.assertEquals(2, map.get(userName).size());
>
>  // 7. Get count like this
>  criteria = cf.createCriteria();
>  criteria.keyList(Lists.newArrayList(userName));
>  int count = criteria.count();
>  Assert.assertEquals(2, count);
>} finally {
>  // 8. Don't forget to close the connection.
>  connection.close();
>
>}
>  }
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Monday, April 19, 2010 10:35 PM
> To: user@cassandra.apache.org
>
> Subject: Re: Cassandra Java Client
>
> How is Jassandra different from http://github.com/ra

RE: Filters

2010-04-20 Thread Mark Jones
If you notice, the SlicePredicate accepts column names, but not values.  You
can tell it to pull these 3 columns, but there is no "if/where" in there.

SliceRange is, I think, based on ranges of column names; it doesn't have a way
to pair up column names with values.

From: Christian Torres [mailto:chtor...@gmail.com]
Sent: Tuesday, April 20, 2010 12:25 PM
To: user@cassandra.apache.org
Subject: Re: Filters

Mmmm...

According with this doc http://wiki.apache.org/cassandra/API#get_slice that a 
developer mailed to me It's possible!!

I sent you as reference
On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones 
mailto:mjo...@imagehawk.com>> wrote:
You will have to pull the columns and filter yourself.

From: Christian Torres [mailto:chtor...@gmail.com]
Sent: Tuesday, April 20, 2010 11:50 AM
To: user@cassandra.apache.org
Cc: d...@cassandra.apache.org
Subject: Filters

Hello!

Is there any way to make filters (WHEREs) in cassandra? Or do I have to manage
it myself?

For example:

I have a ColumnFamily with a column in each row whose value is a state, Public
or Private, so I want to filter all the rows that are private, and also the
public ones in another form... Besides, in those rows I will have names of
persons, and I'll need to filter by initials or complete last names, etc.

So any idea?

Regards

--
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming



--
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: Filters

2010-04-20 Thread Miguel Verde
http://wiki.apache.org/cassandra/API#get_slice
get_slice retrieves the values for either (a) a list of column names or (b)
a range of columns, depending on the SlicePredicate you use.  It does not
allow you to filter a la SQL's WHERE.  You would need to create your own
index to do so, at least until secondary indices are implemented in
Cassandra (not until 0.8 at least, feel free to follow
https://issues.apache.org/jira/browse/CASSANDRA-749 )
On Tue, Apr 20, 2010 at 12:24 PM, Christian Torres wrote:

> Mmmm...
>
> According to this doc http://wiki.apache.org/cassandra/API#get_slice that a 
> developer mailed to me, it's possible!!
>
> I sent it to you as a reference.
>
>
> On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones  wrote:
>
>>  You will have to pull the columns and filter yourself.
>>
>>
>>
>> *From:* Christian Torres [mailto:chtor...@gmail.com]
>> *Sent:* Tuesday, April 20, 2010 11:50 AM
>> *To:* user@cassandra.apache.org
>> *Cc:* d...@cassandra.apache.org
>> *Subject:* Filters
>>
>>
>>
>> Hello!
>>
>> Is there any way to make filters (WHEREs) in cassandra? Or do I have to 
>> manage it myself?
>>
>> For example:
>>
>> I have a ColumnFamily with a column in each row whose value is a state...
>> Public or Private, so I want to filter all the rows that are private, and
>> likewise all the public ones... Besides, in those rows I will have names of
>> persons and I'll need to filter by initials or complete last names, etc.
>>
>> *So any idea?*
>>
>> Regards
>>
>> --
>> Christian Torres * Desarrollador Web * Guegue.com *
>> Celular: +505 84 65 92 62 * Loving of the Programming
>>
>
>
>
> --
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
>


Re: Filters

2010-04-20 Thread Roger Schildmeijer
My bad. Missed your one-to-one relationship (row key <-> column).

On 20 apr 2010, at 19.24, Christian Torres wrote:

> Mmmm...
> 
> According to this doc http://wiki.apache.org/cassandra/API#get_slice that a 
> developer mailed to me, it's possible!!
> 
> I sent it to you as a reference.
> 
> On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones  wrote:
> You will have to pull the columns and filter yourself.
> 
>  
> From: Christian Torres [mailto:chtor...@gmail.com] 
> Sent: Tuesday, April 20, 2010 11:50 AM
> To: user@cassandra.apache.org
> Cc: d...@cassandra.apache.org
> Subject: Filters
> 
>  
> Hello!
> 
> Is there any way to make filters (WHEREs) in cassandra? Or do I have to 
> manage it myself?
> 
> For example:
> 
> I have a ColumnFamily with a column in each row whose value is a state... 
> Public or Private, so I want to filter all the rows that are private, and 
> likewise all the public ones... Besides, in those rows I will have names of 
> persons and I'll need to filter by initials or complete last names, etc.
> 
> So any idea?
> 
> Regards
> 
> -- 
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
> 
> 
> 
> 
> -- 
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming



cleaning house

2010-04-20 Thread B. Todd Burruss
i'm trying to draw some correlation between the size of my data and the 
space used on disk.  i have set <GCGraceSeconds> to 1 so 
there isn't any reason to keep data around.


my approach is this:

after only doing "puts" to cassandra for a while i stop my client and 
want to perform the proper "cleanup" and/or "compact" operations that 
will reduce the disk space used to a minimum.  however i can't seem to 
figure it out.  i've done "major compaction", "cleanup", etc. but it 
doesn't seem to get the job done.


so two questions

- what procedure is suggested to get rid of all unnecessary data?
- and what does the following "Compacted" file mean?  seems like it is 
marking "88" as compacted, but there are no more compactions happening 
according to compaction mgr


-rw-rw-r-- 1 bburruss bburruss  0 Apr 20 08:32 bucket-88-Compacted
-rw-rw-r-- 1 bburruss bburruss 1445218042 Apr 19 21:39 bucket-88-Data.db
-rw-rw-r-- 1 bburruss bburruss   12255925 Apr 19 21:39 bucket-88-Filter.db
-rw-rw-r-- 1 bburruss bburruss  451806386 Apr 19 21:39 bucket-88-Index.db



Re: 0.6 insert performance .... Re: [RELEASE] 0.6.1

2010-04-20 Thread Masood Mortazavi
You're welcome Schubert.
I look forward to any new results you may come up with.

{ It would also be interesting, when you run your tests again, to look at
the GC logs and see to what extent
https://issues.apache.org/jira/browse/CASSANDRA-896 is the culprit for what
you will see. Identifying any other bottlenecks would be good, too. By the
way, it is always good to list what JVM you're using. }
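
(For anyone wanting to make that check: these are the standard HotSpot
GC-logging flags, added to the JVM_OPTS block in cassandra.in.sh; the log
path is just an example:)

        -verbose:gc \
        -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/cassandra/gc.log \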

On Mon, Apr 19, 2010 at 8:18 PM, Schubert Zhang  wrote:

> Since the scale of the GC graph in the slides is different from the
> throughput ones, I will do another test for this issue.
> Thanks for your advices, Masood and Jonathan.
>
> ---
> Here, i just post my cassandra.in.sh.
> JVM_OPTS=" \
> -ea \
> -Xms128M \
> -Xmx6G \
> -XX:TargetSurvivorRatio=90 \
> -XX:+AggressiveOpts \
> -XX:+UseParNewGC \
> -XX:+UseConcMarkSweepGC \
> -XX:+CMSParallelRemarkEnabled \
> -XX:SurvivorRatio=128 \
> -XX:MaxTenuringThreshold=0 \
> -Dcom.sun.management.jmxremote.port=8081 \
> -Dcom.sun.management.jmxremote.ssl=false \
> -Dcom.sun.management.jmxremote.authenticate=false"
>
>
> On Tue, Apr 20, 2010 at 5:46 AM, Masood Mortazavi <
> masoodmortaz...@gmail.com> wrote:
>
>> Minimizing GC pauses or minimizing time slots allocated to GC pauses --
>> either through configuration or re-implementations of garbage collection
>> "bottlenecks" (i.e. object-generation "bottlenecks") -- seem to be the
>> immediate approach. (Other approaches appear to be more intrusive.)
>> At code level, using the GC logs, one can investigate further. There may
>> be places were some object recycling can make some larger difference.
>> Trying this first will probably bear more immediate fruit.
>>
>> - m.
>>
>>
>> On Mon, Apr 19, 2010 at 9:11 AM, Daniel Kluesing  wrote:
>>
>>>  We see this behavior as well with 0.6; heap usage graphs look almost
>>> identical. The GC is a noticeable bottleneck; we’ve tried the JDK 6u19 and
>>> JRockit VMs. It basically kills any kind of soft real-time behavior.
>>>
>>>
>>>
>>> *From:* Masood Mortazavi [mailto:masoodmortaz...@gmail.com]
>>> *Sent:* Monday, April 19, 2010 4:15 AM
>>> *To:* user@cassandra.apache.org; d...@cassandra.apache.org
>>> *Subject:* 0.6 insert performance .... Re: [RELEASE] 0.6.1
>>>
>>>
>>>
>>> I wonder if anyone can use:
>>>
>>>  * Add logging of GC activity (CASSANDRA-813)
>>> to confirm this:
>>>
>>> http://www.slideshare.net/schubertzhang/cassandra-060-insert-throughput
>>>
>>> - m.
>>>
>>>  On Sun, Apr 18, 2010 at 6:58 PM, Eric Evans 
>>> wrote:
>>>
>>>
>>> Hot on the trails of 0.6.0 comes our latest, 0.6.1. This stable point
>>> release contains a number of important bugfixes[1] and is a painless
>>> upgrade from 0.6.0.
>>>
>>> Enjoy!
>>>
>>> [1]: http://bit.ly/9NqwAb (changelog)
>>>
>>> --
>>> Eric Evans
>>> eev...@rackspace.com
>>>
>>>
>>>
>>
>>
>


Re: Re: Modelling assets and user permissions

2010-04-20 Thread tsuraan
> I'm curious as to how you would have so many asset / user permissions that
> you couldn't use a standard relational database to model them. Is this some
> sort of multi-tenant system where you're providing some generalized asset
> check-out mechanism to many, many customers? Even so, I'm not sure the
> eventually consistent model wouldn't open you up to check-out collisions, as
> you mention yourself.

The assets are binary files on a document tracking system.  Our
current platform is postgres-backed; the entire system we've written
is fairly easily distributed across multiple computers, but postgres
isn't.  There are reliable databases that do scale out, but they tend
to be a little on the pricey side...  Our current system works well in
the tens to hundreds of millions of documents with hundreds of users,
but we're hitting the billions of documents with thousands of users,
so cassandra's scaling properties are pretty appealing there.

I don't think eventual consistency would be a terrible problem; so
long as our system lives in a rack, or at least in a single data
center, I think the database would become consistent before the
documents would be visible to any users of the system.

> Am I missing something about your example?

Just the scale, I think.  I like relational databases, but I'm really
interested in trying out cassandra's way, if I can come up with a sane
way to model my system in it.


Delete row

2010-04-20 Thread Sonny Heer
How do i delete a row using the BMT method?

Do I simply do a mutate with column delete flag set to true?  Thanks.


Re: cleaning house

2010-04-20 Thread Benjamin Black
Are you deleting data through the API or just doing a bunch of inserts
and then running a compaction?  The latter will not result in anything
to clean up since data must be explicitly deleted.


b

On Tue, Apr 20, 2010 at 10:33 AM, B. Todd Burruss  wrote:
> i'm trying to draw some correlation between the size of my data and the
> space used on disk.  i have set <GCGraceSeconds> to 1 so there
> isn't any reason to keep data around.
>
> my approach is this:
>
> after only doing "puts" to cassandra for a while i stop my client and want
> to perform the proper "cleanup" and/or "compact" operations that will reduce
> the disk space used to a minimum.  however i can't seem to figure it out.
>  i've done "major compaction", "cleanup", etc. but doesn't seem to get the
> job done
>
> so two questions
>
> - what procedure is suggested to get rid of all unnecessary data?
> - and what does the following "Compacted" file mean?  seems like it is
> marking "88" as compacted, but there are no more compactions happening
> according to compaction mgr
>
> -rw-rw-r-- 1 bburruss bburruss          0 Apr 20 08:32 bucket-88-Compacted
> -rw-rw-r-- 1 bburruss bburruss 1445218042 Apr 19 21:39 bucket-88-Data.db
> -rw-rw-r-- 1 bburruss bburruss   12255925 Apr 19 21:39 bucket-88-Filter.db
> -rw-rw-r-- 1 bburruss bburruss  451806386 Apr 19 21:39 bucket-88-Index.db
>
>


Re: cleaning house

2010-04-20 Thread Jonathan Ellis
Added to http://wiki.apache.org/cassandra/MemtableSSTable:

SSTables that are obsoleted by a compaction are deleted asynchronously
when the JVM performs a GC.  You can force a GC from jconsole, but this
is not usually necessary; Cassandra will force one itself
if it detects that it is low on space.  A compaction marker is also
added to obsolete sstables so they can be deleted on startup if the
server does not perform a GC before being restarted.

CFStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only
includes size of non-obsolete files) and getTotalDiskSpaceUsed
(includes everything).


On Tue, Apr 20, 2010 at 12:33 PM, B. Todd Burruss  wrote:
> i'm trying to draw some correlation between the size of my data and the
> space used on disk.  i have set <GCGraceSeconds> to 1 so there
> isn't any reason to keep data around.
>
> my approach is this:
>
> after only doing "puts" to cassandra for a while i stop my client and want
> to perform the proper "cleanup" and/or "compact" operations that will reduce
> the disk space used to a minimum.  however i can't seem to figure it out.
>  i've done "major compaction", "cleanup", etc. but it doesn't seem to get the
> job done
>
> so two questions
>
> - what procedure is suggested to get rid of all unnecessary data?
> - and what does the following "Compacted" file mean?  seems like it is
> marking "88" as compacted, but there are no more compactions happening
> according to compaction mgr
>
> -rw-rw-r-- 1 bburruss bburruss          0 Apr 20 08:32 bucket-88-Compacted
> -rw-rw-r-- 1 bburruss bburruss 1445218042 Apr 19 21:39 bucket-88-Data.db
> -rw-rw-r-- 1 bburruss bburruss   12255925 Apr 19 21:39 bucket-88-Filter.db
> -rw-rw-r-- 1 bburruss bburruss  451806386 Apr 19 21:39 bucket-88-Index.db
>
>


Re: cleaning house

2010-04-20 Thread B. Todd Burruss
i have done no deletes, just inserts.  so you are correct, there isn't 
any "data" to clean up.  however when i run some of the cleanup and/or 
compaction tasks the space used on disk actually grows, and i would like 
to force any unneeded files to be removed.  as i write this, jonathan 
has responded with, i believe, what i need.


thx!

Benjamin Black wrote:

Are you deleting data through the API or just doing a bunch of inserts
and then running a compaction?  The latter will not result in anything
to clean up since data must be explicitly deleted.


b

On Tue, Apr 20, 2010 at 10:33 AM, B. Todd Burruss  wrote:
  

i'm trying to draw some correlation between the size of my data and the
space used on disk.  i have set <GCGraceSeconds> to 1 so there
isn't any reason to keep data around.

my approach is this:

after only doing "puts" to cassandra for a while i stop my client and want
to perform the proper "cleanup" and/or "compact" operations that will reduce
the disk space used to a minimum.  however i can't seem to figure it out.
 i've done "major compaction", "cleanup", etc. but doesn't seem to get the
job done

so two questions

- what procedure is suggested to get rid of all unnecessary data?
- and what does the following "Compacted" file mean?  seems like it is
marking "88" as compacted, but there are no more compactions happening
according to compaction mgr

-rw-rw-r-- 1 bburruss bburruss  0 Apr 20 08:32 bucket-88-Compacted
-rw-rw-r-- 1 bburruss bburruss 1445218042 Apr 19 21:39 bucket-88-Data.db
-rw-rw-r-- 1 bburruss bburruss   12255925 Apr 19 21:39 bucket-88-Filter.db
-rw-rw-r-- 1 bburruss bburruss  451806386 Apr 19 21:39 bucket-88-Index.db





Re: cleaning house

2010-04-20 Thread B. Todd Burruss

thx, that did the trick.

Jonathan Ellis wrote:

Added to http://wiki.apache.org/cassandra/MemtableSSTable:

SSTables that are obsoleted by a compaction are deleted asynchronously
when the JVM performs a GC.  You can force a GC from jconsole, but this
is not usually necessary; Cassandra will force one itself
if it detects that it is low on space.  A compaction marker is also
added to obsolete sstables so they can be deleted on startup if the
server does not perform a GC before being restarted.

CFStoreMBean exposes sstable space used as getLiveDiskSpaceUsed (only
includes size of non-obsolete files) and getTotalDiskSpaceUsed
(includes everything).


On Tue, Apr 20, 2010 at 12:33 PM, B. Todd Burruss  wrote:
  

i'm trying to draw some correlation between the size of my data and the
space used on disk.  i have set <GCGraceSeconds> to 1 so there
isn't any reason to keep data around.

my approach is this:

after only doing "puts" to cassandra for a while i stop my client and want
to perform the proper "cleanup" and/or "compact" operations that will reduce
the disk space used to a minimum.  however i can't seem to figure it out.
 i've done "major compaction", "cleanup", etc. but doesn't seem to get the
job done

so two questions

- what procedure is suggested to get rid of all unnecessary data?
- and what does the following "Compacted" file mean?  seems like it is
marking "88" as compacted, but there are no more compactions happening
according to compaction mgr

-rw-rw-r-- 1 bburruss bburruss  0 Apr 20 08:32 bucket-88-Compacted
-rw-rw-r-- 1 bburruss bburruss 1445218042 Apr 19 21:39 bucket-88-Data.db
-rw-rw-r-- 1 bburruss bburruss   12255925 Apr 19 21:39 bucket-88-Filter.db
-rw-rw-r-- 1 bburruss bburruss  451806386 Apr 19 21:39 bucket-88-Index.db





Re: How to increase cassandra's performance in read?

2010-04-20 Thread Benjamin Black
I can't answer for its sanity, but I would not do it that way.  I'd
have a CF for Emails, with 1 email per row, and another CF for
UserEmails with per-user index rows referencing the Emails rows.


b
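
(A sketch of that layout against the 0.6 Thrift API; the CF names, keys and
the column-name-as-email-id convention are illustrative, not prescriptive:)

    long ts = System.currentTimeMillis();

    // Emails CF: one email per row, one column per field.
    ColumnPath subject = new ColumnPath("Emails");
    subject.setColumn("subject".getBytes());
    client.insert("Keyspace1", "email-12345", subject,
        "Re: Filters".getBytes(), ts, ConsistencyLevel.QUORUM);

    // UserEmails CF: one index row per user; each column name is an
    // Emails row key, so one slice of this row lists the user's inbox.
    ColumnPath entry = new ColumnPath("UserEmails");
    entry.setColumn("email-12345".getBytes());
    client.insert("Keyspace1", "user-jsmith", entry,
        new byte[0], ts, ConsistencyLevel.QUORUM);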

On Tue, Apr 20, 2010 at 9:44 AM, Mark Jones  wrote:
> To make sure I'm clear on what you are saying:
>
>  Are the "Individual Emails" in the example below, Supercolumns and the 
> {body, header, tags...} the subcolumns?
>
> Is that a sane data layout for an email system?  Where the Supercolumn 
> identifier is the "conversation label"
>
> Sorry to be so daft, but the way columns and rows are bandied about in NoSQL 
> is a bit confusing when you are coming from a SQL background.  I can't see 
> why you would want multiple emails in the same row since they each have the 
> same "columns" of information and therefore make good logical entities as 
> outlined below.
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Tuesday, April 20, 2010 11:16 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> Not all the data associated w/ the key is brought into memory, just
> all the data associated w/ the supercolumns being queried.
>
> Supercolumns are so you can update a smallish number of subcolumns
> independently (e.g. when denormalizing an entire narrow row, usually
> with a finite set of columns).  If you want lots of subcolumns you
> need to turn that supercolumn into a new row.
>
> On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones  wrote:
>> When I first read this, it bothered me because it seemed like it couldn't be 
>> so.  So I read the link, and it says the whole thing, so I have to ask for 
>> some clarification here.
>>
>> I had always assumed a super column was similar to a local keyspace, and 
>> that the SubColumns under it were similar to keys, that way you could 
>> localize the data for a user or a website.
>>
>> So Keyspace:Email
>>  Key:UserID
>>     SuperColumn Entries:
>>        Individual Email 1:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>        Individual Email 2:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>        Individual Email 3:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>
>> I think now this is probably the wrong concept.
>>
>> It is really more like:
>>        Primary Key: Name:Value pairs
>>
>> And with Supercolumns, the Value part can be another Hash:
>>        Primary Key: Name: {Name:Value pairs} pairs
>>
>> But when I look up by Primary Key, ALL of the data associated with the key 
>> will be brought into memory!  So, if I wanted to display the inbox of a 
>> user with several years of email, it would be one HUGE read to suck his 
>> entire inbox into memory to get down to the point I could display one 
>> message.
>>
>> Is this more correct?
>>
>> -Original Message-
>> From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> Sent: Tuesday, April 20, 2010 10:47 AM
>> To: user@cassandra.apache.org
>> Subject: Re: How to increase cassandra's performance in read?
>>
>> How many columns are in the supercolumn total?
>>
>> "in super columnfamilies there is a third level of subcolumns; these
>> are not indexed, and any request for a subcolumn deserializes _all_
>> the subcolumns in that supercolumn"
>>
>> http://wiki.apache.org/cassandra/CassandraLimitations
>>
>> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
>>> I too am seeing very slow performance while testing worst case scenarios of
>>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>>
>>>
>>>
>>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>>
>>>
>>>
>>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>>> (With NO swapping)  So far, I've found nothing that helps, including
>>> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
>>> prevents better cache performance.
>>>
>>>
>>>
>>> Read performance is definitely not 3 IOs based on the utilization factors on
>>> my drives.  I'm not sure the issue was ever settled in the previous e-mails
>>> as to how to calculate how many IOs were being done for each read.  I've
>>> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
>>> with multiple machines, is lower performance in a cluster than alone.  I
>>> keep assuming that at some number of nodes, the performance will begin to
>>> pick up.  Three of my nodes are running with 8GB (6GB Java Heap), and one
>>> has 4GB (3GB Java Heap).  The machine with the smallest memory footprint is
>>> the fastest performer on inserts, but definitely not the fastest on reads.
>>>
>>>
>>>
>>> I'm suspecting the read path is relying heavily on the fact that you want to
>>> get many columns that are closely related, because lookup by key appears to
>>> be incredibly slow.
>>>
>>>
>>>
>>> From: yangfeng [mailto:yea...@gmail.com]
>>> Sent: Tuesday, April 20, 2010 7:59 AM
>>> To: user@cassandra.apache.org; d...@cassandra.a

Re: get_range_slices in hector

2010-04-20 Thread Ran Tavory
We haven't gotten around to implementing this yet, and so far no one has
needed it badly enough to write it.
We accept contributions or forks and we use github, so feel free to DIY
(forks are preferable). http://github.com/rantav/hector

On Tue, Apr 20, 2010 at 3:25 AM, Chris Dean  wrote:

> Ok, thanks.
>
> Cheers,
> Chris Dean
>
> Nathan McCall  writes:
> > Not yet. If you wanted to provide a patch that would be much
> > appreciated. A fork and pull request would be best logistically, but
> > whatever works.
> >
> > -Nate
> >
> > On Mon, Apr 19, 2010 at 5:10 PM, Chris Dean  wrote:
> >> Is there a version of hector that has an interface to get_range_slices ?
> >> or should I provide a patch?
> >>
> >> Cheers,
> >> Chris Dean
> >>
>


RE: How to increase cassandra's performance in read?

2010-04-20 Thread Mark Jones
When I look at this arrangement, I see one lookup by key for the user, followed 
by a large read for all the "email indexes"  (these are all columns in the same 
row, right?)

Then one lookup by key for each email.  Seems very seek intensive.


Would a better way be to index each email with a key of

UserID:ConvoID:Time

And then use the Order Preserving Partitioner?

That way I could at least use a get_range and the inbox is clustered together, 
which should greatly shorten the amount of time spent seeking for keys.

However if I rolled all the inbox details into each column  
(subject/date/sender/flags), I would only have to seek when I want to display 
the entire message.

Hmmm, definitely presents a different way to think of things.

Ok, so if I do it this way, the # of keys rapidly goes into the billions; does 
that not cause other problems?  Seems like many more data/index files.
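
(Sketching that idea, hedged: it assumes OrderPreservingPartitioner, 0.6's
string keys, and the UserID:ConvoID:Time scheme above; the '~' end key is a
crude ASCII upper bound and all names here are hypothetical:)

    // All keys for user42 form one contiguous range under OPP.
    KeyRange inbox = new KeyRange();
    inbox.setCount(100);
    inbox.setStart_key("user42:");
    inbox.setEnd_key("user42:~");   // '~' (0x7E) sorts after digits and ':'

    // Pull only the summary columns rolled into each row.
    SlicePredicate summary = new SlicePredicate();
    summary.setColumn_names(Arrays.asList(
        "subject".getBytes(), "date".getBytes(),
        "sender".getBytes(), "flags".getBytes()));

    List<KeySlice> rows = client.get_range_slices("Keyspace1",
        new ColumnParent("Emails"), summary, inbox,
        ConsistencyLevel.QUORUM);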


-Original Message-
From: Benjamin Black [mailto:b...@b3k.us]
Sent: Tuesday, April 20, 2010 1:00 PM
To: user@cassandra.apache.org
Subject: Re: How to increase cassandra's performance in read?

I can't answer for its sanity, but I would not do it that way.  I'd
have a CF for Emails, with 1 email per row, and another CF for
UserEmails with per-user index rows referencing the Emails rows.


b

On Tue, Apr 20, 2010 at 9:44 AM, Mark Jones  wrote:
> To make sure I'm clear on what you are saying:
>
>  Are the "Individual Emails" in the example below, Supercolumns and the 
> {body, header, tags...} the subcolumns?
>
> Is that a sane data layout for an email system?  Where the Supercolumn 
> identifier is the "conversation label"
>
> Sorry to be so daft, but the way columns and rows are bandied about in NoSQL 
> is a bit confusing when you are coming from a SQL background.  I can't see 
> why you would want multiple emails in the same row since they each have the 
> same "columns" of information and therefore make good logical entities as 
> outlined below.
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Tuesday, April 20, 2010 11:16 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> Not all the data associated w/ the key is brought into memory, just
> all the data associated w/ the supercolumns being queried.
>
> Supercolumns are so you can update a smallish number of subcolumns
> independently (e.g. when denormalizing an entire narrow row, usually
> with a finite set of columns).  If you want lots of subcolumns you
> need to turn that supercolumn into a new row.
>
> On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones  wrote:
>> When I first read this, it bothered me because it seemed like it couldn't be 
>> so.  So I read the link, and it says the whole thing, so I have to ask for 
>> some clarification here.
>>
>> I had always assumed a super column was similar to a local keyspace, and 
>> that the SubColumns under it were similar to keys, that way you could 
>> localize the data for a user or a website.
>>
>> So Keyspace:Email
>>  Key:UserID
>>     SuperColumn Entries:
>>        Individual Email 1:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>        Individual Email 2:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>        Individual Email 3:  Columns {body, header, tags, recipients, flags, 
>> whatever}
>>
>> I think now this is probably the wrong concept.
>>
>> It is really more like:
>>Primary Key: Name:Value pairs
>>
>> And with Supercolumns, the Value part can be another Hash:
>>Primary Key: Name: {Name:Value pairs} pairs
>>
>> But when I look up by Primary Key, ALL of the data associated with the key 
>> will be brought into memory!  So, if I wanted to display the inbox of a 
>> user with several years of email, it would be one HUGE read to suck his 
>> entire inbox into memory to get down to the point I could display one 
>> message.
>>
>> Is this more correct?
>>
>> -Original Message-
>> From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> Sent: Tuesday, April 20, 2010 10:47 AM
>> To: user@cassandra.apache.org
>> Subject: Re: How to increase cassandra's performance in read?
>>
>> How many columns are in the supercolumn total?
>>
>> "in super columnfamilies there is a third level of subcolumns; these
>> are not indexed, and any request for a subcolumn deserializes _all_
>> the subcolumns in that supercolumn"
>>
>> http://wiki.apache.org/cassandra/CassandraLimitations
>>
>> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones  wrote:
>>> I too am seeing very slow performance while testing worst case scenarios of
>>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>>
>>>
>>>
>>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>>
>>>
>>>
>>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>>> (With NO swapping)  So far, I've found nothing that helps, including
>>> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
>>> 

Re: Re: Modelling assets and user permissions

2010-04-20 Thread Vick Khera
On Tue, Apr 20, 2010 at 1:37 PM, tsuraan  wrote:
> The assets are binary files on a document tracking system.  Our
> current platform is postgres-backed; the entire system we've written
> is fairly easily distributed across multiple computers, but postgres
> isn't.  There are reliable databases that do scale out, but they tend
> to be a little on the pricey side...  Our current system works well in
> the tens to hundreds of millions of documents with hundreds of users,
> but we're hitting the billions of documents with thousands of users,
> so cassandra's scaling properties are pretty appealing there.

It seems to me you might get by with putting the actual assets into
cassandra (possibly breaking them up into chunks depending on how big
they are) and storing the pointers to them in Postgres along with all
the other metadata.  If it were me, I'd split each file into a fixed
chunksize and store it using its SHA1 checksum, and keep an ordered
list of chunks that make up a file, then never delete a chunk.  Given
billions of documents you just may end up with some savings due to
file chunks that are identical.
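
(A quick sketch of the chunking idea; the 64 KB chunk size and the storage
call are made up, but MessageDigest is standard JDK. Identical chunks hash to
the same key, which is where the dedup savings come from:)

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;

    public class Chunker {
        static final int CHUNK_SIZE = 64 * 1024;  // made-up chunk size

        // Returns the ordered list of SHA-1 chunk keys for a file; each
        // chunk would be written under its digest (and never deleted),
        // with the key list stored as the file's metadata.
        static List<String> chunkKeys(String path)
                throws IOException, NoSuchAlgorithmException {
            List<String> keys = new ArrayList<String>();
            FileInputStream in = new FileInputStream(path);
            try {
                byte[] buf = new byte[CHUNK_SIZE];
                int n;
                while ((n = fill(in, buf)) > 0) {
                    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                    sha1.update(buf, 0, n);
                    keys.add(toHex(sha1.digest()));
                    // store(digest, buf[0..n)) -> chunk write goes here
                }
            } finally {
                in.close();
            }
            return keys;
        }

        // Read until the buffer is full or the stream ends.
        static int fill(FileInputStream in, byte[] buf) throws IOException {
            int n = 0, r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0)
                n += r;
            return n;
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }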

You could partition the postgres tables and replicate the data to a
handful of read-only nodes that could handle quite a bit of the work.
I suppose it depends on your write-frequency how that might pan out as
a scalability option.


Re: Filters

2010-04-20 Thread Christian Torres
So the suggestion would be to create a column family with the values or
states as row keys, and save the matches as columns?
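
(That appears to be the idea; a minimal sketch of such a manual index against
the 0.6 Thrift API, with the "StateIndex" CF and all names invented:)

    // One index row per state value; each column name is a matching row key.
    ColumnPath idx = new ColumnPath("StateIndex");
    idx.setColumn("person-row-42".getBytes());

    // The row key is the state itself, e.g. "Private" or "Public".
    client.insert("Keyspace1", "Private", idx,
        new byte[0], System.currentTimeMillis(), ConsistencyLevel.QUORUM);

    // "WHERE state = 'Private'" then becomes: slice the column names
    // of the "Private" row and fetch those rows.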

On Tue, Apr 20, 2010 at 11:27 AM, Roger Schildmeijer  wrote:

> My bad. Missed your one-to-one relationship (row key <-> column).
>
> On 20 apr 2010, at 19.24, Christian Torres wrote:
>
> Mmmm...
>
> According to this doc http://wiki.apache.org/cassandra/API#get_slice that a 
> developer mailed to me, it's possible!!
>
> I sent it to you as a reference.
>
> On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones  wrote:
>
>>  You will have to pull the columns and filter yourself.
>>
>>
>> *From:* Christian Torres [mailto:chtor...@gmail.com]
>> *Sent:* Tuesday, April 20, 2010 11:50 AM
>> *To:* user@cassandra.apache.org
>> *Cc:* d...@cassandra.apache.org
>> *Subject:* Filters
>>
>>
>> Hello!
>>
>> Is there any way to make filters (WHEREs) in cassandra? Or do I have to 
>> manage it myself?
>>
>> For example:
>>
>> I have a ColumnFamily with a column in each row whose value is a state...
>> Public or Private, so I want to filter all the rows that are private, and
>> likewise all the public ones... Besides, in those rows I will have names of
>> persons and I'll need to filter by initials or complete last names, etc.
>>
>> *So any idea?*
>>
>> Regards
>>
>> --
>> Christian Torres * Desarrollador Web * Guegue.com *
>> Celular: +505 84 65 92 62 * Loving of the Programming
>>
>
>
>
> --
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
>
>
>


-- 
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: Filters

2010-04-20 Thread Christian Torres
And the key would be the matched state or value; am I getting it right?

On Tue, Apr 20, 2010 at 2:46 PM, Christian Torres wrote:

> So the sugestion would be create a column family with the values or states
> and with columns save the matches?
>
>
> On Tue, Apr 20, 2010 at 11:27 AM, Roger Schildmeijer <
> schildmei...@gmail.com> wrote:
>
>> My bad. Missed your one-to-one relationship (row key <-> column).
>>
>> On 20 apr 2010, at 19.24, Christian Torres wrote:
>>
>> Mmmm...
>>
>> According to this doc http://wiki.apache.org/cassandra/API#get_slice that a 
>> developer mailed to me, it's possible!!
>>
>> I sent it to you as a reference.
>>
>> On Tue, Apr 20, 2010 at 11:17 AM, Mark Jones wrote:
>>
>>>  You will have to pull the columns and filter yourself.
>>>
>>>
>>> *From:* Christian Torres [mailto:chtor...@gmail.com]
>>> *Sent:* Tuesday, April 20, 2010 11:50 AM
>>> *To:* user@cassandra.apache.org
>>> *Cc:* d...@cassandra.apache.org
>>> *Subject:* Filters
>>>
>>>
>>> Hello!
>>>
>>> Is there any way to make filters (WHEREs) in cassandra? Or do I have to 
>>> manage it myself?
>>>
>>> For example:
>>>
>>> I have a ColumnFamily with a column in each row whose value is a state...
>>> Public or Private, so I want to filter all the rows that are private, and
>>> likewise all the public ones... Besides, in those rows I will have names of
>>> persons and I'll need to filter by initials or complete last names, etc.
>>>
>>> *So any idea?*
>>>
>>> Regards
>>>
>>> --
>>> Christian Torres * Desarrollador Web * Guegue.com *
>>> Celular: +505 84 65 92 62 * Loving of the Programming
>>>
>>
>>
>>
>> --
>> Christian Torres * Desarrollador Web * Guegue.com *
>> Celular: +505 84 65 92 62 * Loving of the Programming
>>
>>
>>
>
>
> --
> Christian Torres * Desarrollador Web * Guegue.com *
> Celular: +505 84 65 92 62 * Loving of the Programming
>



-- 
Christian Torres * Desarrollador Web * Guegue.com *
Celular: +505 84 65 92 62 * Loving of the Programming


Re: Re: Modelling assets and user permissions

2010-04-20 Thread tsuraan
> It seems to me you might get by with putting the actual assets into
> cassandra (possibly breaking them up into chunks depending on how big
> they are) and storing the pointers to them in Postgres along with all
> the other metadata.  If it were me, I'd split each file into a fixed
> chunksize and store it using its SHA1 checksum, and keep an ordered
> list of chunks that make up a file, then never delete a chunk.  Given
> billions of documents you just may end up with some savings due to
> file chunks that are identical.

The retrieval of documents is pretty key (people like getting their
files), so we store them on disk and use our http server's static file
serving to send them out.  I'm not sure what the best way to serve
files stored in cassandra would be, but the free replication offered
is interesting.  Is cassandra a sane way to store huge amounts (many
TB) of raw data?  I saw in the limitations page that people are using
cassandra to store files, but is it considered a good idea?

> You could partition the postgres tables and replicate the data to a
> handful of read-only nodes that could handle quite a bit of the work.
> I suppose it depends on your write-frequency how that might pan out as
> a scalability option.

Our system is pretty write-heavy; we currently do a bit under a
million files a day (which translates to about 5x that number of db records
stored), but we're going for a few million per day.

Here's a quick question that should be answerable:  If I have a CF
with SuperColumns where one of the SuperColumns holds, as subcolumn names,
the users allowed to see an asset, is it guaranteed to be safe to add keys
to that SuperColumn?  I noticed that each column has its own
timestamp, so it doesn't look like I actually need to write a full row
(which would introduce overwriting race-condition concerns).  It looks
like I can just use batch_mutate to add the keys that I want to the
permissions SuperColumn.  Is that correct, and would that avoid races?
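
(Not an authoritative answer, but a sketch of the call in question against
the 0.6 Thrift API; "Assets", "permissions" and the keys are hypothetical. A
Mutation carrying a SuperColumn with a single subcolumn touches only that
subcolumn; nothing else in the row is rewritten:)

    // Subcolumn name = the user being granted access; value unused here.
    Column newUser = new Column("alice".getBytes(), new byte[0],
        System.currentTimeMillis());
    SuperColumn sc = new SuperColumn("permissions".getBytes(),
        Arrays.asList(newUser));   // ONLY the subcolumn being added

    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
    cosc.setSuper_column(sc);
    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(cosc);

    Map<String, Map<String, List<Mutation>>> mutations =
        new HashMap<String, Map<String, List<Mutation>>>();
    mutations.put("asset-row-key",
        Collections.singletonMap("Assets", Arrays.asList(m)));

    client.batch_mutate("Keyspace1", mutations, ConsistencyLevel.QUORUM);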


Using get_range_slices

2010-04-20 Thread Chris Dean
I'd like to use get_range_slices to pull all the keys from a small CF
with 10,000 keys.  I'd also like to get them in chunks of 100 at a time.
Is there a way to do that?

I thought I could set start_token and end_token in KeyRange, but I can't
figure out what the intial start_token should be.

Cheers,
Chris Dean


Re: Using get_range_slices

2010-04-20 Thread Jonathan Ellis
you should use keys, not tokens.  start with empty string.
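
(A hedged sketch of that against the 0.6 Thrift API: page through the CF 100
keys at a time, restarting each page from the last key seen; the last row of
one page comes back as the first row of the next, so it is skipped. CF and
keyspace names are placeholders:)

    SlicePredicate pred = new SlicePredicate();
    pred.setSlice_range(
        new SliceRange(new byte[0], new byte[0], false, 0));  // keys only

    KeyRange range = new KeyRange();
    range.setCount(100);
    range.setStart_key("");   // empty string = start from the beginning
    range.setEnd_key("");

    String lastKey = null;
    List<KeySlice> page;
    do {
        page = client.get_range_slices("Keyspace1",
            new ColumnParent("Standard2"), pred, range,
            ConsistencyLevel.QUORUM);
        for (KeySlice ks : page) {
            if (ks.getKey().equals(lastKey)) continue;  // page overlap
            // ... process ks.getKey() ...
            lastKey = ks.getKey();
        }
        range.setStart_key(lastKey);  // next page begins at the last key
    } while (page.size() >= 100);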

On Tue, Apr 20, 2010 at 5:12 PM, Chris Dean  wrote:
> I'd like to use get_range_slices to pull all the keys from a small CF
> with 10,000 keys.  I'd also like to get them in chunks of 100 at a time.
> Is there a way to do that?
>
> I thought I could set start_token and end_token in KeyRange, but I can't
> figure out what the intial start_token should be.
>
> Cheers,
> Chris Dean
>


Big Data Workshop 4/23 was Re: Cassandra Hackathon in SF @ Digg - 04/22 6:30pm

2010-04-20 Thread Joseph Boyle
Reminder - price goes up after tonight at http://bigdataworkshop.eventbrite.com

We now have enough people interested in a bus or van from SF to Mountain View 
to offer one. Check the interested box when you register and we will send you 
pickup point information.

We will have people from the Cassandra (including Stu Hood and Matt Pfeil) and 
other NoSQL communities, as well as people with broader Big Data interests, all 
available for discussion, and you can propose a session to learn about anything.

On Apr 2, 2010, at 8:22 AM, Eric Evans wrote:

> On Thu, 2010-03-25 at 15:13 -0700, Chris Goffinet wrote:
>> As promised, here is the official invite to register for the hackathon
>> in SF. The event starts at 6:30pm on April 22nd. 
>> 
>> 
>> http://cassandrahackathon.eventbrite.com/
> 
> It looks like there is also a workshop on Big Data at the Computer
> History Museum the day after the hackathon
> (http://bigdataworkshop.com/).
> 
> How many people are interested in attending this as well?
> 
> -- 
> Eric Evans
> eev...@rackspace.com
> 



Re: How to increase cassandra's performance in read?

2010-04-20 Thread Benjamin Black
On Tue, Apr 20, 2010 at 11:54 AM, Mark Jones  wrote:
> When I look at this arrangement, I see one lookup by key for the user, 
> followed by a large read for all the "email indexes"  (these are all columns 
> in the same row, right?)
>
> Then one lookup by key for each email  Seems very seek intensive.
>

Do you need to grab every single email every single time?  Seems to me
you only need the recent ones or a page full.  A single multiget would
do it, and the load is spread across the cluster.
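
(For concreteness, a hedged sketch of that multiget against the 0.6 Thrift
API; CF and key names are hypothetical:)

    // One page of email ids, e.g. read from the user's index row.
    List<String> pageOfIds = Arrays.asList("email-17", "email-18", "email-19");

    SlicePredicate all = new SlicePredicate();
    all.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 100));

    // One round trip for the whole page; the rows live on whichever
    // nodes own them, so the read load spreads across the cluster.
    Map<String, List<ColumnOrSuperColumn>> emails = client.multiget_slice(
        "Keyspace1", pageOfIds,
        new ColumnParent("Emails"), all,
        ConsistencyLevel.QUORUM);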

>...
>
>
> Ok, so if I do it this way, the # of keys rapidly goes into the billions, 
> does that not cause other problems?

Not generally.  Cassandra is built to handle enormous numbers of rows
efficiently.

>Seems like many more data/index files
>

Only if you aren't compacting for some reason.


b


TimeoutException when I put very large value

2010-04-20 Thread Jeff Zhang
Hi all,

When I insert a very large value, thrift will throw a TimeoutException,
even if I set the socket timeout to 10 minutes.  I believe the 10 minutes
is enough for inserting the large value and spreading the replica to other
machines, the ConsistencyLevel I choose is DCQUORUM. So is there any way I
can use to resolve this problem, what parameter I can use to tune the
program ?

Thanks

-- 
Best Regards

Jeff Zhang


Re: TimeoutException when I put very large value

2010-04-20 Thread Ryan King
what's your RPC timeout in storage-conf?

-ryan
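
(For reference, hedged from the 0.6 defaults: the element is RpcTimeoutInMillis
in storage-conf.xml; it bounds how long a node waits on replicas before
answering the client with a TimedOutException, so a large client-side socket
timeout alone does not help:)

    <!-- storage-conf.xml; value in milliseconds -->
    <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>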

On Tue, Apr 20, 2010 at 6:46 PM, Jeff Zhang  wrote:
> Hi all,
>
> When I insert a very large value, thrift will throw a TimeoutException,
> even if I set the socket timeout to 10 minutes.  I believe the 10 minutes
> is enough for inserting the large value and spreading the replica to other
> machines, the ConsistencyLevel I choose is DCQUORUM. So is there any way I
> can use to resolve this problem, what parameter I can use to tune the
> program ?
>
> Thanks
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: TimeoutException when I put very large value

2010-04-20 Thread acrd seek
Thanks Ryan, I also noticed this parameter in storage-conf just now. I am
going to increase this number to test whether it works.



2010/4/21 Ryan King 

> what's your RPC timeout in storage-conf?
>
> -ryan
>
> On Tue, Apr 20, 2010 at 6:46 PM, Jeff Zhang  wrote:
> > Hi all,
> >
> > When I insert a very large value, thrift will throw a TimeoutException,
> > even if I set the socket timeout to 10 minutes.  I believe the 10
> > minutes
> > is enough for inserting the large value and spreading the replica to
> other
> > machines, the ConsistencyLevel I choose is DCQUORUM. So is there any way
> I
> > can use to resolve this problem, what parameter I can use to tune the
> > program ?
> >
> > Thanks
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> >
>


Batch row deletion

2010-04-20 Thread Carlos Sanchez
All,

Is there or will there be a feature to batch delete rows? (KeyRange delete?)

Thanks

Carlos

This email message and any attachments are for the sole use of the intended 
recipients and may contain proprietary and/or confidential information which 
may be privileged or otherwise protected from disclosure. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not an 
intended recipient, please contact the sender by reply email and destroy the 
original message and any copies of the message as well as any attachments to 
the original message.


Re: Batch row deletion

2010-04-20 Thread Jonathan Ellis
This will be done in https://issues.apache.org/jira/browse/CASSANDRA-293

On Tue, Apr 20, 2010 at 10:45 PM, Carlos Sanchez
 wrote:
> All,
>
> Is there or will there be a feature to batch delete rows? (KeyRange delete?)
>
> Thanks
>
> Carlos


RE: Batch row deletion

2010-04-20 Thread Carlos Sanchez
Awesome thx..

Carlos

From: Jonathan Ellis [jbel...@gmail.com]
Sent: Tuesday, April 20, 2010 10:52 PM
To: user@cassandra.apache.org
Subject: Re: Batch row deletion

This will be done in https://issues.apache.org/jira/browse/CASSANDRA-293

On Tue, Apr 20, 2010 at 10:45 PM, Carlos Sanchez
 wrote:
> All,
>
> Is there or will there be a feature to batch delete rows? (KeyRange delete?)
>
> Thanks
>
> Carlos



new hector version and updates

2010-04-20 Thread Ran Tavory
A few recent changes made at hector:
1. We keep several branches in parallel: 0.5.0, 0.5.1, 0.6.0 and master.
We've now changed master to be at version 0.6.0. 0.6.1 is compatible with
0.6.0 as the API didn't change, so practically master is now at the latest
released cassandra version.
2. We added a batchMutate call to the API (thanks Nathan). Addition of
get_range_slices is still pending a volunteer that actually needs it ;)
http://github.com/rantav/hector/issues/#issue/22
3. Uploaded a new zip to the download section with all the up-to-date 0.6.0
work:
http://github.com/downloads/rantav/hector/hector-0.6.0-11.zip

Finally, if you use hector or are considering it, please subscribe to the
google group and post your questions there:
http://groups.google.com/group/hector-users