Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread Ben Browning
I like to base my batch sizes off of the total number of columns instead of the number of rows. This effectively means counting the number of Mutation objects in your mutation map and submitting the batch once it reaches a certain size. For my data, batch sizes of about 25,000 columns work best. Yo
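A minimal sketch of the column-count approach described above, in Python. The `ColumnCountBatcher` class, the `send` callback, and the shape of the mutation map are hypothetical stand-ins for illustration, not a real Cassandra client API. The 25,000 default is simply the figure quoted in the message and would need tuning for other data.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in types: a "mutation" is just a (column_name, value) pair,
# and the mutation map is row key -> column family -> list of mutations.
Mutation = Tuple[str, bytes]
MutationMap = Dict[str, Dict[str, List[Mutation]]]


class ColumnCountBatcher:
    """Collects mutations and submits them once a column-count threshold is hit."""

    def __init__(self, send: Callable[[MutationMap], None], max_columns: int = 25_000):
        self.send = send                # assumed callback that performs the batch write
        self.max_columns = max_columns  # threshold counted in columns, not rows
        self.pending: MutationMap = {}
        self.count = 0

    def add(self, row_key: str, column_family: str, mutation: Mutation) -> None:
        self.pending.setdefault(row_key, {}).setdefault(column_family, []).append(mutation)
        self.count += 1
        if self.count >= self.max_columns:
            self.flush()

    def flush(self) -> None:
        # Submit whatever is pending, then start a fresh map.
        if self.count:
            self.send(self.pending)
            self.pending = {}
            self.count = 0
```

Call `flush()` once more after the last `add()` so a final partial batch is not left unsent.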

Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread Ben Browning
t the bottleneck.

On Tue, May 11, 2010 at 8:31 AM, David Boxenhorn wrote:
> Thanks a lot! 25,000 is a number I can work with.
>
> Any other suggestions?
>
> On Tue, May 11, 2010 at 3:21 PM, Ben Browning wrote:
>>
>> I like to base my batch sizes off of the total num

Re: Hadoop over Cassandra

2010-05-18 Thread Ben Browning
Maxim, Check out the getLocation() method from this file: http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java Basically, it loops over the list of nodes containing this split of data and if any of them are the local node, it returns

Re: Range search on keys not working?

2010-06-02 Thread Ben Browning
The keys will not be in any specific order when not using OPP, so you will never "get out of range" - you have to iterate over every single key to find all keys that start with "CATEGORY". If you don't iterate over every single key, you risk missing some. Obviously, this kind of key rang
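To illustrate the point above, here is a minimal sketch in Python of a full scan that filters keys by prefix. Everything here is an assumption made for the example: `fetch_page` is a hypothetical paging call that returns keys in hash order starting at an inclusive start key, and `token` is a stand-in hash, not Cassandra's actual partitioner function.

```python
import hashlib
from typing import Callable, Iterator, List


def token(key: str) -> int:
    # Stand-in for a hash-based partitioner: key order becomes effectively random.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def make_fetch_page(keys: List[str]) -> Callable[[str, int], List[str]]:
    """Builds an in-memory stand-in for a paged range query over hashed keys."""
    ordered = sorted(keys, key=token)

    def fetch_page(start_key: str, count: int) -> List[str]:
        start = 0 if start_key == "" else ordered.index(start_key)
        return ordered[start:start + count]

    return fetch_page


def keys_with_prefix(fetch_page: Callable[[str, int], List[str]],
                     prefix: str, page_size: int = 100) -> Iterator[str]:
    """Walks every key, page by page, and yields those starting with prefix."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start = ""
    while True:
        page = fetch_page(start, page_size)
        # The start key is inclusive, so skip it on every page after the first.
        fresh = page if start == "" else page[1:]
        for key in fresh:
            if key.startswith(prefix):
                yield key
        if len(page) < page_size:
            return  # a short page means the scan reached the end
        start = page[-1]
```

Because matching keys are scattered through the hash order, the loop cannot stop early: it ends only when a short page signals that every key has been seen.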

Re: Range search on keys not working?

2010-06-02 Thread Ben Browning
Martin,

On Wed, Jun 2, 2010 at 8:34 AM, Dr. Martin Grabmüller wrote:
> I think you can specify an end key, but it should be a key which does exist
> in your column family.

Logically, it doesn't make sense to ever specify an end key with random partitioner. If you specified a start key of "aaa"

Re: Range search on keys not working?

2010-06-02 Thread Ben Browning
They exist because when using OPP they are useful and make sense.

On Wed, Jun 2, 2010 at 8:59 AM, David Boxenhorn wrote:
> So why do the "start" and "finish" range parameters exist?
>
> On Wed, Jun 2, 2010 at 3:53 PM, Ben Browning wrote:
>>
>> Martin,

Re: Giant sets of ordered data

2010-06-02 Thread Ben Browning
I like to model this kind of data as columns, where the timestamps are the column names (either longs, TimeUUIDs, or strings, depending on your usage). If you have too much data for a single row, you'd need to have multiple rows of these. For time-series data, it makes sense to use one row per minute/
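A minimal sketch of the bucketing idea above, in Python, using plain dictionaries in place of a column family. The per-minute granularity, the `yyyymmddHHMM` key format, and the names `minute_row_key` and `insert_event` are assumptions chosen for the example; the right bucket size depends on how much data arrives per interval.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict

# In-memory stand-in for a column family: row key -> {column name: value}.
rows: Dict[str, Dict[int, str]] = defaultdict(dict)


def minute_row_key(ts_millis: int) -> str:
    """Maps a millisecond timestamp to the key of its one-minute bucket row (UTC)."""
    dt = datetime.fromtimestamp(ts_millis // 1000, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M")


def insert_event(ts_millis: int, value: str) -> None:
    # The timestamp itself is the column name, so columns sort by time within a row.
    rows[minute_row_key(ts_millis)][ts_millis] = value
```

Reading a time range then means computing the bucket keys that cover the range and fetching those rows, rather than scanning keys.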

Re: Giant sets of ordered data

2010-06-02 Thread Ben Browning
With a traffic pattern like that, you may be better off storing the events of each burst (I'll call each burst a group) in one or more keys and then storing these keys in the day key.

EventGroupsPerDay: {
  "20100601": {
    123456789: "group123", // column name is timestamp group was received, column val
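The two-level layout described above can be sketched in Python with plain dictionaries. Only `EventGroupsPerDay` comes from the message; the second structure `events_by_group` and both helper functions are hypothetical names introduced for illustration, since the rest of the original layout is cut off.

```python
from collections import defaultdict
from typing import Dict, List

# day key -> {timestamp the group was received: group key}
event_groups_per_day: Dict[str, Dict[int, str]] = defaultdict(dict)
# Hypothetical second structure: group key -> {event timestamp: payload}
events_by_group: Dict[str, Dict[int, str]] = defaultdict(dict)


def store_group(day: str, received_ts: int, group_key: str, events: Dict[int, str]) -> None:
    """Writes a burst's events under its group key, then indexes the group by day."""
    events_by_group[group_key].update(events)
    event_groups_per_day[day][received_ts] = group_key


def events_for_day(day: str) -> List[str]:
    """Follows the day index to each group and returns payloads in time order."""
    out: List[str] = []
    groups = event_groups_per_day.get(day, {})
    for received_ts in sorted(groups):
        group = events_by_group[groups[received_ts]]
        out.extend(group[ts] for ts in sorted(group))
    return out
```

The day row stays small because it holds one column per burst rather than one per event.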

Re: Are 6..8 seconds to read 23.000 small rows - as it should be?

2010-06-04 Thread Ben Browning
How many subcolumns are in each supercolumn and how large are the values? Your example shows 8 subcolumns, but I didn't know if that was the actual number. I've been able to read columns out of Cassandra at a rate an order of magnitude higher than what you're seeing here, but there are too many variables t

Re: Re: Range search on keys not working?

2010-06-09 Thread Ben Browning
> So why do the "start" and "finish" range parameters exist?
>>
>> On Wed, Jun 2, 2010 at 3:53 PM, Ben Browning wrote:
>>>
>>> Martin,
>>>
>>> On Wed, Jun 2, 2010 at 8:34 AM, Dr. Martin Grabmüller
>>> wrote:
>>>

Re: Seeds and AutoBoostrap

2010-06-09 Thread Ben Browning
There really aren't "seed nodes" in a Cassandra cluster. When you specify a seed in a node's configuration it's just a way to let it know how to find the other nodes in the cluster. A node functions the same whether it is another node's seed or not. In other words, all of the nodes in a cluster are