I like to base my batch sizes off of the total number of columns
instead of the number of rows. This effectively means counting the
number of Mutation objects in your mutation map and submitting the
batch once it reaches a certain size. For my data, batch sizes of
about 25,000 columns work best.
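
As a rough sketch of what counting columns per batch looks like in code
(not from the original thread; Mutation is left as a plain Object here,
and submitBatch() is a hypothetical stand-in for whatever batch_mutate
wrapper your client exposes):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnCountBatcher {
    private static final int MAX_COLUMNS_PER_BATCH = 25000;

    // Shaped like the argument to batch_mutate:
    // row key -> (column family -> list of mutations)
    private final Map<String, Map<String, List<Object>>> mutationMap = new HashMap<>();
    private int columnCount = 0;

    public void addMutation(String rowKey, String columnFamily, Object mutation) {
        mutationMap
            .computeIfAbsent(rowKey, k -> new HashMap<>())
            .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
            .add(mutation);
        columnCount++;
        if (columnCount >= MAX_COLUMNS_PER_BATCH) {
            flush();
        }
    }

    public void flush() {
        if (columnCount == 0) {
            return;
        }
        submitBatch(mutationMap); // hypothetical: wraps your client's batch_mutate call
        mutationMap.clear();
        columnCount = 0;
    }

    private void submitBatch(Map<String, Map<String, List<Object>>> batch) {
        // placeholder - send the accumulated batch with your client here
    }
}

The flush trigger is the number of columns queued across all rows, not
the number of rows.
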
On Tue, May 11, 2010 at 8:31 AM, David Boxenhorn wrote:
> Thanks a lot! 25,000 is a number I can work with.
>
> Any other suggestions?
>
> On Tue, May 11, 2010 at 3:21 PM, Ben Browning wrote:
>>
>> I like to base my batch sizes off of the total number of columns

Maxim,
Check out the getLocation() method from this file:
http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java
Basically, it loops over the list of nodes containing this split of
data and, if any of them is the local node, returns that node so the
task can read its split from local data.
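
In rough pseudo-Java, the idea is something like this (a simplified
sketch, not the actual getLocation() source; the fallback to the first
listed location is an illustrative assumption):

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.List;

public class SplitLocationChooser {

    // Pick a replica for this split, preferring one that is this machine,
    // so the task reads its data locally when possible.
    public static String chooseLocation(List<String> splitLocations) throws SocketException {
        for (String location : splitLocations) {
            if (isLocalAddress(location)) {
                return location;
            }
        }
        // No local replica: fall back to the first listed location (assumption).
        return splitLocations.get(0);
    }

    private static boolean isLocalAddress(String host) throws SocketException {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (addr.getHostAddress().equals(host) || addr.getHostName().equals(host)) {
                    return true;
                }
            }
        }
        return false;
    }
}
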
The keys will not be in any specific order when not using OPP, so you
will never "get out of range" - you have to iterate over every single
key to find all keys that start with "CATEGORY". If you don't iterate
over every single key, you risk missing some. Obviously, this kind of
key range scan gets expensive as your data grows.
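
A sketch of what that full scan looks like (fetchKeyBatch() is a
hypothetical wrapper around a get_range_slices-style paging call, not a
real client method):

import java.util.ArrayList;
import java.util.List;

public class PrefixKeyScan {

    // With RandomPartitioner rows come back in token (hash) order, not key order,
    // so finding every key that starts with a prefix means walking the whole
    // key space and filtering on the client side.
    public static List<String> findKeysWithPrefix(String prefix, int pageSize) {
        List<String> matches = new ArrayList<>();
        String startKey = "";                 // empty start key = beginning of the range
        while (true) {
            List<String> page = fetchKeyBatch(startKey, pageSize); // hypothetical paging call
            for (String key : page) {
                if (key.startsWith(prefix)) {
                    matches.add(key);
                }
            }
            if (page.size() < pageSize) {
                break;                        // reached the end of the key space
            }
            // The next page starts at the last key seen; real code should skip
            // the repeated first row of the next call.
            startKey = page.get(page.size() - 1);
        }
        return matches;
    }

    private static List<String> fetchKeyBatch(String startKey, int count) {
        // placeholder for a range-slice call returning up to `count` row keys
        return new ArrayList<>();
    }
}
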
Martin,
On Wed, Jun 2, 2010 at 8:34 AM, Dr. Martin Grabmüller
wrote:
> I think you can specify an end key, but it should be a key which does exist
> in your column family.
Logically, it doesn't make sense to ever specify an end key with
random partitioner. If you specified a start key of "aaa" and an end
key of "zzz", the rows returned wouldn't be the ones between them
lexically, because the random partitioner orders rows by the hash of
the key rather than by the key itself.
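
To see why, compare the tokens the random partitioner assigns: it
places rows by the MD5 hash of the key, so the lexical order of the
keys tells you nothing about their position on the ring. A small
illustrative sketch (not Cassandra code):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TokenOrderDemo {

    // RandomPartitioner derives a token from the MD5 hash of the key.
    static BigInteger token(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest); // non-negative integer, like a ring position
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println("token(\"aaa\") = " + token("aaa"));
        System.out.println("token(\"abc\") = " + token("abc"));
        System.out.println("token(\"zzz\") = " + token("zzz"));
        // The lexical order aaa < abc < zzz says nothing about the token order,
        // so a key range from "aaa" to "zzz" does not bound the rows returned.
    }
}
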
They exist because when using OPP they are useful and make sense.
On Wed, Jun 2, 2010 at 8:59 AM, David Boxenhorn wrote:
> So why do the "start" and "finish" range parameters exist?
>
> On Wed, Jun 2, 2010 at 3:53 PM, Ben Browning wrote:
>>
>> Martin,
I like to model this kind of data as columns, where the timestamps are
the column names (either longs, TimeUUIDs, or strings, depending on
your usage). If you have too much data for a single row, you'd need to
have multiple rows of these. For time-series data, it makes sense to
use one row per minute, hour, or day, depending on how much data you
have.
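
A minimal sketch of that layout, assuming hourly buckets and a
hypothetical insertColumn() wrapper in place of a real client call:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeSeriesWriter {

    // One row per sensor per hour; the column name is the event timestamp,
    // so a column slice over a row returns that hour's events in time order.
    public static void recordEvent(String sensorId, long timestampMillis, String payload) {
        SimpleDateFormat hourBucket = new SimpleDateFormat("yyyyMMddHH");
        hourBucket.setTimeZone(TimeZone.getTimeZone("UTC"));
        String rowKey = sensorId + ":" + hourBucket.format(new Date(timestampMillis));
        insertColumn(rowKey, timestampMillis, payload); // hypothetical wrapper around your client's insert
    }

    private static void insertColumn(String rowKey, long columnName, String value) {
        // placeholder - column name is the long timestamp, value is the event payload
    }
}
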
With a traffic pattern like that, you may be better off storing the
events of each burst (I'll call them groups) in one or more keys and
then storing these keys in the day key.

EventGroupsPerDay: {
    "20100601": {
        123456789: "group123", // column name is the timestamp the group was
                               // received, column value is the group's row key
        ...
    }
}
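
The write path for that structure might look roughly like this (a
sketch only; the column family names and the insertColumn() helper are
made up for illustration):

import java.util.List;
import java.util.UUID;

public class EventGroupWriter {

    // Write one burst of events as its own row, then point to it from the day row.
    public static void storeBurst(String day, long receivedAtMillis, List<String> events) {
        String groupKey = "group-" + UUID.randomUUID();

        // 1. Store the burst's events under the group row
        //    (column name = event index, value = event payload).
        for (int i = 0; i < events.size(); i++) {
            insertColumn("EventGroups", groupKey, i, events.get(i));
        }

        // 2. Record the group in the day row: column name = time the burst arrived,
        //    column value = the group's row key, as in the EventGroupsPerDay sketch above.
        insertColumn("EventGroupsPerDay", day, receivedAtMillis, groupKey);
    }

    private static void insertColumn(String columnFamily, String rowKey, long columnName, String value) {
        // placeholder for your client's insert call
    }
}
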
How many subcolumns are in each supercolumn and how large are the
values? Your example shows 8 subcolumns, but I didn't know if that was
the actual number. I've been able to read columns out of Cassandra at
an order of magnitude higher than what you're seeing here but there
are too many variables to guess at the cause without more details.
There really aren't "seed nodes" in a Cassandra cluster. When you
specify a seed in a node's configuration it's just a way to let it
know how to find the other nodes in the cluster. A node functions the
same whether it is another node's seed or not. In other words, all of
the nodes in a cluster are peers.