I have a question about this statement:

When rows get above a few tens of MB things can slow down, when they get
above 50 MB they can be a pain, when they get above 100 MB it's a warning
sign. And when they get above 1 GB, well, you don't want to know what
happens then.

I tested a data model that I created. Here's the schema for the table in
question:

CREATE TABLE bdn_index_pub (
    tree INT,
    pord INT,
    hpath VARCHAR,
    PRIMARY KEY (tree, pord)
);


As a test, I inserted 100 million records. tree had the same value for every
record, and I had 100 million values for pord. hpath averaged about 50
characters in length. My understanding is that all 100 million strings would
have been stored in a single row, since they all had the same value for the
first component of the primary key. I didn't look at the size of the table,
but it had to be several gigabytes (uncompressed). Contrary to what Aaron
says, I do want to know what happens, because I didn't experience any issues
with this table during my test. Inserting was fast; the last batch of records
inserted in approximately the same amount of time as the first batch.
Querying the table was fast. What I didn't do was test the table under load,
nor did I try this in a multi-node cluster.
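
For reference, each insert in the test looked roughly like this (the literal
values below are purely illustrative; the point is that tree was the same
constant for every one of the 100 million rows):

INSERT INTO bdn_index_pub (tree, pord, hpath)
VALUES (17, 42, '/some/example/path');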

If this is bad, can somebody suggest a better pattern? This table was
designed to support a query like this:

SELECT hpath FROM bdn_index_pub
WHERE tree = :tree AND pord >= :start AND pord <= :end;

In my application, most trees will have fewer than a million records. A
handful will have tens of millions, and one of them will have 100 million.

If I need to break up my rows, my first instinct would be to divide each
tree into blocks of, say, 10,000 rows and change tree to a string that
contains the tree and the block number. Something like this:

17:0,     0, '/'
…
17:0,  9999, '/a/b/c'
17:1, 10000, '/a/b/d'
…

I'd then need to issue an extra query for ranges that crossed block
boundaries.
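
In CQL, that bucketed layout might look something like the sketch below. To
be clear, this is just an illustration of the idea, not something I've
tested; the table name, the TEXT bucket key, and the block size of 10,000
are assumptions on my part:

CREATE TABLE bdn_index_pub_bucketed (
    tree_block TEXT,   -- tree id plus block number, e.g. '17:0', '17:1'
    pord INT,
    hpath VARCHAR,
    PRIMARY KEY (tree_block, pord)
);

-- A range that stays inside one block is still a single query:
SELECT hpath FROM bdn_index_pub_bucketed
WHERE tree_block = '17:0' AND pord >= 200 AND pord <= 800;

-- A range that crosses a block boundary becomes one query per block,
-- e.g. pord 9500..10500 spans blocks 0 and 1:
SELECT hpath FROM bdn_index_pub_bucketed
WHERE tree_block = '17:0' AND pord >= 9500 AND pord <= 9999;
SELECT hpath FROM bdn_index_pub_bucketed
WHERE tree_block = '17:1' AND pord >= 10000 AND pord <= 10500;

At roughly 10,000 rows of ~50-character paths per block, each partition
would stay far below the sizes Aaron warns about.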

Any suggestions on a better pattern?

Thanks

Robert

From:  Aaron Morton <aa...@thelastpickle.com>
Reply-To:  <user@cassandra.apache.org>
Date:  Tuesday, December 10, 2013 at 12:33 AM
To:  Cassandra User <user@cassandra.apache.org>
Subject:  Re: Exactly one wide row per node for a given CF?

>> But this becomes troublesome if I add or remove nodes. What I effectively
>> want is to partition on the unique id of the record modulo N (id % N, where
>> N is the number of nodes).
This is exactly the problem consistent hashing (used by Cassandra) is
designed to solve. If you hash the key and take it modulo the number of
nodes, adding or removing nodes requires a lot of data to move; consistent
hashing avoids that.

>> I want to be able to randomly distribute a large set of records but keep them
>> clustered in one wide row per node.
Sounds like you should revisit your data modelling; this is a pretty
well-known anti-pattern.

When rows get above a few tens of MB things can slow down, when they get
above 50 MB they can be a pain, when they get above 100 MB it's a warning
sign. And when they get above 1 GB, well, you don't want to know what
happens then.

It's a bad idea and you should take another look at the data model. If you
have to do it, you can try the ByteOrderedPartitioner, which uses the row
key as the token, giving you total control of the row placement.

Cheers


-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 4/12/2013, at 8:32 pm, Vivek Mishra <mishra.v...@gmail.com> wrote:

> So basically you want to create a cluster of multiple unique keys, but data
> which belongs to one unique key should be colocated, correct?
> 
> -Vivek
> 
> 
> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespend...@gmail.com>
> wrote:
>> Subject says it all. I want to be able to randomly distribute a large set of
>> records but keep them clustered in one wide row per node.
>> 
>> As an example, let's say I've got a collection of about 1 million records each
>> with a unique id. If I just go ahead and set the primary key (and therefore
>> the partition key) as the unique id, I'll get very good random distribution
>> across my server cluster. However, each record will be its own row. I'd like
>> to have each record belong to one large wide row (per server node) so I can
>> have them sorted or clustered on some other column.
>> 
>> If I have, say, 5 nodes in my cluster, I could randomly assign a value of 1-5
>> at the time of creation and have the partition key set to this value. But
>> this becomes troublesome if I add or remove nodes. What I effectively want is
>> to partition on the unique id of the record modulo N (id % N, where N is the
>> number of nodes).
>> 
>> I have to imagine there's a mechanism in Cassandra to simply randomize the
>> partitioning without even using a key (and then clustering on some column).
>> 
>> Thanks for any help.
> 


