Where you’ll run into trouble is with compaction. It looks as if pord is a 
sequentially increasing value. Try your experiment again with a clustering key 
that is a bit more random at insertion time.


On Dec 10, 2013, at 5:41 AM, Robert Wille <rwi...@fold3.com> wrote:

> I have a question about this statement:
> 
> When rows get above a few 10’s of MB things can slow down, when they get 
> above 50 MB they can be a pain, when they get above 100MB it’s a warning 
> sign. And when they get above 1GB, well, you don’t want to know what 
> happens then. 
> 
> I tested a data model that I created. Here’s the schema for the table in 
> question:
> 
> CREATE TABLE bdn_index_pub (
>       tree INT,
>       pord INT,
>       hpath VARCHAR,
>       PRIMARY KEY (tree, pord)
> );
> 
> As a test, I inserted 100 million records. tree had the same value for every 
> record, and I had 100 million values for pord. hpath averaged about 50 
> characters in length. My understanding is that all 100 million strings would 
> have been stored in a single row, since they all had the same value for the 
> first component of the primary key. I didn’t look at the size of the table, 
> but it had to be several gigs (uncompressed). Contrary to what Aaron says, I 
> do want to know what happens, because I didn’t experience any issues with 
> this table during my test. Inserting was fast. The last batch of records 
> inserted in approximately the same amount of time as the first batch. 
> Querying the table was fast. What I didn’t do was test the table under load, 
> nor did I try this in a multi-node cluster.
> 
> If this is bad, can somebody suggest a better pattern? This table was 
> designed to support a query like this: select hpath from bdn_index_pub where 
> tree = :tree and pord >= :start and pord <= :end. In my application, most 
> trees will have less than a million records. A handful will have 10’s of 
> millions, and one of them will have 100 million.
> 
> If I need to break up my rows, my first instinct would be to divide each tree 
> into blocks of, say, 10,000 and change tree to a string that contains the 
> tree id and the block number. Something like this:
> 
> 17:0, 0, ‘/’
> …
> 17:0, 9999, ‘/a/b/c’
> 17:1, 10000, ‘/a/b/d’
> …
> 
> I’d then need to issue an extra query for ranges that crossed block 
> boundaries.
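The bucketing scheme above can be sketched in a few lines. This is a hypothetical illustration, not code from the thread: the names BLOCK_SIZE, bucket_key, and range_subqueries are my own, and the block size of 10,000 is the one Robert proposes.

```python
# Sketch of the "tree:block" bucketing idea: derive the composite partition
# key from pord, and split a range scan into one sub-query per block touched.
BLOCK_SIZE = 10_000

def bucket_key(tree: int, pord: int) -> str:
    """Composite partition key of the form '<tree>:<block number>'."""
    return f"{tree}:{pord // BLOCK_SIZE}"

def range_subqueries(tree: int, start: int, end: int):
    """Yield one (partition key, lo, hi) triple per block the range crosses.

    Each triple corresponds to one CQL query of the form:
      SELECT hpath FROM bdn_index_pub
      WHERE tree = :bucket AND pord >= :lo AND pord <= :hi
    """
    for block in range(start // BLOCK_SIZE, end // BLOCK_SIZE + 1):
        lo = max(start, block * BLOCK_SIZE)
        hi = min(end, block * BLOCK_SIZE + BLOCK_SIZE - 1)
        yield (f"{tree}:{block}", lo, hi)

# A range that crosses one block boundary becomes two sub-queries:
print(list(range_subqueries(17, 9_990, 10_005)))
```

The extra query Robert mentions falls out naturally: a range entirely inside one block yields a single triple, and each block boundary crossed adds one more.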
> 
> Any suggestions on a better pattern?
> 
> Thanks
> 
> Robert
> 
> From: Aaron Morton <aa...@thelastpickle.com>
> Reply-To: <user@cassandra.apache.org>
> Date: Tuesday, December 10, 2013 at 12:33 AM
> To: Cassandra User <user@cassandra.apache.org>
> Subject: Re: Exactly one wide row per node for a given CF?
> 
>>> But this becomes troublesome if I add or remove nodes. What I effectively 
>>> want is to partition on the unique id of the record modulo N (id % N, 
>>> where N is the number of nodes).
> This is exactly the problem that consistent hashing (used by Cassandra) is 
> designed to solve. If you instead hash the key and take it modulo the number 
> of nodes, adding or removing a node forces a large fraction of the data to 
> move. 
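The difference is easy to demonstrate. Below is a hypothetical sketch (the node names, vnode count, and helper names are illustrative, not Cassandra internals): it compares how many keys change owner when a sixth node joins, under naive modulo placement versus a minimal consistent-hash ring.

```python
# Compare key movement on cluster growth: naive "hash % N" placement vs. a
# consistent-hash ring with virtual nodes (the approach Cassandra's
# token ring is based on).
import bisect
import hashlib

def h(value: str) -> int:
    """Stable 64-bit hash so the demo is reproducible across runs."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def modulo_placement(keys, nodes):
    """Naive scheme: owner is nodes[hash(key) % N]."""
    return {k: nodes[h(k) % len(nodes)] for k in keys}

class Ring:
    """Minimal consistent-hash ring; vnodes smooth out the balance."""
    def __init__(self, nodes, vnodes=64):
        self.tokens = sorted(
            (h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.points = [t for t, _ in self.tokens]

    def owner(self, key):
        # A key belongs to the first token at or after its hash (wrapping).
        idx = bisect.bisect(self.points, h(key)) % len(self.tokens)
        return self.tokens[idx][1]

def moved_fraction(before, after, keys):
    return sum(before[k] != after[k] for k in keys) / len(keys)

keys = [f"record-{i}" for i in range(10_000)]
nodes = [f"node{i}" for i in range(5)]

# Naive modulo: going from 5 to 6 nodes remaps roughly 5/6 of all keys.
mod_moved = moved_fraction(
    modulo_placement(keys, nodes),
    modulo_placement(keys, nodes + ["node5"]),
    keys,
)

# Consistent hashing: only about 1/6 of the keys move, all to the new node.
ring_before, ring_after = Ring(nodes), Ring(nodes + ["node5"])
ring_moved = moved_fraction(
    {k: ring_before.owner(k) for k in keys},
    {k: ring_after.owner(k) for k in keys},
    keys,
)

print(f"modulo moved: {mod_moved:.0%}, ring moved: {ring_moved:.0%}")
```

With modulo placement a key only stays put when hash % 5 == hash % 6, which happens for about 1 key in 6; on the ring, only the keys landing in the new node's token ranges move at all.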
> 
>>> I want to be able to randomly distribute a large set of records but keep 
>>> them clustered in one wide row per node.
> Sounds like you should revisit your data modelling; this is a pretty 
> well-known anti-pattern. 
> 
> When rows get above a few 10’s of MB things can slow down, when they get 
> above 50 MB they can be a pain, when they get above 100MB it’s a warning 
> sign. And when they get above 1GB, well, you don’t want to know what 
> happens then. 
> 
> It’s a bad idea and you should take another look at the data model. If you 
> have to do it, you can try the ByteOrderedPartitioner, which uses the row key 
> as the token, giving you total control over row placement. 
> 
> Cheers
> 
> 
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 4/12/2013, at 8:32 pm, Vivek Mishra <mishra.v...@gmail.com> wrote:
> 
>> So basically you want to create a cluster of multiple unique keys, but data 
>> which belongs to one unique key should be colocated, correct?
>> 
>> -Vivek
>> 
>> 
>> On Tue, Dec 3, 2013 at 10:39 AM, onlinespending <onlinespend...@gmail.com> 
>> wrote:
>>> Subject says it all. I want to be able to randomly distribute a large set 
>>> of records but keep them clustered in one wide row per node.
>>> 
>>> As an example, let’s say I’ve got a collection of about 1 million records 
>>> each with a unique id. If I just go ahead and set the primary key (and 
>>> therefore the partition key) as the unique id, I’ll get very good random 
>>> distribution across my server cluster. However, each record will be its own 
>>> row. I’d like to have each record belong to one large wide row (per server 
>>> node) so I can have them sorted or clustered on some other column.
>>> 
>>> If I, say, have 5 nodes in my cluster, I could randomly assign a value of 
>>> 1–5 at the time of creation and have the partition key set to this value. 
>>> But this becomes troublesome if I add or remove nodes. What I effectively 
>>> want is to partition on the unique id of the record modulo N (id % N, 
>>> where N is the number of nodes).
>>> 
>>> I have to imagine there’s a mechanism in Cassandra to simply randomize the 
>>> partitioning without even using a key (and then clustering on some column).
>>> 
>>> Thanks for any help.
>> 
> 
