I can't answer for its sanity, but I would not do it that way. I'd have a CF for Emails, with 1 email per row, and another CF for UserEmails with per-user index rows referencing the Emails rows.
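To make that concrete, here is a rough sketch of the two-CF layout using plain Python dicts as stand-ins for column families. This is not Cassandra client code: `store_email`, `inbox_page`, `read_email`, and the `(received_at, email_id)` column name are hypothetical, chosen only for illustration.

```python
# Hypothetical in-memory stand-ins for the two column families.
emails = {}       # Emails CF: row key = email id -> that email's columns
user_emails = {}  # UserEmails CF: row key = user id -> index columns


def store_email(user_id, email_id, received_at, columns):
    """Write one email as its own row, then add an index column for the user."""
    emails[email_id] = dict(columns)
    # Index column name sorts by time; the value points at the Emails row.
    index_row = user_emails.setdefault(user_id, {})
    index_row[(received_at, email_id)] = email_id


def inbox_page(user_id, limit):
    """Return the newest `limit` email ids by slicing only the index row."""
    index_row = user_emails.get(user_id, {})
    newest_first = sorted(index_row, reverse=True)[:limit]
    return [index_row[name] for name in newest_first]


def read_email(email_id):
    """Fetch a single email row; no other email is touched."""
    return emails[email_id]


store_email("mark", "e1", 1, {"header": "Hello", "body": "first"})
store_email("mark", "e2", 2, {"header": "Re: Hello", "body": "second"})
page = inbox_page("mark", 1)
```

Listing an inbox reads a slice of one small index row, and opening a message reads one Emails row, so no request has to pull in a user's whole mail history.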
On Tue, Apr 20, 2010 at 9:44 AM, Mark Jones <mjo...@imagehawk.com> wrote:
> To make sure I'm clear on what you are saying:
>
> Are the "Individual Emails" in the example below, Supercolumns and the
> {body, header, tags...} the subcolumns?
>
> Is that a sane data layout for an email system? Where the Supercolumn
> identifier is the "conversation label"
>
> Sorry to be so daft, but the way columns and rows are bandied about in NoSQL
> is a bit confusing when you are coming from a SQL background. I can't see
> why you would want multiple emails in the same row since they each have the
> same "columns" of information and therefore make good logical entities as
> outlined below.
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Tuesday, April 20, 2010 11:16 AM
> To: user@cassandra.apache.org
> Subject: Re: How to increase cassandra's performance in read?
>
> Not all the data associated w/ the key is brought into memory, just
> all the data associated w/ the supercolumns being queried.
>
> Supercolumns are so you can update a smallish number of subcolumns
> independently (e.g. when denormalizing an entire narrow row, usually
> with a finite set of columns). If you want lots of subcolumns you
> need to turn that supercolumn into a new row.
>
> On Tue, Apr 20, 2010 at 11:08 AM, Mark Jones <mjo...@imagehawk.com> wrote:
>> When I first read this, it bothered me because it seemed like it couldn't be
>> so. So I read the link, and it says the whole thing, so I have to ask for
>> some classification here.
>>
>> I had always assumed a super column was similar to a local keyspace, and
>> that the SubColumns under it were similar to keys, that way you could
>> localize the data for a user or a website.
>>
>> So Keyspace:Email
>> Key:UserID
>> SuperColumn Entries:
>> Individual Email 1: Columns {body, header, tags, recipients, flags,
>> whatever}
>> Individual Email 2: Columns {body, header, tags, recipients, flags,
>> whatever}
>> Individual Email 3: Columns {body, header, tags, recipients, flags,
>> whatever}
>>
>> I think now this is probably the wrong concept.
>>
>> It is really more like:
>> Primary Key: Name:Value pairs
>>
>> And with Supercolumns, the Value part can be another Hash:
>> Primary Key: Name: {Name:Value pairs} pairs
>>
>> But when I lookup by Primary Key, ALL of the data associated with the key
>> will be brought into memory! So, when if I wanted to display the inbox of a
>> user with several years of email, it would be one HUGE read to suck his
>> entire inbox into memory to get down to the point I could display one
>> message.
>>
>> Is this more correct?
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> Sent: Tuesday, April 20, 2010 10:47 AM
>> To: user@cassandra.apache.org
>> Subject: Re: How to increase cassandra's performance in read?
>>
>> How many columns are in the supercolumn total?
>>
>> "in super columnfamilies there is a third level of subcolumns; these
>> are not indexed, and any request for a subcolumn deserializes _all_
>> the subcolumns in that supercolumn"
>>
>> http://wiki.apache.org/cassandra/CassandraLimitations
>>
>> On Tue, Apr 20, 2010 at 9:50 AM, Mark Jones <mjo...@imagehawk.com> wrote:
>>> I too am seeing very slow performance while testing worst case scenarios of
>>> 1 key leading to 1 supercolumn and 1 column beyond that.
>>>
>>> Key -> SuperColumn -> 1 Column (of ~ 500 bytes)
>>>
>>> Drive utilization is 80-90% and I'm only dealing with 50-70 million rows.
>>> (With NO swapping) So far, I've found nothing that helps, including
>>> increasing the keycache FROM 200k-500k keys, I'm guessing the hashing
>>> prevents better cache performance.
>>>
>>> Read performance is definitely not 3 IOs based on the utilization factors on
>>> my drives. I'm not sure the issue was ever settled in the previous e-mails
>>> as to how to calculate how many IOs were being done for each read. I've
>>> been testing with clusters of 1,2,3 or 4 machines and so far all I'm seeing
>>> with multiple machines, is lower performance in a cluster than alone. I
>>> keep assuming that at some number of nodes, the performance will begin to
>>> pick up. Three of my nodes are running with 8GB (6GB Java Heap), and one
>>> has 4GB (3GB Java Heap). The machine with the smallest memory footprint is
>>> the fastest performer on inserts, but definitely not the fastest on reads.
>>>
>>> I'm suspecting the read path is relying heavily on the fact that you want to
>>> get many columns that are closely related, because lookup by key appears to
>>> be incredibly slow.
>>>
>>> From: yangfeng [mailto:yea...@gmail.com]
>>> Sent: Tuesday, April 20, 2010 7:59 AM
>>> To: user@cassandra.apache.org; d...@cassandra.apache.org
>>> Subject: How to increase cassandra's performance in read?
>>>
>>> I get 10 columns Family by keys and one columns Family has 30 columns.
>>>
>>> I use multigetSlice once to get 10 column Family.but the performance is so
>>> poor.
>>>
>>> anyone has other thought to increase the performance.