In the Twissandra example (http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends), I see that they have split the materialized view of a user's homepage (his followers list, tweets from friends, and so on) into several column families instead of putting them in supercolumns inside a single super column family, which makes the rows skinnier. I was wondering which approach gives better read performance.

I think skinnier rows have the advantage of fewer row mutations, and therefore good read performance when only that data needs to be retrieved. Supercolumns for the follower list, etc. are also avoided, which sounds good because the supercolumn indexing limitation no longer hurts. But I am still not sure whether it would actually be beneficial, in terms of performance numbers, to split the materialized view of a single user into several column families instead of a single row in a single super column family.
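To make the comparison concrete, here is a minimal sketch of the two layouts. It is a toy model only: plain Python dicts stand in for rows, nothing talks to a real cluster, and the helper names (read_from_super, read_from_cf) and sample ids are hypothetical. It only counts how many columns each layout would have to load to return 10 post ids out of 1000, under the assumption that a supercolumn is always loaded whole while a standard column family can fetch named columns directly. It is not a benchmark.

```python
# Toy model only: dicts stand in for Cassandra rows; no cluster is involved.

# Layout A: one wide row per user in a single super column family.
# Supercolumn names come from the quoted schema; the values are made up.
super_cf = {
    "user1": {
        "MyFollowersList": {"user2": "", "user3": ""},
        "MyPostsIdKeysList": {f"post{i}": "" for i in range(1000)},
    }
}

# Layout B: one column family per list, each keyed by the same user id.
followers_cf = {"user1": {"user2": "", "user3": ""}}
posts_cf = {"user1": {f"post{i}": "" for i in range(1000)}}


def read_from_super(row_key, super_column, wanted):
    """Subcolumns are not indexed, so the whole supercolumn is loaded first."""
    loaded = super_cf[row_key][super_column]
    result = {name: loaded[name] for name in wanted}
    return result, len(loaded)


def read_from_cf(cf, row_key, wanted):
    """Assumption: a standard column family can fetch named columns directly."""
    row = cf[row_key]
    result = {name: row[name] for name in wanted}
    return result, len(result)


wanted = [f"post{i}" for i in range(10)]
_, touched_super = read_from_super("user1", "MyPostsIdKeysList", wanted)
_, touched_cf = read_from_cf(posts_cf, "user1", wanted)
print(touched_super, touched_cf)
```

The gap between the two counts grows with the size of the list, which is the case I am worried about for something like MyPostsIdKeysList. Real read cost also depends on how many SSTables the row is spread across, which this sketch does not model.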
On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta <rajkumar....@gmail.com> wrote:
> The fact that subcolumns inside the supercolumns aren't indexed
> currently may suck here, whenever a small no. (10-20) of subcolumns
> need to be retrieved from a large list of subcolumns of a supercolumn
> like MyPostsIdKeysList.
>
> On Fri, Jan 7, 2011 at 9:58 PM, Raj <rajkumar....@gmail.com> wrote:
>> My question is in context of a social network schema design
>>
>> I am thinking of following schema for storing a user's data that is
>> required as he logs in & is led to his homepage:-
>> (I aimed at a schema design such that through a single row read query
>> all the data that would be required to put up the homepage of that
>> user, is retrieved.)
>>
>> UserSuperColumnFamily: { // Column Family
>>
>>   UserIDKey:
>>   {columns: MyName, MyEmail, MyCity,...etc
>>    supercolumns: MyFollowersList, MyFollowiesList, MyPostsIdKeysList,
>>    MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList,
>>    RecentNotificationsForUserList, MessagesReceivedList,
>>    MessagesSentList, AccountSettingsList, RecentSelfActivityList,
>>    UpdatesFromFollowiesList
>>   }
>> }
>>
>> Thus user's newsfeed would be generated using superColumn:
>> UpdatesFromFollowiesList. But the UpdatesFromFollowiesList would
>> obviously contain only Id of the posts and not the entire post data.
>>
>> Questions:
>>
>> 1.) What could be the problems with this design, any improvements ?
>>
>> 2.) Would frequent & heavy overwrite operations/ row mutations (for
>> example; when propagating the post updates for news-feed from some
>> user to all his followies) which leads to rows ultimately being in
>> several SSTables, will lead to degraded read performance ?? Is it
>> suitable to use row Cache (too big row but all data required until user
>> is logged in) If I do not use cache, it may be very expensive to pull
>> the row each time a data is required for the given user since row
>> would be in several sstables. How can I improve the
>> read performance here
>>
>> The actual data of the posts from network would be retrieved using
>> PostIdKey through subsequent read queries from columnFamily
>> PostsSuperColumnFamily which would be like follows:
>>
>> PostsSuperColumnFamily:{
>>
>>   PostIdKey:
>>   {
>>    columns: PostOwnerId, PostBody
>>    supercolumns: TagsForPost {list of columns of all tags for the
>>    post}, PeopleWhoLikedThisPost {list of columns of UserIdKey of all the
>>    likers}
>>   }
>> }
>>
>> Is this the best design to go with or are there any issues to consider
>> here ? Thanks in anticipation of your valuable comments.!
>>
>