Do we also need to consider the client API? If we don't adjust Thrift, the client just gets bytes back, right? The client is then on its own to marshal them back into a structure. In that case, it seems like we would want to choose a standard that is efficient and for which there are common libraries. Protobuf seems to fit the bill here.
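For example, something like the following is what I'm picturing on the client side. This is just an untested sketch over the raw Thrift API; the protobuf-generated UserDocument message and the "Documents" column family are made-up names:

    // Untested sketch. Assumes: a protobuf-generated UserDocument class,
    // a "Documents" column family, and an open Thrift Cassandra.Client
    // already set to the target keyspace.
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import org.apache.cassandra.thrift.*;

    public class DocumentBlobSketch {
        private static final Charset UTF8 = Charset.forName("UTF-8");

        // Write: serialize the whole document into a single column value.
        static void put(Cassandra.Client client, ByteBuffer key,
                        UserDocument doc) throws Exception {
            Column col = new Column();
            col.setName(ByteBuffer.wrap("doc".getBytes(UTF8)));
            col.setValue(ByteBuffer.wrap(doc.toByteArray())); // protobuf -> bytes
            col.setTimestamp(System.currentTimeMillis() * 1000); // microseconds
            client.insert(key, new ColumnParent("Documents"), col,
                          ConsistencyLevel.QUORUM);
        }

        // Read: fetch the single column and let protobuf marshal it back.
        static UserDocument get(Cassandra.Client client, ByteBuffer key)
                throws Exception {
            ColumnPath path = new ColumnPath("Documents");
            path.setColumn(ByteBuffer.wrap("doc".getBytes(UTF8)));
            ColumnOrSuperColumn cosc =
                    client.get(key, path, ConsistencyLevel.QUORUM);
            return UserDocument.parseFrom(cosc.getColumn().getValue());
        }
    }

The point being that Cassandra only ever sees opaque bytes; protobuf does all the marshalling on the client.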
Or do we pass back some other structure? (Native lists/maps? JSON
strings?) Do we ignore sorting/comparators? (Similar to Solr, I'm not
sure people have defined a good sort for multi-valued items.)

-brian

----
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/30/12 12:01 PM, "Daniel Doubleday" <daniel.double...@gmx.net> wrote:

>> Just telling C* to store a byte[] *will* be slightly lighter-weight
>> than giving it named columns, but we're talking negligible compared to
>> the overhead of actually moving the data on or off disk in the first
>> place.
>
>Hm - but isn't this exactly the point? You don't want to move data off
>disk. But decomposing into columns will lead to more of that:
>
>- The total amount of serialized data is (in most cases a lot) larger
>than the protobuffed / compressed version.
>- If you do selective updates, the document will be scattered over
>multiple SSTables, and sliced reads can't be optimized. The
>single-column version, by contrast, automatically supersedes older
>versions when updated, so most reads will hit only one SSTable.
>
>All these reads make up the hot dataset. If it fits in the page cache,
>you're fine. If it doesn't, you need to buy more iron.
>
>I really could not resist, because your statement seems to be contrary
>to all our tests / learnings.
>
>Cheers,
>Daniel
>
>From dev list:
>
>Re: Document storage
>On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <d...@venarc.com> wrote:
>>> I think this is a much better approach because that gives you the
>>> ability to update or retrieve just parts of objects efficiently,
>>> rather than making column values just blobs with a bunch of special
>>> case logic to introspect them. Which feels like a big step backwards
>>> to me.
>>
>> Unless your access pattern involves reading/writing the whole document
>> each time. In that case you're better off serializing the whole
>> document and storing it in a column as a byte[] without incurring the
>> overhead of column indexes. Right?
>
>Hmm, not sure what you're thinking of there.
>
>If you mean the "index" that's part of the row header for random
>access within a row, then no, serializing to byte[] doesn't save you
>anything.
>
>If you mean secondary indexes, don't declare any if you don't want any. :)
>
>Just telling C* to store a byte[] *will* be slightly lighter-weight
>than giving it named columns, but we're talking negligible compared to
>the overhead of actually moving the data on or off disk in the first
>place. Not even close to being worth giving up being able to deal
>with your data from standard tools like cqlsh, IMO.
>
>--
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of DataStax, the source for professional Cassandra support
>http://www.datastax.com
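To make Daniel's trade-off concrete, here is a rough, untested sketch of the decomposed layout over the same Thrift API (the "Users" column family and the "email" field are made-up names). A selective update rewrites a single column, but successive updates land in different SSTables until compaction merges them, so a full-document slice may have to consult several SSTables, whereas the single-blob version above typically reads one column from one SSTable:

    // Rough sketch of the decomposed layout: one column per document
    // field. Column family "Users" and field name "email" are made up.
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public class DecomposedDocSketch {
        private static final Charset UTF8 = Charset.forName("UTF-8");

        // Selective update: rewrites only one field, but each update lands
        // in a new memtable/SSTable, so the document's columns scatter
        // across SSTables until compaction merges them.
        static void updateEmail(Cassandra.Client client, ByteBuffer key,
                                String newEmail) throws Exception {
            Column col = new Column();
            col.setName(ByteBuffer.wrap("email".getBytes(UTF8)));
            col.setValue(ByteBuffer.wrap(newEmail.getBytes(UTF8)));
            col.setTimestamp(System.currentTimeMillis() * 1000);
            client.insert(key, new ColumnParent("Users"), col,
                          ConsistencyLevel.QUORUM);
        }

        // Full-document read: a slice over the row may have to merge
        // columns from several SSTables; the single-blob version usually
        // reads just one column from one SSTable.
        static List<ColumnOrSuperColumn> readDocument(Cassandra.Client client,
                                                      ByteBuffer key)
                throws Exception {
            SlicePredicate pred = new SlicePredicate();
            pred.setSlice_range(new SliceRange(
                    ByteBuffer.wrap(new byte[0]), // start of row
                    ByteBuffer.wrap(new byte[0]), // end of row
                    false,                        // not reversed
                    100));                        // column limit
            return client.get_slice(key, new ColumnParent("Users"), pred,
                                    ConsistencyLevel.QUORUM);
        }
    }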