On Mon, Feb 7, 2011 at 5:40 AM, Aditya Narayan <ady...@gmail.com> wrote:
> Thanks for the detailed explanation Peter! Definitely cleared my doubts!
>
> On Mon, Feb 7, 2011 at 1:52 PM, Peter Schuller
> <peter.schul...@infidyne.com> wrote:
>>> Does huge variation in the no. of columns in rows, across the column
>>> family, have *any* impact on performance?
>>>
>>> Can I have just 100 columns in some rows and hundreds of thousands of
>>> columns in another set of rows, without any downsides?
>>
>> If I interpret your question the way I think you mean it, then no:
>> Cassandra doesn't "do" anything with the data such that the smaller
>> rows are somehow directly less efficient because other rows are
>> bigger. It doesn't affect the on-disk format or the on-disk
>> efficiency of accessing the rows.
>>
>> However, there are almost always indirect effects when it comes to
>> performance, and in particular in storage systems. In the case of
>> Cassandra, the *variation* itself should not impose a direct
>> performance penalty, but there are other potential effects. For
>> example, the row cache is only useful for small rows, so if you are
>> looking to use the row cache, the huge rows would perhaps prevent
>> that. This could be interpreted as a performance impact on the
>> smaller rows by the larger rows... Compaction may become more
>> expensive due to e.g. additional GC pressure resulting from
>> large-but-still-within-memory-limits rows being compacted (or not,
>> depending on JVM/GC settings). There is also the effect of cache
>> locality as the data set grows, and the cache locality for the
>> smaller rows will likely be worse than had they been in e.g. a
>> separate CF.
>>
>> Those are just three random examples; I'm just trying to make the
>> point that "without any downsides" is a very strong and blanket
>> requirement for making the decision to mix small rows with larger
>> ones.
>>
>> --
>> / Peter Schuller
>>
>
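To make Peter's "separate CF" point concrete, here is a rough sketch against the 0.7 Thrift API. The keyspace/CF names and the cache size are invented for illustration, and you should check the CfDef attributes available in your version; the idea is just that the row cache is a per-CF setting, so small hot rows can get their own CF where caching whole rows is cheap:

    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class SeparateCfSketch {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            // "MyKeyspace", "SmallRows", "LargeRows" are made-up names.
            client.set_keyspace("MyKeyspace");

            // Small, frequently-read rows get their own CF so the row
            // cache stays useful for them.
            CfDef smallRows = new CfDef("MyKeyspace", "SmallRows");
            smallRows.setRow_cache_size(100000);

            // Huge rows live elsewhere, with the row cache left off so
            // they cannot evict the entries that actually benefit from it.
            CfDef largeRows = new CfDef("MyKeyspace", "LargeRows");
            largeRows.setRow_cache_size(0);

            client.system_add_column_family(smallRows);
            client.system_add_column_family(largeRows);
            transport.close();
        }
    }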
The performance could be variable if you are using operations such as get_slice with a large SlicePredicate: large rows take longer to deserialize and transfer than smaller ones. I have never benchmarked this, but it would probably take a significant difference in row size before it had a noticeable impact.
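For example, here is a get_slice sketch against the 0.7 Thrift API (keyspace, CF, and row key are again invented). The count on the SliceRange is what keeps a read of a very wide row from deserializing and shipping the whole row at once:

    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class SliceSketch {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport =
                    new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("MyKeyspace");  // hypothetical keyspace

            // Empty start/finish means "from the first column onward";
            // count caps how many columns come back, so a wide row is read
            // in modest pages rather than in one giant response.
            SliceRange range = new SliceRange(
                    ByteBuffer.wrap(new byte[0]),  // start: beginning of row
                    ByteBuffer.wrap(new byte[0]),  // finish: end of row
                    false,                         // not reversed
                    100);                          // page size
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(range);

            List<ColumnOrSuperColumn> columns = client.get_slice(
                    ByteBuffer.wrap("some-row-key".getBytes("UTF-8")),
                    new ColumnParent("WideRows"),  // hypothetical wide-row CF
                    predicate,
                    ConsistencyLevel.ONE);

            System.out.println("got " + columns.size() + " columns");
            transport.close();
        }
    }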